US20120102367A1 - Scalable Prediction Failure Analysis For Memory Used In Modern Computers

Scalable Prediction Failure Analysis For Memory Used In Modern Computers

Info

Publication number
US20120102367A1
Authority
US
United States
Prior art keywords
memory
program instructions
single bit
computer readable
error value
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Abandoned
Application number
US12/912,735
Inventor
Tu T. Dang
Michael C. Elles
Juan Q. Hernandez
Dwayne A. Lowe
Challis L. Purrington
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
International Business Machines Corp
Original Assignee
International Business Machines Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by International Business Machines Corp
Priority to US12/912,735 (US20120102367A1)
Assigned to INTERNATIONAL BUSINESS MACHINES CORPORATION. ASSIGNMENT OF ASSIGNORS INTEREST (SEE DOCUMENT FOR DETAILS). Assignors: ELLIS, MICHAEL C., LOWE, DWAYNE A., PURRINGTON, CHALLIS L., HERNANDEZ, JUAN Q., DANG, TU T.
Publication of US20120102367A1
Priority to US14/011,222 (US9196383B2)
Priority to US14/823,384 (US20150347211A1)
Legal status: Abandoned

Classifications

    • GPHYSICS
    • G11INFORMATION STORAGE
    • G11CSTATIC STORES
    • G11C29/00Checking stores for correct operation ; Subsequent repair; Testing stores during standby or offline operation
    • G11C29/04Detection or location of defective memory elements, e.g. cell construction details, timing of test signals
    • G11C29/08Functional testing, e.g. testing during refresh, power-on self testing [POST] or distributed testing
    • G11C29/12Built-in arrangements for testing, e.g. built-in self testing [BIST] or interconnection details
    • G11C29/38Response verification devices
    • G11C29/42Response verification devices using error correcting codes [ECC] or parity check
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/07Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/08Error detection or correction by redundancy in data representation, e.g. by using checking codes
    • G06F11/10Adding special bits or symbols to the coded information, e.g. parity check, casting out 9's or 11's
    • G06F11/1008Adding special bits or symbols to the coded information, e.g. parity check, casting out 9's or 11's in individual solid state devices
    • G06F11/1048Adding special bits or symbols to the coded information, e.g. parity check, casting out 9's or 11's in individual solid state devices using arrangements adapted for a specific error detection or correction feature
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/07Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/0706Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation the processing taking place on a specific hardware platform or in a specific software environment
    • G06F11/0727Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation the processing taking place on a specific hardware platform or in a specific software environment in a storage system, e.g. in a DASD or network based storage system
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/07Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/0751Error or fault detection not based on redundancy
    • G06F11/0754Error or fault detection not based on redundancy by exceeding limits
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/07Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/0751Error or fault detection not based on redundancy
    • G06F11/0754Error or fault detection not based on redundancy by exceeding limits
    • G06F11/076Error or fault detection not based on redundancy by exceeding limits by exceeding a count or rate limit, e.g. word- or bit count limit
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/07Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/079Root cause analysis, i.e. error or fault diagnosis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601Interfaces specially adapted for storage systems
    • G06F3/0602Interfaces specially adapted for storage systems specifically adapted to achieve a particular effect
    • G06F3/0614Improving the reliability of storage systems
    • G06F3/0619Improving the reliability of storage systems in relation to data integrity, e.g. data losses, bit errors
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601Interfaces specially adapted for storage systems
    • G06F3/0628Interfaces specially adapted for storage systems making use of a particular technique
    • G06F3/0653Monitoring storage devices or systems
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601Interfaces specially adapted for storage systems
    • G06F3/0668Interfaces specially adapted for storage systems adopting a particular infrastructure
    • G06F3/0671In-line storage system
    • G06F3/0673Single storage device
    • GPHYSICS
    • G11INFORMATION STORAGE
    • G11CSTATIC STORES
    • G11C29/00Checking stores for correct operation ; Subsequent repair; Testing stores during standby or offline operation
    • G11C29/04Detection or location of defective memory elements, e.g. cell construction details, timing of test signals
    • G11C29/50Marginal testing, e.g. race, voltage or current testing
    • G11C29/50004Marginal testing, e.g. race, voltage or current testing of threshold voltage
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F2201/00Indexing scheme relating to error detection, to error correction, and to monitoring
    • G06F2201/81Threshold
    • GPHYSICS
    • G11INFORMATION STORAGE
    • G11CSTATIC STORES
    • G11C29/00Checking stores for correct operation ; Subsequent repair; Testing stores during standby or offline operation
    • G11C29/04Detection or location of defective memory elements, e.g. cell construction details, timing of test signals
    • G11C2029/0409Online test

Definitions

  • Embodiments of the method may include gathering memory information for memory on a user computer system having at least one processor. Further, the method includes selecting one or more memory-related parameters from a plurality of such parameters. Further still, the method includes calculating, based on the gathering and the selecting, a single bit error value for the scalable predictive failure analysis through calculations for each of the one or more memory-related parameters that utilize the memory information. Yet further, the method includes setting, based on the calculating, the single bit error value for the user computer system.
  • the computer program product includes a computer readable storage device. Further, the computer program product includes first program instructions to gather memory information for memory on a user computer system having at least one processor. Further still, the computer program product includes second program instructions to select one or more memory-related parameters. Yet further, the computer program product includes third program instructions to calculate, based on the gather and the select (i.e., performing the instructions to gather and to select), a single bit error value for the scalable predictive failure analysis through calculations for each of the one or more memory-related parameters that utilize the memory information.
  • the computer program product includes fourth program instructions to set, based on the calculate (i.e., performing the instructions to calculate), the single bit error value for the user computer system, wherein the first, second, third, and fourth program instructions are stored on the computer readable storage device.
  • the system includes a processor, a computer readable memory and a computer readable storage device. Further, the system includes first program instructions to gather memory information for memory on a user computer system having at least one processor, wherein the memory may be the same, part of or different from the computer readable memory. Further still, the system includes second program instructions to select one or more memory-related parameters. Yet further, the system includes third program instructions to calculate, based on the gather and the select, a single bit error value for the scalable predictive failure analysis through calculations for each of the one or more memory-related parameters that utilize the memory information. Further still, the system includes fourth program instructions to select, based on the calculate, the single bit error value for the user computer system. The first, second, third, and fourth program instructions of the system are stored on the computer readable storage device for execution by the processor via the computer readable memory.
  • FIG. 1 depicts an example embodiment of a system for scalable predictive failure analysis in accordance with this disclosure.
  • FIG. 2 depicts a block diagram of an example embodiment of a computer system suitable for scalable predictive failure analysis, such as a user computer system.
  • FIG. 3 depicts an example embodiment of a flowchart to show a method for scalable predictive failure analysis in accordance with this disclosure.
  • FIG. 4 depicts another diagram of an example embodiment of a computer system suitable for scalable predictive failure analysis, such as a user computer system.
  • Embodiments include gathering, for a user computer system, memory information, such as memory size, synchronous dynamic random access memory (SDRAM) technology on the module, module packaging, memory failure mode and vendor quality. Calculation of the SBE value ensues through combining calculation(s) for each of the selected memory-related parameters, wherein the selecting optionally occurs subsequent or prior to the gathering. The calculated SBE value is set and valid for the user computer system until powering down or changing memory components in the user computer system. Accordingly, the SBE value is scalable because the value is determined for the particular user computer system—not simply a fixed, generic value.
  • Alerts, whether audible or visible, may occur based on comparing counted SBEs to the scalable SBE value.
  • the alerts provide credible predictive failure analysis to avert system memory failures while incorporating the realities of the unique complexities for the particular user computer system.
  • routines executed to implement the embodiments of the invention may be part of a specific application, component, program, module, object, or sequence of instructions.
  • the computer program of the present invention typically is comprised of a multitude of instructions that will be translated by the native computer into a machine-readable format and hence executable instructions.
  • programs are comprised of variables and data structures that either reside locally to the program or are found in memory or on storage devices.
  • various programs described herein may be identified based upon the application for which they are implemented in a specific embodiment of the invention. However, it should be appreciated that any particular program nomenclature herein is used merely for convenience, and thus the invention should not be limited to use solely in any specific application identified and/or implied by such nomenclature.
  • embodiments of the present invention may advantageously be implemented with other substantially equivalent hardware, software systems, manual operations, or any combination of any or all of these.
  • the invention can take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment containing both hardware and software elements.
  • the invention is implemented in software, which includes but is not limited to firmware, resident software, microcode, etc.
  • embodiments of the invention may also be implemented via parallel processing using a parallel computing architecture, such as one using multiple discrete systems (e.g., plurality of computers, etc.) or an internal multiprocessing architecture (e.g., a single system with parallel processing capabilities).
  • the computer readable medium may be a computer readable signal medium or a computer readable storage medium.
  • a computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing.
  • a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.
  • a computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof.
  • a computer readable signal medium may be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
  • Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
  • Computer program code for carrying out operations for aspects of the present invention may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, Smalltalk, C++ or the like and conventional procedural programming languages, such as the “C” programming language or similar programming languages.
  • the program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server.
  • the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet Service Provider).
  • Examples of Internet Service Providers include AT&T, MCI, Sprint, EarthLink, MSN, and GTE.
  • aspects of embodiments of the invention described herein may be stored or distributed on computer-readable medium as well as distributed electronically over the Internet or over other networks, including wireless networks. Data structures and transmission of data (including wireless transmission) particular to aspects of the invention are also encompassed within the scope of the invention.
  • the invention can take the form of a computer program product accessible from a computer-readable medium providing program code for use by or in connection with a computer or any instruction execution system.
  • a computer-usable or computer readable medium can be any apparatus that can contain, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device.
  • the medium may be an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system (or apparatus or device) or a propagation medium.
  • Examples of a computer-readable medium include a semiconductor or solid state memory, magnetic tape, a removable computer diskette, a random access memory (RAM), a read-only memory (ROM), a rigid magnetic disk and an optical disk.
  • Current examples of optical disks include compact disk-read only memory (CD-ROM), compact disk-read/write (CD-R/W) and DVD.
  • a data processing system suitable for storing and/or executing program code may include at least one processor coupled directly or indirectly to memory elements through a system bus.
  • the memory elements may include local memory employed during execution of the program code, bulk storage, and cache memories which provide temporary storage of at least some program code in order to reduce the number of times code must be retrieved from bulk storage during execution.
  • Input/output (I/O) devices, including but not limited to keyboards, displays, and pointing devices, can be coupled to the system either directly or through intervening I/O controllers.
  • Network adapters may also be coupled to the system to enable the data processing system to become coupled to other data processing systems or remote printers or storage devices through intervening private or public networks, including wireless networks.
  • Modems, cable modems and Ethernet cards are just a few of the currently available types of network adapters.
  • FIG. 1 depicts a user computer system 100 having a collection of cooperating, algorithmic modules for SPFA calculations.
  • the enabling logic for modules 110 , 115 , 120 , 130 , 140 , 145 is reduced to software and/or hardware.
  • the modules 110 , 115 , 120 , 130 , 140 , 145 are located, for example, within the operating system of a user computer system 100 .
  • any of the modules 110 , 115 , 120 , 130 , 140 , 145 may be located remotely but in network communication with the user computer system 100 .
  • An example of remote location may have some of the modules 110, 115, 120, 130, 140, 145 located on other computer systems, with manipulations and calculations of the generated data being the subject of a Web service.
  • the system 100 has accessible logic to gather memory information for memory 105 on the user computer system 100 .
  • the gathering module 110 gathers memory information, such as memory size, synchronous dynamic random access memory (SDRAM) technology on the module, module packaging, memory failure mode and vendor quality, for memory 105 under test on the particular user computer system 100.
  • memory information for memory 105 could be a module size of 2 GB for a single-rank dual in-line memory module (DIMM).
  • the system 100 also includes logic, denominated as a configuration module 120 in FIG. 1 , for selecting one or more memory-related parameters from a plurality of such parameters.
  • a user or administrator, for example, of the user computer system 100 selects which memory-related parameters to include in the SPFA calculations. The selecting may occur through textual entry, radial selection, or other method for selecting options through a display coupled to the user computer system 100 .
  • the selected memory-related parameters themselves directly correlate to memory information. That is, memory information regarding memory size correlates to the memory-related parameter for memory size, memory information regarding module packaging correlates to the memory-related parameter for module packaging, and so forth.
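The one-to-one correspondence between gathered memory information and selectable parameters can be sketched as follows. This is a minimal illustration only: the `MemoryInfo` type, its field names, and `select_parameters` are hypothetical names introduced here, not identifiers from the disclosure.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class MemoryInfo:
    """Gathered memory information (hypothetical field names)."""
    memory_size_gb: int
    sdram_technology: str   # e.g. "single-rank"
    module_packaging: str   # e.g. "x8, no Chipkill"
    failure_mode: str
    vendor_quality: str


def select_parameters(info: MemoryInfo, selected: set) -> dict:
    """Keep only the gathered fields whose names were selected."""
    gathered = asdict(info)
    unknown = set(selected) - set(gathered)
    if unknown:
        raise ValueError(f"unknown parameters: {sorted(unknown)}")
    return {name: value for name, value in gathered.items() if name in selected}
```

Each selected parameter name maps directly onto one gathered field, which mirrors the correlation described above.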
  • the calculation module 130 includes logic to calculate a combination of the selected memory-related parameters.
  • the SPFA uses the selected number of memory-related parameters, which one considers critical to maintain a functioning memory subsystem, in order to calculate the SBE value.
  • the setting module 140 sets the calculated SBE value for the system 100 . Evaluation of exemplary memory-related parameters and combination of the same for calculation of the SBE value now ensues.
  • Memory module size is a memory-related parameter for possible inclusion in the SPFA calculation for the memory 105 .
  • the following exemplary scale is provided for a correctable SBE value based on the actual capacity of each module or module-pairs installed in the system:
  • the memory-based parameter for memory module size would allow 256 SBEs for a 2 GB DIMM, 512 SBEs for a 4 GB DIMM, 1024 SBEs for an 8 GB DIMM, 2048 SBEs for a 16 GB DIMM, and 4096 SBEs for a 32 GB DIMM before a memory failure is signaled by a visual and/or audio alert through use of the detection and comparison modules 115, 145.
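The exemplary module-size scale quoted above can be captured as a simple lookup. The sketch below is illustrative; the names `SIZE_SCALE_SBES` and `size_sbe_allowance` are assumptions, and only the five capacity/allowance pairs come from the text.

```python
# Exemplary scale from the text: module capacity in GB -> allowed correctable SBEs.
SIZE_SCALE_SBES = {2: 256, 4: 512, 8: 1024, 16: 2048, 32: 4096}


def size_sbe_allowance(capacity_gb: int) -> int:
    """Return the exemplary SBE allowance for a DIMM of the given capacity."""
    try:
        return SIZE_SCALE_SBES[capacity_gb]
    except KeyError:
        # The text lists no value for other capacities, so none is guessed here.
        raise ValueError(f"no exemplary allowance listed for {capacity_gb} GB") from None
```

In these exemplary values the allowance doubles with capacity, which works out to 128 SBEs per GB for every listed module size.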
  • Another memory-related parameter for possible inclusion in the calculation of the SBE value is the SDRAM technology on the memory module 105.
  • the following exemplary scale is provided:
  • the memory-based parameter for SDRAM technology would allow 1024 SBEs for a single-rank DIMM, 823 SBEs for a dual-rank DIMM, and 640 SBEs for a quad-rank DIMM before alerting the user or another system in network communication with the system 100 of memory failure of a module or other memory device needing repair or replacement, whereupon the latter at least suggests a new SBE value should be re-set by re-calculation.
  • Still another memory-related parameter for inclusion in the calculation of the SBE value is module packaging of the memory 105 on the particular user computer system 100 .
  • the following exemplary scale is provided:
  • the memory-based parameter regarding Chipkill™ would allow 256 SBEs for an x8 DIMM with no Chipkill™, 512 SBEs for an x8 DIMM with Chipkill™, and 640 SBEs for an x4 DIMM with Chipkill™.
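The SDRAM-technology and module-packaging scales quoted above can be encoded the same way. This is a sketch under stated assumptions: the key spellings and function names are hypothetical, and the numeric values are copied as given in the text.

```python
# Exemplary SBE allowances by DIMM rank, as quoted in the text.
RANK_SCALE_SBES = {"single-rank": 1024, "dual-rank": 823, "quad-rank": 640}

# Exemplary SBE allowances by (device width, Chipkill enabled), as quoted in the text.
PACKAGING_SCALE_SBES = {
    ("x8", False): 256,
    ("x8", True): 512,
    ("x4", True): 640,
}


def rank_sbe_allowance(rank: str) -> int:
    """Look up the exemplary allowance for a rank such as "dual-rank"."""
    return RANK_SCALE_SBES[rank]


def packaging_sbe_allowance(width: str, chipkill: bool) -> int:
    """Look up the exemplary allowance for a packaging/Chipkill combination."""
    return PACKAGING_SCALE_SBES[(width, chipkill)]
```

Combinations that the text does not list, such as an x4 DIMM without Chipkill, are deliberately left out and raise `KeyError`.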
  • Yet another memory-related parameter for optional inclusion in the calculation of the SBE value is memory failure mode of the memory 105 on the particular user computer system 100 .
  • this memory-related parameter regards single count reduction for a single memory address. That is, a correctable SBE that occurs repeatedly at the same memory address on memory 105 DIMM is counted as one failure instead of counting the repeats as multiple failures.
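The single count reduction described above amounts to counting distinct failing addresses rather than raw error events. A minimal sketch, assuming errors are reported as a sequence of integer addresses (the function name is hypothetical):

```python
from typing import Iterable


def count_sbes(error_addresses: Iterable[int]) -> int:
    """Count correctable SBEs, treating repeats at one address as one failure."""
    return len(set(error_addresses))
```

For example, three errors at address `0x1000` and one at `0x2000` count as two failures, not four.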
  • Another example of a memory-related parameter for optional inclusion in the calculation of the SBE value is vendor quality of the memory 105 on the particular user computer system 100 .
  • the following exemplary scale is provided:
  • Table 4 represents a memory vendor quality/reliability matrix on a per product basis.
  • A memory vendor can have multiple products, each of which could have a different quality/reliability rating.
  • A quality scale rating, such as Table 4, may be used for calculating the SBE value.
  • a memory 105 DIMM from a lower quality score supplier yields a lower PFA threshold value for this memory-related parameter.
  • a lower quality score would require replacement or repair sooner as compared to a higher quality score provided all other contributing PFA memory-related parameters to the SBE value are constant.
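The relationship between quality score and threshold can be illustrated with a monotonic scaling. Everything numeric here is an assumption: the values of Table 4 are not available in this text, so the 1-to-5 score range, the linear scaling, and the name `scale_by_quality` are hypothetical choices that only preserve the stated property that a lower score yields a lower threshold.

```python
def scale_by_quality(base_threshold: int, quality_score: int, max_score: int = 5) -> int:
    """Scale a PFA threshold down for lower quality scores.

    Hypothetical linear rule: a product at max_score keeps the full base
    threshold; lower scores keep a proportional share of it.
    """
    if not 1 <= quality_score <= max_score:
        raise ValueError("quality_score must be between 1 and max_score")
    return base_threshold * quality_score // max_score
```

With all other parameters held constant, a module scored 3 would reach its threshold, and so be flagged for repair or replacement, sooner than one scored 5.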
  • combination of the selected memory-related parameters may be through simple addition, multiplication, a mixture of the two, or any other combination method so as to yield a reliable, relative, and meaningful SBE value for SPFA.
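Two of the combination methods named above, addition and multiplication, can be sketched as follows. The function name `combine` and its `method` argument are hypothetical; the sample inputs reuse exemplary per-parameter allowances from the text.

```python
import math
from typing import Sequence


def combine(values: Sequence[int], method: str = "sum") -> int:
    """Combine per-parameter SBE allowances into a single SBE value."""
    if not values:
        raise ValueError("at least one selected parameter is required")
    if method == "sum":
        return sum(values)
    if method == "product":
        return math.prod(values)
    raise ValueError(f"unsupported combination method: {method}")
```

For a 2 GB (256), single-rank (1024), x8 DIMM without Chipkill (256), addition gives 1536 while multiplication gives 67108864. The large spread between the two illustrates why the choice of method and scale is left to the system designers.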
  • the value of each memory-related PFA threshold and time window(s) should be defined by the subject matter expert on the system design team. That is, the illustrative tables provided herein are neither the sole nor necessarily appropriate values to use because the same are solely intended as examples.
  • this disclosure enables a selectable and scalable PFA for memory 105 that thwarts consequences of memory failures for a particular user computer system 100 .
  • FIG. 2 depicts a block diagram of one embodiment of a computer system 200 suitable for use in scalable predictive failure analysis.
  • Other configurations of the computer system 200 are possible, including a computer having capabilities other than those ascribed herein and possibly beyond those capabilities, and they may, in other embodiments, be any combination of processing devices such as workstations, servers, mainframe computers, notebook or laptop computers, desktop computers, PDAs, mobile phones, wireless devices, set-top boxes, or the like.
  • At least certain of the components of computer system 200 may be mounted on a multi-layer planar or motherboard (which may itself be mounted on the chassis) to provide a means for electrically interconnecting the components of the computer system 200 .
  • the computer system 200 includes a processor 202 , storage 204 , memory 206 , a user interface adapter 208 , and a display adapter 210 connected to a bus 212 or other interconnect.
  • the bus 212 facilitates communication between the processor 202 and other components of the computer system 200 , as well as communication between components.
  • Processor 202 may include one or more system central processing units (CPUs) or processors to execute instructions, such as an IBM® PowerPC® processor, an Intel® Pentium® processor, an Advanced Micro Devices, Inc. processor or any other suitable processor.
  • IBM and PowerPC are trademarks of International Business Machines Corporation, registered in many jurisdictions worldwide.
  • the processor 202 may utilize storage 204 , which may be non-volatile storage such as one or more hard drives, tape drives, diskette drives, CD-ROM drive, DVD-ROM drive, or the like.
  • the processor 202 may also be connected to memory 206 via bus 212 , such as via a memory controller hub (MCH).
  • System memory 206 may include volatile memory such as random access memory (RAM) or double data rate (DDR) synchronous dynamic random access memory (SDRAM).
  • a processor 202 may execute instructions to perform functions, such as by gathering memory information and selecting memory-related parameters for inclusion for SPFA calculations. Information before, during or after calculations may temporarily or permanently be stored in storage 204 or memory 206 .
  • Turning to FIG. 3, another aspect of scalable predictive failure analysis for memory associated with a particular user computer system is disclosed.
  • Flowchart 300 is for a system, such as system 100 , notably involving the logic associated with the detection and comparison modules 115 , 145 of FIG. 1 .
  • flowchart 300 starts 305 by the system detecting 310 SBEs on a DIMM via a system management interrupt (SMI).
  • An SMI is triggered to wake up the BIOS, or another BIOS implementation such as Unified Extensible Firmware Interface (UEFI), to check 320 the memory-related parameters and the SBE counts accumulated so far.
  • Decision block 330 queries whether the SBE count value is at least equal to the set SBE value. If yes 340, then the flowchart 300 issues 350 an SPFA alert and optionally provides repair actions, such as displaying a visual notice to replace the specific faulty memory module or suggesting reparative procedures.
  • If no 335, the flowchart 300 returns to sleep, at least until the next SBE is counted, because the count of SBEs for the particular user computer system is less than the set SBE value. Subsequent to issuing 350 the alert with optional actions, or following no 335, the flowchart ends 375.
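The detect, check, compare, and alert loop of flowchart 300 can be sketched in a few lines. This is not firmware code: the `SbeMonitor` class, the per-DIMM counting, and the alert callback are assumptions made for illustration, with the SMI modeled as a plain method call.

```python
from typing import Callable, Dict


class SbeMonitor:
    """Counts SBEs per DIMM and alerts once the set SBE value is reached."""

    def __init__(self, sbe_value: int, alert: Callable[[str], None] = print) -> None:
        self.sbe_value = sbe_value          # the value set for this system
        self.alert = alert
        self.counts: Dict[str, int] = {}    # accumulated SBE count per DIMM

    def on_sbe(self, dimm: str) -> bool:
        """Handle one detected SBE (310); return True if an alert was issued."""
        self.counts[dimm] = self.counts.get(dimm, 0) + 1      # check counts (320)
        if self.counts[dimm] >= self.sbe_value:               # decision block (330)
            self.alert(f"SPFA alert: repair or replace {dimm}")  # issue alert (350)
            return True
        return False  # below the set value: sleep until the next SBE
```

The comparison uses `>=` to match the "at least equal to" wording of decision block 330.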
  • FIG. 4 illustrates information handling system 401 which is a simplified example of a computer system, such as shown in FIG. 2 for use in scalable predictive failure analysis, and capable of performing the operations described herein.
  • Computer system 401 includes processor 400 which is coupled to host bus 405 .
  • a level two (L2) cache memory 410 is also coupled to the host bus 405 .
  • Host-to-PCI bridge 415 is coupled to main memory 420 , includes cache memory and main memory control functions, and provides bus control to handle transfers among PCI bus 425 , processor 400 , L2 cache 410 , main memory 420 , and host bus 405 .
  • PCI bus 425 provides an interface for a variety of devices including, for example, LAN card 430 .
  • PCI-to-ISA bridge 435 provides bus control to handle transfers between PCI bus 425 and ISA bus 440 , universal serial bus (USB) functionality 445 , IDE device functionality 450 , power management functionality 455 , and can include other functional elements not shown, such as a real-time clock (RTC), DMA control, interrupt support, and system management bus support.
  • RTC real-time clock
  • Peripheral devices and input/output (I/O) devices can be attached to various interfaces 460 (e.g., parallel interface 462 , serial interface 464 , infrared (IR) interface 466 , keyboard interface 468 , mouse interface 470 , fixed disk (HDD) 472 , removable storage device 474 ) coupled to ISA bus 440 .
  • interfaces 460 e.g., parallel interface 462 , serial interface 464 , infrared (IR) interface 466 , keyboard interface 468 , mouse interface 470 , fixed disk (HDD) 472 , removable storage device 474
  • IR infrared
  • HDD fixed disk
  • removable storage device 474 removable storage device
  • BIOS 480 is coupled to ISA bus 440 , and incorporates the necessary processor executable code for a variety of low-level system functions and system boot functions. BIOS 480 can be stored in any computer readable medium, including magnetic storage media, optical storage media, flash memory, random access memory, read only memory, and communications media conveying signals encoding the instructions (e.g., signals from a network).
  • LAN card 430 is coupled to PCI bus 425 and to PCI-to-ISA bridge 435 .
  • modem 475 is connected to serial port 464 and PCI-to-ISA Bridge 435 .
  • FIGS. 2 and 4 are capable of executing the disclosure described herein, these computer systems are simply examples of computer systems and user computer systems. Those skilled in the art will appreciate that many other computer system designs are capable of performing the disclosure described herein.
  • FIGS. 1 and 3 Another embodiment of the disclosure is implemented as a program product for use within a device such as, for example, those systems and methods depicted in FIGS. 1 and 3 .
  • the program(s) of the program product defines functions of the embodiments (including the methods described herein) and can be contained on a variety of media including but not limited to: (i) information permanently stored on non-volatile storage-type accessible media (e.g., write and readable as well as read-only memory devices within a computer such as ROM, flash memory, CD-ROM disks readable by a CD-ROM drive); (ii) alterable information stored on writable storage-type accessible media (e.g., readable floppy disks within a diskette drive or hard-disk drive); and (iii) information conveyed to a computer through a network.
  • non-volatile storage-type accessible media e.g., write and readable as well as read-only memory devices within a computer such as ROM, flash memory
  • the latter embodiment specifically includes information downloaded onto either permanent or even sheer momentary storage-type accessible media from the World Wide Web, an internet, and/or other networks, such as those known, discussed and/or explicitly referred to herein.
  • Such data-bearing media when carrying computer-readable instructions that direct the functions of the present disclosure, represent embodiments of the present disclosure.

Abstract

One embodiment provides a method for scalable predictive failure analysis. Embodiments of the method may include gathering memory information for memory on a user computer system having at least one processor. Further, the method includes selecting one or more memory-related parameters. Further still, the method includes calculating, based on the gathering and the selecting, a single bit error value for the scalable predictive failure analysis through calculations for each of the one or more memory-related parameters that utilize the memory information. Yet further, the method includes setting, based on the calculating, the single bit error value for the user computer system.

Description

    BACKGROUND
  • Correctable memory errors are becoming a major issue in today's modern personal computers, especially since supported memory sizes often reach terabytes instead of gigabytes. To that end, complex predictive failure analyses are desirable in order to anticipate and prevent mild to catastrophic system failures involving data loss and damage due to memory errors.
  • BRIEF SUMMARY
  • One embodiment provides a method for scalable predictive failure analysis. Embodiments of the method may include gathering memory information for memory on a user computer system having at least one processor. Further, the method includes selecting one or more memory-related parameters from a plurality. Further still, the method includes calculating, based on the gathering and the selecting, a single bit error value for the scalable predictive failure analysis through calculations for each of the one or more memory-related parameters that utilize the memory information. Yet further, the method includes setting, based on the calculating, the single bit error value for the user computer system.
  • Another embodiment provides a computer program product for scalable predictive failure analysis. The computer program product includes a computer readable storage device. Further, the computer program product includes first program instructions to gather memory information for memory on a user computer system having at least one processor. Further still, the computer program product includes second program instructions to select one or more memory-related parameters. Yet further, the computer program product includes third program instructions to calculate based on the gather and the select (i.e., performing the instructions to gather and to select), a single bit error value for the scalable predictive failure analysis through calculations for each of the one or more memory-related parameters that utilize the memory information. Still further, the computer program product includes fourth program instructions to set, based on the calculate (i.e., performing the instructions to calculate), the single bit error value for the user computer system, wherein the first, second, third, and fourth program instructions are stored on the computer readable storage device.
  • Another embodiment provides a system for scalable predictive failure analysis. The system includes a processor, a computer readable memory and a computer readable storage device. Further, the system includes first program instructions to gather memory information for memory on a user computer system having at least one processor, wherein the memory may be the same as, part of, or different from the computer readable memory. Further still, the system includes second program instructions to select one or more memory-related parameters. Yet further, the system includes third program instructions to calculate, based on the gather and the select, a single bit error value for the scalable predictive failure analysis through calculations for each of the one or more memory-related parameters that utilize the memory information. Further still, the system includes fourth program instructions to set, based on the calculate, the single bit error value for the user computer system. The first, second, third, and fourth program instructions of the system are stored on the computer readable storage device for execution by the processor via the computer readable memory.
  • BRIEF DESCRIPTION OF THE SEVERAL VIEWS OF THE DRAWINGS
  • So that the manner in which the above recited features, advantages and objects of the present disclosure are attained and can be understood in detail, a more particular description of this disclosure, briefly summarized above, may be had by reference to the embodiments thereof which are illustrated in the appended drawings.
  • It is to be noted, however, that the appended drawings illustrate only example embodiments of this disclosure, and, therefore, are not to be considered limiting of its scope, for this disclosure may admit to other equally effective embodiments.
  • FIG. 1 depicts an example embodiment of a system for scalable predictive failure analysis in accordance with this disclosure.
  • FIG. 2 depicts a block diagram of an example embodiment of a computer system suitable for scalable predictive failure analysis, such as a user computer system.
  • FIG. 3 depicts an example embodiment of a flowchart to show a method for scalable predictive failure analysis in accordance with this disclosure.
  • FIG. 4 depicts another diagram of an example embodiment of a computer system suitable for scalable predictive failure analysis, such as a user computer system.
  • DETAILED DESCRIPTION
  • The following is a detailed description of example embodiments with accompanying drawings. The example embodiments are in such detail as to communicate the invention. However, the amount of detail offered is not intended to limit the anticipated variations of embodiments; on the contrary, the intention is to cover all modifications, equivalents, and alternatives falling within the spirit and scope of the present invention as defined by the appended claims.
  • Generally speaking, systems, methods and media for scalable predictive failure analysis (SPFA) for single bit errors (SBE) in memory are disclosed. Embodiments include gathering, for a user computer system, memory information, such as memory size, synchronous dynamic random access memory (SDRAM) technology on the module, module packaging, memory failure mode and vendor quality. Calculation of the SBE value ensues through combining calculation(s) for each of the selected memory-related parameters, wherein the selecting optionally occurs subsequent or prior to the gathering. The calculated SBE value is set and valid for the user computer system until powering down or changing memory components in the user computer system. Accordingly, the SBE value is scalable because the value is determined for the particular user computer system—not simply a fixed, generic value. Alerts, whether audible or visible, may occur based on comparing counted SBEs to the scalable SBE value. The alerts provide credible predictive failure analysis to avert system memory failures while incorporating the realities of the unique complexities for the particular user computer system.
  • In general, the routines executed to implement the embodiments of the invention may be part of a specific application, component, program, module, object, or sequence of instructions. The computer program of the present invention typically is comprised of a multitude of instructions that will be translated by the native computer into a machine-readable format and hence executable instructions. Also, programs are comprised of variables and data structures that either reside locally to the program or are found in memory or on storage devices. In addition, various programs described herein may be identified based upon the application for which they are implemented in a specific embodiment of the invention. However, it should be appreciated that any particular program nomenclature herein is used merely for convenience, and thus the invention should not be limited to use solely in any specific application identified and/or implied by such nomenclature.
  • While specific embodiments will be described below with reference to particular configurations of hardware and/or software, those of skill in the art will realize that embodiments of the present invention may advantageously be implemented with other substantially equivalent hardware, software systems, manual operations, or any combination of any or all of these. The invention can take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment containing both hardware and software elements. In a preferred embodiment, the invention is implemented in software, which includes but is not limited to firmware, resident software, microcode, etc. Moreover, embodiments of the invention may also be implemented via parallel processing using a parallel computing architecture, such as one using multiple discrete systems (e.g., plurality of computers, etc.) or an internal multiprocessing architecture (e.g., a single system with parallel processing capabilities).
  • Any combination of one or more computer readable medium(s) may be utilized. The computer readable medium may be a computer readable signal medium or a computer readable storage medium. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium would include the following: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or Flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this document, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.
  • A computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
  • Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
  • Computer program code for carrying out operations for aspects of the present invention may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, Smalltalk, C++ or the like and conventional procedural programming languages, such as the “C” programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the latter scenario, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet Service Provider).
  • Aspects of embodiments of the invention described herein may be stored or distributed on computer-readable medium as well as distributed electronically over the Internet or over other networks, including wireless networks. Data structures and transmission of data (including wireless transmission) particular to aspects of the invention are also encompassed within the scope of the invention. Furthermore, the invention can take the form of a computer program product accessible from a computer-readable medium providing program code for use by or in connection with a computer or any instruction execution system. For the purposes of this description, a computer-usable or computer readable medium can be any apparatus that can contain, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device. The medium may be an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system (or apparatus or device) or a propagation medium. Examples of a computer-readable medium include a semiconductor or solid state memory, magnetic tape, a removable computer diskette, a random access memory (RAM), a read-only memory (ROM), a rigid magnetic disk and an optical disk. Current examples of optical disks include compact disk-read only memory (CD-ROM), compact disk-read/write (CD-R/W) and DVD.
  • Each software program described herein may be operated on any type of data processing system, such as a personal computer, server, etc. A data processing system suitable for storing and/or executing program code may include at least one processor coupled directly or indirectly to memory elements through a system bus. The memory elements may include local memory employed during execution of the program code, bulk storage, and cache memories which provide temporary storage of at least some program code in order to reduce the number of times code must be retrieved from bulk storage during execution. Input/output (I/O) devices (including but not limited to keyboards, displays, pointing devices, etc.) may be coupled to the system either directly or through intervening I/O controllers. Network adapters may also be coupled to the system to enable the data processing system to become coupled to other data processing systems or remote printers or storage devices through intervening private or public networks, including wireless networks. Modems, cable modems and Ethernet cards are just a few of the currently available types of network adapters.
  • Turning now to the drawings, FIG. 1 depicts a user computer system 100 having a collection of cooperating, algorithmic modules for SPFA calculations. The enabling logic for modules 110, 115, 120, 130, 140, 145 is reduced to software and/or hardware. The modules 110, 115, 120, 130, 140, 145 are located, for example, within the operating system of a user computer system 100. In alternative example embodiments, any of the modules 110, 115, 120, 130, 140, 145 may be located remotely but in network communication with the user computer system 100. For example, some of the modules 110, 115, 120, 130, 140, 145 may be located on other computer systems, with manipulations and calculations of the generated data being the subject of a Web service.
  • Regardless of individual logic location, the system 100 has accessible logic to gather memory information for memory 105 on the user computer system 100. The gathering module 110 gathers memory information, such as memory size, synchronous dynamic random access memory (SDRAM) technology on the module, module packaging, memory failure mode and vendor quality, for memory 105 under test on the particular user computer system 100. For example, memory information for memory 105 could be a module size of 2 GB for a single-rank dual in-line memory module (DIMM). Memory information is discussed further below together with the selected memory-related parameters.
  • The system 100 also includes logic, denominated as a configuration module 120 in FIG. 1, for selecting one or more memory-related parameters from a plurality of such parameters. A user or administrator, for example, of the user computer system 100 selects which memory-related parameters to include in the SPFA calculations. The selecting may occur through textual entry, radio-button selection, or another method for selecting options through a display coupled to the user computer system 100. The selected memory-related parameters, themselves, directly correlate to memory information. That is, memory information regarding memory size correlates to the memory-related parameter for memory size, memory information regarding module packaging correlates to the memory-related parameter for module packaging, and so forth.
  • In communication with both the gathering and configuration modules 110, 120, the calculation module 130 includes logic to calculate a combination of the selected memory-related parameters. The SPFA uses the selected number of memory-related parameters, which one considers critical to maintain a functioning memory subsystem, in order to calculate the SBE value. The setting module 140 then sets the calculated SBE value for the system 100. Evaluation of exemplary memory-related parameters and combination of the same for calculation of the SBE value now ensues.
  • Memory module size is a memory-related parameter for possible inclusion in the SPFA calculation for the memory 105. For such, the following exemplary scale is provided for a correctable SBE value based on the actual capacity of each module or module-pairs installed in the system:
  • TABLE 1
    Module Size   Scale Factor (n)   PFA threshold in time window
    2 GB          1                  x
    4 GB          2                  2x
    8 GB          4                  4x
    16 GB         8                  8x
    32 GB         16                 16x

    Referring to Table 1, and assuming x=256 SBEs as the baseline PFA count within a 24-hour window, a larger memory 105 DIMM logically permits more SBEs before meeting or exceeding a set SBE value, i.e., a threshold. For example, the memory-based parameter for memory module size would allow 256 SBEs for a 2 GB DIMM, 512 SBEs for a 4 GB DIMM, 1024 SBEs for an 8 GB DIMM, 2048 SBEs for a 16 GB DIMM, and 4096 SBEs for a 32 GB DIMM before a predicted memory failure is signaled by visual and/or audio alert through use of the detection and comparison modules 115, 145.
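The module-size scaling of Table 1 can be sketched in a few lines of Python. This is only an illustration, assuming the baseline x = 256 used in the example; the names `SIZE_SCALE_FACTOR` and `size_threshold` are hypothetical and do not appear in the disclosure.

```python
# Scale factors (n) from Table 1, keyed by module size in GB.
SIZE_SCALE_FACTOR = {2: 1, 4: 2, 8: 4, 16: 8, 32: 16}


def size_threshold(module_gb: int, baseline: int = 256) -> int:
    """Return the per-window SBE threshold n * x for one module size."""
    return SIZE_SCALE_FACTOR[module_gb] * baseline


# Print the threshold for every module size listed in Table 1.
for gb in sorted(SIZE_SCALE_FACTOR):
    print(f"{gb} GB DIMM: {size_threshold(gb)} SBEs")
```

A module size that is not listed in the table raises a `KeyError` here; how unlisted sizes should be handled is left open by the source.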
  • In addition to memory module size, another possibly selected memory-related parameter for inclusion in the calculation of the SBE value is SDRAM technology on the memory module 105. For such, the following exemplary scale is provided:
  • TABLE 2
    Number of Ranks   Scale Factor (m)   PFA threshold in time window
    1 (Single)        1                  y
    2 (Dual)          1.2                y/1.2
    4 (Quad)          1.6                y/1.6

    Referring to Table 2, and assuming y=1024 for a baseline PFA count within a 24-hour window, a memory 105 DIMM with fewer ranks permits a higher SBE value. For example, the memory-based parameter for SDRAM technology would allow 1024 SBEs for a single-rank DIMM, 853 SBEs (1024/1.2, rounded) for a dual-rank DIMM, and 640 SBEs for a quad-rank DIMM before alerting the user, or another system in network communication with the system 100, of memory failure of a module or other memory device needing repair or replacement. Replacement at least suggests that a new SBE value should be set by re-calculation.
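Unlike Table 1, the rank parameter divides the baseline by its scale factor, so the results are not always whole numbers. The sketch below assumes the baseline y = 1024 and rounds to the nearest integer; the rounding rule and the names `RANK_SCALE_FACTOR` and `rank_threshold` are assumptions made for illustration, not part of the disclosure.

```python
# Scale factors (m) from Table 2, keyed by the number of ranks.
RANK_SCALE_FACTOR = {1: 1.0, 2: 1.2, 4: 1.6}


def rank_threshold(ranks: int, baseline: int = 1024) -> int:
    """Return the per-window SBE threshold y / m, rounded to an integer."""
    return round(baseline / RANK_SCALE_FACTOR[ranks])


# Print the threshold for each rank count listed in Table 2.
for ranks in sorted(RANK_SCALE_FACTOR):
    print(f"{ranks}-rank DIMM: {rank_threshold(ranks)} SBEs")
```

With y = 1024 the dual-rank quotient 1024/1.2 is about 853.3, and the quad-rank quotient 1024/1.6 is 640.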
  • Still another memory-related parameter for inclusion in the calculation of the SBE value is module packaging of the memory 105 on the particular user computer system 100. For such, the following exemplary scale is provided:
  • TABLE 3
    SDRAM Data Width                                 Scale Factor (k)   PFA threshold in time window
    x8 (with no IBM® Chipkill™ technology support)   1                  z
    x8 (with IBM® Chipkill™ support)                 2                  2z
    x4 (with IBM® Chipkill™ support)                 2.5                2.5z

    IBM® Chipkill™ is an advanced error checking and correcting (ECC) computer technology that has the ability to correct multi-bit memory errors on a single SDRAM. Referring to Table 3, and assuming z=256 for a baseline PFA count within a 24-hour window, a memory 105 DIMM with additional advanced ECC protection, i.e., Chipkill™, affords a higher SBE value due to this individual PFA metric. For example, the memory-based parameter regarding Chipkill™ would allow 256 SBEs for an x8 DIMM with no Chipkill™, 512 SBEs for an x8 DIMM with Chipkill™, and 640 SBEs for an x4 DIMM with Chipkill™.
  • Yet another memory-related parameter for optional inclusion in the calculation of the SBE value is memory failure mode of the memory 105 on the particular user computer system 100. Here, this memory-related parameter regards single count reduction for a single memory address. That is, a correctable SBE that occurs repeatedly at the same memory address on memory 105 DIMM is counted as one failure instead of counting the repeats as multiple failures.
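The single-count reduction described above amounts to counting distinct failing addresses rather than raw error events. A minimal sketch follows, assuming the error log is available as a sequence of integer addresses; the function name `count_unique_sbe` and the sample addresses are hypothetical.

```python
from typing import Iterable


def count_unique_sbe(addresses: Iterable[int]) -> int:
    """Count correctable SBEs, treating repeats at one address as one failure."""
    return len(set(addresses))


# Five logged events, three of which repeat at the same address.
log = [0x1F40, 0x1F40, 0x2A00, 0x1F40, 0x3B10]
print(count_unique_sbe(log))  # 3 distinct addresses
```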
  • Another example of a memory-related parameter for optional inclusion in the calculation of the SBE value is vendor quality of the memory 105 on the particular user computer system 100. For such, the following exemplary scale is provided:
  • TABLE 4
    Vendor, Product       Scale Factor (m)
    Vendor A, Product 1   1
    Vendor A, Product 2   0.8
    Vendor B, Product 1   1
    Vendor C, Product 1   0.5

    Table 4 represents a memory vendor quality/reliability matrix on a per product basis. A memory vendor can have multiple products, each of which could have a different quality/reliability rating. A quality scale rating, such as that of Table 4, may be used for calculating the SBE value. A memory 105 DIMM from a supplier with a lower quality score yields a lower PFA threshold value for this memory-related parameter. A DIMM with a lower quality score would therefore require replacement or repair sooner than one with a higher quality score, provided all other PFA memory-related parameters contributing to the SBE value are constant.
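The effect of the vendor quality score can be illustrated as follows. The source gives no baseline or formula for Table 4, so this sketch assumes that the scale factor multiplies a baseline of 256 and that the result is rounded; both choices, along with the names `VENDOR_SCALE_FACTOR` and `vendor_threshold`, are assumptions made only for illustration.

```python
# Quality scale factors from Table 4, keyed by (vendor, product).
VENDOR_SCALE_FACTOR = {
    ("Vendor A", "Product 1"): 1.0,
    ("Vendor A", "Product 2"): 0.8,
    ("Vendor B", "Product 1"): 1.0,
    ("Vendor C", "Product 1"): 0.5,
}


def vendor_threshold(vendor: str, product: str, baseline: int = 256) -> int:
    """Scale an assumed baseline by the quality score of one product."""
    return round(VENDOR_SCALE_FACTOR[(vendor, product)] * baseline)


# A lower quality score yields a lower threshold, hence an earlier alert.
for (vendor, product) in VENDOR_SCALE_FACTOR:
    print(vendor, product, vendor_threshold(vendor, product))
```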
  • For calculation purposes, combination of the selected memory-related parameters may be through simple addition, multiplication, a mixture of the two, or any other combination method so as to yield a reliable, relative, and meaningful SBE value for SPFA. For example, the foregoing five memory-related parameters may be combined to calculate an SBE value according to: PFA(sum) = PFA(a) + PFA(b) + PFA(c) + PFA(d) + PFA(e). The value of each memory-related PFA threshold and time window(s) should be defined by the subject matter expert on the system design team. That is, the illustrative tables provided herein are neither the sole nor necessarily appropriate values to use because the same are solely intended as examples. Whether applied in a hardware built-in memory test, a power-on memory test (i.e., power-on self-test), a system at run time, or a memory diagnostic test, this disclosure enables a selectable and scalable PFA for memory 105 that thwarts consequences of memory failures for a particular user computer system 100.
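The combination step can be sketched as a small function that takes the per-parameter thresholds and the names of the parameters selected for the calculation. This is a minimal sketch under stated assumptions: the function `combine_thresholds`, its `method` argument, and the parameter names are hypothetical, and the sample values are the illustrative table results for a 4 GB, single-rank, x8 DIMM with Chipkill™ support.

```python
import math
from typing import Iterable, Mapping


def combine_thresholds(
    per_parameter: Mapping[str, float],
    selected: Iterable[str],
    method: str = "sum",
) -> float:
    """Combine the thresholds of the selected parameters into one SBE value."""
    values = [per_parameter[name] for name in selected]
    if method == "sum":
        return sum(values)          # simple addition, as in PFA(sum)
    if method == "product":
        return math.prod(values)    # simple multiplication
    raise ValueError(f"unsupported combination method: {method}")


# Illustrative per-parameter thresholds taken from the example tables.
per_parameter = {"module_size": 512, "rank": 1024, "packaging": 512}
selected = ["module_size", "rank", "packaging"]
print(combine_thresholds(per_parameter, selected))  # 512 + 1024 + 512 = 2048
```

Only addition and multiplication are shown; the text also allows mixtures of the two or other methods, which a design team would define for its own system.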
  • FIG. 2 depicts a block diagram of one embodiment of a computer system 200 suitable for use in scalable predictive failure analysis. Other configurations of the computer system 200 are possible, including a computer having capabilities other than those ascribed herein and possibly beyond those capabilities, and they may, in other embodiments, be any combination of processing devices such as workstations, servers, mainframe computers, notebook or laptop computers, desktop computers, PDAs, mobile phones, wireless devices, set-top boxes, or the like. At least certain of the components of computer system 200 may be mounted on a multi-layer planar or motherboard (which may itself be mounted on the chassis) to provide a means for electrically interconnecting the components of the computer system 200.
  • In the depicted embodiment, the computer system 200 includes a processor 202, storage 204, memory 206, a user interface adapter 208, and a display adapter 210 connected to a bus 212 or other interconnect. The bus 212 facilitates communication between the processor 202 and other components of the computer system 200, as well as communication between components. Processor 202 may include one or more system central processing units (CPUs) or processors to execute instructions, such as an IBM® PowerPC® processor, an Intel® Pentium® processor, an Advanced Micro Devices, Inc. processor or any other suitable processor. IBM and PowerPC are trademarks of International Business Machines Corporation, registered in many jurisdictions worldwide. Intel and Pentium are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. The processor 202 may utilize storage 204, which may be non-volatile storage such as one or more hard drives, tape drives, diskette drives, CD-ROM drive, DVD-ROM drive, or the like. The processor 202 may also be connected to memory 206 via bus 212, such as via a memory controller hub (MCH). System memory 206 may include volatile memory such as random access memory (RAM) or double data rate (DDR) synchronous dynamic random access memory (SDRAM). In the disclosed systems, for example, a processor 202 may execute instructions to perform functions, such as by gathering memory information and selecting memory-related parameters for inclusion for SPFA calculations. Information before, during or after calculations may temporarily or permanently be stored in storage 204 or memory 206.
  • Turning now to FIG. 3, another aspect of scalable predictive failure analysis for memory associated with a particular user computer system is disclosed. Shown is an example embodiment of a flowchart 300 for improved predictive failure analysis after the SBE value has been set for the user computer system. Flowchart 300 is for a system, such as system 100, notably involving the logic associated with the detection and comparison modules 115, 145 of FIG. 1.
  • Returning to FIG. 3, flowchart 300 starts 305 by the system detecting 310 SBEs on a DIMM via a system management interrupt (SMI). When the user computer system boots, the BIOS or other firmware implementation, such as Unified Extensible Firmware Interface (UEFI), establishes the interrupt factors. Upon the memory controller detecting 310 an SBE, an SMI is triggered to wake up the BIOS to check 320 the memory-related parameters and the SBE counts accumulated so far. Decision block 330 queries whether the SBE count value is at least equal to the set SBE value. If yes 340, then the flowchart 300 issues 350 an SPFA alert and optionally provides repair actions, such as displaying a visual notice to replace the specific faulty memory module or suggesting reparative procedures. If no 335, then the flowchart 300 returns to sleep, at least until the next SBE is counted, because the count of SBEs for the particular user computer system is less than the set SBE value. Subsequent to issuing 350 the alert with optional actions, or to no 335, the flowchart ends 375.
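The decision logic of flowchart 300 can be modeled as a counter compared against the set SBE value on every detected error. The Python class below is only a model of that logic, not firmware: in the disclosure the check runs in the BIOS after an SMI, and the class name `SbeMonitor` and method `on_sbe_detected` are hypothetical.

```python
class SbeMonitor:
    """Model of flowchart 300: count SBEs and compare against the set value."""

    def __init__(self, sbe_threshold: int) -> None:
        self.sbe_threshold = sbe_threshold  # the set SBE value
        self.sbe_count = 0                  # SBEs accumulated so far

    def on_sbe_detected(self) -> bool:
        """Handle one detected SBE; return True if an SPFA alert is issued."""
        self.sbe_count += 1                        # detect 310, check 320
        if self.sbe_count >= self.sbe_threshold:   # decision block 330
            return True                            # yes 340: issue alert 350
        return False                               # no 335: return to sleep


monitor = SbeMonitor(sbe_threshold=3)
results = [monitor.on_sbe_detected() for _ in range(4)]
print(results)  # [False, False, True, True]
```

The comparison uses "at least equal to", matching decision block 330, so the alert fires on the error that brings the count up to the set value.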
  • FIG. 4 illustrates information handling system 401 which is a simplified example of a computer system, such as shown in FIG. 2 for use in scalable predictive failure analysis, and capable of performing the operations described herein. Computer system 401 includes processor 400 which is coupled to host bus 405. A level two (L2) cache memory 410 is also coupled to the host bus 405. Host-to-PCI bridge 415 is coupled to main memory 420, includes cache memory and main memory control functions, and provides bus control to handle transfers among PCI bus 425, processor 400, L2 cache 410, main memory 420, and host bus 405. As an alternative to the foregoing, the level 2 cache 410, memory controller and the north bridge may be integrated into the CPU; then, the system main memory is connected to the memory controller, which is inside the CPU. PCI bus 425 provides an interface for a variety of devices including, for example, LAN card 430. PCI-to-ISA bridge 435 provides bus control to handle transfers between PCI bus 425 and ISA bus 440, universal serial bus (USB) functionality 445, IDE device functionality 450, power management functionality 455, and can include other functional elements not shown, such as a real-time clock (RTC), DMA control, interrupt support, and system management bus support. Peripheral devices and input/output (I/O) devices can be attached to various interfaces 460 (e.g., parallel interface 462, serial interface 464, infrared (IR) interface 466, keyboard interface 468, mouse interface 470, fixed disk (HDD) 472, removable storage device 474) coupled to ISA bus 440. Alternatively, many I/O devices can be accommodated by a super I/O controller (not shown) attached to ISA bus 440.
  • BIOS 480 is coupled to ISA bus 440, and incorporates the necessary processor executable code for a variety of low-level system functions and system boot functions. BIOS 480 can be stored in any computer readable medium, including magnetic storage media, optical storage media, flash memory, random access memory, read only memory, and communications media conveying signals encoding the instructions (e.g., signals from a network). In order to attach computer system 401 to another computer system to copy files over a network, LAN card 430 is coupled to PCI bus 425 and to PCI-to-ISA bridge 435. Similarly, to connect computer system 401 to an ISP to connect to the Internet using a telephone line connection, modem 475 is connected to serial port 464 and PCI-to-ISA Bridge 435.
  • While the computer systems described in FIGS. 2 and 4 are capable of executing the disclosure described herein, these computer systems are simply examples of computer systems and user computer systems. Those skilled in the art will appreciate that many other computer system designs are capable of performing the disclosure described herein.
  • Another embodiment of the disclosure is implemented as a program product for use within a device such as, for example, those systems and methods depicted in FIGS. 1 and 3. The program(s) of the program product define functions of the embodiments (including the methods described herein) and can be contained on a variety of media including but not limited to: (i) information permanently stored on non-volatile storage-type accessible media (e.g., writable and readable as well as read-only memory devices within a computer such as ROM, flash memory, CD-ROM disks readable by a CD-ROM drive); (ii) alterable information stored on writable storage-type accessible media (e.g., readable floppy disks within a diskette drive or hard-disk drive); and (iii) information conveyed to a computer through a network. The latter embodiment specifically includes information downloaded onto either permanent or even merely momentary storage-type accessible media from the World Wide Web, an internet, and/or other networks, such as those known, discussed and/or explicitly referred to herein. Such data-bearing media, when carrying computer-readable instructions that direct the functions of the present disclosure, represent embodiments of the present disclosure.
  • In general, the routines executed to implement the embodiments of this disclosure may be part of an operating system or a specific application, component, program, module, object, or sequence of instructions. The computer program of this disclosure typically comprises a multitude of instructions that will be translated by the native computer into a machine-readable format and hence into executable instructions. Also, programs comprise variables and data structures that either reside locally to the program or are found in memory or on storage devices. In addition, various programs described hereinafter may be identified based upon the application for which they are implemented in a specific embodiment of this disclosure. However, it should be appreciated that any particular program nomenclature that follows is used merely for convenience, and thus this disclosure should not be limited to use solely in any specific application identified and/or implied by such nomenclature.
  • While the foregoing is directed to example embodiments of this disclosure, other and further embodiments of this disclosure may be devised without departing from the basic scope thereof, and the scope thereof is determined by the claims that follow.

Claims (20)

1. A method for scalable predictive failure analysis, the method comprising:
gathering memory information for memory on a user computer system having at least one processor;
selecting one or more memory-related parameters;
calculating, based on the gathering and the selecting, a single bit error value for the scalable predictive failure analysis through calculations for each of the one or more memory-related parameters that utilize the memory information; and
setting, based on the calculating, the single bit error value for the user computer system.
2. The method of claim 1, further comprising detecting, subsequent to the setting, one or more single bit errors for the memory.
3. The method of claim 1, further comprising comparing, subsequent to the setting, a counted number of single bit errors for the memory to the value.
4. The method of claim 1, further comprising alerting, subsequent to the setting, if a counted number of single bit errors for the memory at least equals the single bit error value.
5. The method of claim 1, further comprising returning to sleep, subsequent to the setting, if a counted number of single bit errors for the memory fails to exceed the single bit error value.
6. The method of claim 1, further comprising re-setting, according to the method, the single bit error value for the user computer system upon a memory replacement.
7. The method of claim 1, further comprising reporting the single bit error value and any results from the method on a display associated with the user computer system.
8. A computer program product for scalable predictive failure analysis:
a computer readable storage device;
first program instructions to gather memory information for memory on a user computer system having at least one processor;
second program instructions to select one or more memory-related parameters;
third program instructions to calculate based on the gather and the select, a single bit error value for the scalable predictive failure analysis through calculations for each of the one or more memory-related parameters that utilize the memory information;
fourth program instructions to set, based on the calculate, the single bit error value for the user computer system; and
wherein the first, second, third, and fourth program instructions are stored on the computer readable storage device.
9. The computer program product of claim 8, further comprising fifth program instructions to detect, subsequent to the set, one or more single bit errors for the memory; and wherein the fifth program instructions are stored on the computer readable storage device.
10. The computer program product of claim 8, further comprising fifth program instructions to compare, subsequent to the set, a counted number of single bit errors for the memory to the value; and wherein the fifth program instructions are stored on the computer readable storage device.
11. The computer program product of claim 8, further comprising fifth program instructions to alert, subsequent to the set, if a counted number of single bit errors for the memory at least equals the single bit error value; and wherein the fifth program instructions are stored on the computer readable storage device.
12. The computer program product of claim 8, further comprising fifth program instructions to return to sleep, subsequent to the set, if a counted number of single bit errors for the memory fails to exceed the single bit error value; and wherein the fifth program instructions are stored on the computer readable storage device.
13. The computer program product of claim 8, further comprising fifth program instructions to re-set, according to the method, the single bit error value for the user computer system upon a memory replacement; and wherein the fifth program instructions are stored on the computer readable storage device.
14. A system for scalable predictive failure analysis, the system comprising:
a processor, a computer readable memory and a computer readable storage device;
first program instructions to gather memory information for memory on a user computer system having at least one processor;
second program instructions to select one or more memory-related parameters;
third program instructions to calculate based on the gather and the select, a single bit error value for the scalable predictive failure analysis through calculations for each of the one or more memory-related parameters that utilize the memory information;
fourth program instructions to set, based on the calculate, the single bit error value for the user computer system; and
wherein the first, second, third, and fourth program instructions are stored on the computer readable storage device for execution by the processor via the computer readable memory.
15. The system of claim 14, further comprising fifth program instructions to detect, subsequent to the set, one or more single bit errors for the memory; and wherein the fifth program instructions are stored on the computer readable storage device for execution by the processor via the computer readable memory.
16. The system of claim 14, further comprising fifth program instructions to compare, subsequent to the set, a counted number of single bit errors for the memory to the value; and wherein the fifth program instructions are stored on the computer readable storage device for execution by the processor via the computer readable memory.
17. The system of claim 14, further comprising fifth program instructions to alert, subsequent to the set, if a counted number of single bit errors for the memory at least equals the single bit error value; and wherein the fifth program instructions are stored on the computer readable storage device for execution by the processor via the computer readable memory.
18. The system of claim 14, further comprising fifth program instructions to return to sleep, subsequent to the setting, if a counted number of single bit errors for the memory fails to exceed the single bit error value; and wherein the fifth program instructions are stored on the computer readable storage device for execution by the processor via the computer readable memory.
19. The system of claim 14, further comprising fifth program instructions to re-set, according to the method, the single bit error value for the user computer system upon a memory replacement; and wherein the fifth program instructions are stored on the computer readable storage device for execution by the processor via the computer readable memory.
20. The system of claim 14, further comprising fifth program instructions to report the single bit error value and any results from the method on a display associated with the user computer system; and wherein the fifth program instructions are stored on the computer readable storage device.
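For illustration only, the four steps recited in claim 1 (gathering, selecting, calculating, setting) may be sketched as below. Nothing in this sketch is taken from the claims beyond the order of the steps: the parameter names, the per-parameter formulas, the base value of 10, and the rule of multiplying the per-parameter results together are all hypothetical placeholders chosen so the example runs.

```python
from typing import Callable, Dict, Iterable

MemoryInfo = Dict[str, float]
ParameterCalc = Callable[[MemoryInfo], float]

# Hypothetical per-parameter calculations (assumed, not from the claims).
PARAMETER_CALCS: Dict[str, ParameterCalc] = {
    "capacity": lambda info: info["capacity_gb"] / 4.0,  # larger memory scales the value up
    "speed": lambda info: 1333.0 / info["speed_mhz"],    # faster memory scales the value down
}


def gather_memory_info() -> MemoryInfo:
    """Step 1: gather memory information (hard-coded here for the example)."""
    return {"capacity_gb": 16.0, "speed_mhz": 1333.0}


def calculate_sbe_value(info: MemoryInfo, selected: Iterable[str], base: int = 10) -> int:
    """Step 3: run one calculation per selected parameter and combine the results."""
    value = float(base)
    for name in selected:
        value *= PARAMETER_CALCS[name](info)  # assumed combination rule: product
    return max(1, round(value))  # keep the value a positive whole count


def set_sbe_value(system_config: Dict[str, int], value: int) -> None:
    """Step 4: set the calculated SBE value for the user computer system."""
    system_config["sbe_value"] = value


info = gather_memory_info()                       # step 1: gathering
selected = ["capacity", "speed"]                  # step 2: selecting
sbe_value = calculate_sbe_value(info, selected)   # step 3: calculating
config: Dict[str, int] = {}
set_sbe_value(config, sbe_value)                  # step 4: setting
```

Keeping each parameter's calculation as a separate entry in a table means a parameter can be added or dropped by changing the selection alone, which is one plausible reading of "scalable" in this context rather than a statement of how the claimed method is implemented.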
US12/912,735 2010-10-26 2010-10-26 Scalable Prediction Failure Analysis For Memory Used In Modern Computers Abandoned US20120102367A1 (en)

Priority Applications (3)

Application Number Priority Date Filing Date Title
US12/912,735 US20120102367A1 (en) 2010-10-26 2010-10-26 Scalable Prediction Failure Analysis For Memory Used In Modern Computers
US14/011,222 US9196383B2 (en) 2010-10-26 2013-08-27 Scalable prediction failure analysis for memory used in modern computers
US14/823,384 US20150347211A1 (en) 2010-10-26 2015-08-11 Scalable prediction failure analysis for memory used in modern computers

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
US12/912,735 US20120102367A1 (en) 2010-10-26 2010-10-26 Scalable Prediction Failure Analysis For Memory Used In Modern Computers

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US14/011,222 Continuation US9196383B2 (en) 2010-10-26 2013-08-27 Scalable prediction failure analysis for memory used in modern computers

Publications (1)

Publication Number Publication Date
US20120102367A1 true US20120102367A1 (en) 2012-04-26

Family

ID=45974011

Family Applications (3)

Application Number Title Priority Date Filing Date
US12/912,735 Abandoned US20120102367A1 (en) 2010-10-26 2010-10-26 Scalable Prediction Failure Analysis For Memory Used In Modern Computers
US14/011,222 Expired - Fee Related US9196383B2 (en) 2010-10-26 2013-08-27 Scalable prediction failure analysis for memory used in modern computers
US14/823,384 Abandoned US20150347211A1 (en) 2010-10-26 2015-08-11 Scalable prediction failure analysis for memory used in modern computers

Country Status (1)

Country Link
US (3) US20120102367A1 (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8832455B1 (en) * 2011-09-21 2014-09-09 Google Inc. Verified boot path retry
WO2023015699A1 (en) * 2021-08-12 2023-02-16 惠州Tcl云创科技有限公司 Method for debugging android platform camera module, storage medium, and terminal device

Families Citing this family (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109933448B (en) * 2014-12-25 2021-04-20 华为技术有限公司 Method and device for predicting fault of nonvolatile storage medium
CN105988918B (en) 2015-02-26 2019-03-08 阿里巴巴集团控股有限公司 The method and apparatus for predicting GPU failure
US10048877B2 (en) * 2015-12-21 2018-08-14 Intel Corporation Predictive memory maintenance
US10482040B2 (en) 2017-12-21 2019-11-19 International Business Machines Corporation Method, system, and apparatus for reducing processor latency
RU2718215C2 (en) 2018-09-14 2020-03-31 Общество С Ограниченной Ответственностью "Яндекс" Data processing system and method for detecting jam in data processing system
RU2714219C1 (en) 2018-09-14 2020-02-13 Общество С Ограниченной Ответственностью "Яндекс" Method and system for scheduling transfer of input/output operations
RU2731321C2 (en) * 2018-09-14 2020-09-01 Общество С Ограниченной Ответственностью "Яндекс" Method for determining a potential fault of a storage device
RU2714602C1 (en) 2018-10-09 2020-02-18 Общество С Ограниченной Ответственностью "Яндекс" Method and system for data processing
RU2721235C2 (en) 2018-10-09 2020-05-18 Общество С Ограниченной Ответственностью "Яндекс" Method and system for routing and execution of transactions
RU2711348C1 (en) 2018-10-15 2020-01-16 Общество С Ограниченной Ответственностью "Яндекс" Method and system for processing requests in a distributed database
US10783025B2 (en) * 2018-10-15 2020-09-22 Dell Products, L.P. Method and apparatus for predictive failure handling of interleaved dual in-line memory modules
RU2714373C1 (en) 2018-12-13 2020-02-14 Общество С Ограниченной Ответственностью "Яндекс" Method and system for scheduling execution of input/output operations
RU2749649C2 (en) 2018-12-21 2021-06-16 Общество С Ограниченной Ответственностью "Яндекс" Method and system for scheduling processing of i/o operations
RU2720951C1 (en) 2018-12-29 2020-05-15 Общество С Ограниченной Ответственностью "Яндекс" Method and distributed computer system for data processing
RU2746042C1 (en) 2019-02-06 2021-04-06 Общество С Ограниченной Ответственностью "Яндекс" Method and the system for message transmission
KR20230036730A (en) 2021-09-08 2023-03-15 삼성전자주식회사 Memory controller and the memory system comprising the same

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5959537A (en) * 1998-07-09 1999-09-28 Mcgraw-Edison Company Variable trip fault indicator
US20040215912A1 (en) * 2003-04-24 2004-10-28 George Vergis Method and apparatus to establish, report and adjust system memory usage
US20060101308A1 (en) * 2004-10-21 2006-05-11 Agarwal Manoj K System and method for problem determination using dependency graphs and run-time behavior models
US20070006048A1 (en) * 2005-06-29 2007-01-04 Intel Corporation Method and apparatus for predicting memory failure in a memory system
US7328376B2 (en) * 2003-10-31 2008-02-05 Sun Microsystems, Inc. Error reporting to diagnostic engines based on their diagnostic capabilities
US20100306598A1 (en) * 2009-06-02 2010-12-02 International Business Machines Corporation Operating Computer Memory
US8074062B2 (en) * 2008-08-11 2011-12-06 Dell Products, L.P. Method and system for using a server management program for an error configuration table
US20120079314A1 (en) * 2010-09-27 2012-03-29 International Business Machines Corporation Multi-level dimm error reduction

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20060206764A1 (en) 2005-03-11 2006-09-14 Inventec Corporation Memory reliability detection system and method
US7474989B1 (en) 2005-03-17 2009-01-06 Rockwell Collins, Inc. Method and apparatus for failure prediction of an electronic assembly using life consumption and environmental monitoring
KR20070017749A (en) 2005-08-08 2007-02-13 (주) 기산텔레콤 Method for Predicting Life of Non-Volatile Random Access Memory of Communication Apparatus
US7631228B2 (en) * 2006-09-12 2009-12-08 International Business Machines Corporation Using bit errors from memory to alter memory command stream
US7356442B1 (en) 2006-10-05 2008-04-08 International Business Machines Corporation End of life prediction of flash memory
CN101266840B (en) 2008-04-17 2012-05-23 北京航空航天大学 A life prediction method for flash memory electronic products
TWI486764B (en) * 2009-10-30 2015-06-01 Silicon Motion Inc Data storage device, controller, and method for data access of a downgrade memory
US8639964B2 (en) * 2010-03-17 2014-01-28 Dell Products L.P. Systems and methods for improving reliability and availability of an information handling system

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
"Determination of Excessive Single-Bit Errors", September 1, 1994, IBM Technical Disclosure Bulletin, Vol 37, iss 9, page 267-270. *

Also Published As

Publication number Publication date
US9196383B2 (en) 2015-11-24
US20150347211A1 (en) 2015-12-03
US20140013170A1 (en) 2014-01-09

Similar Documents

Publication Publication Date Title
US9196383B2 (en) Scalable prediction failure analysis for memory used in modern computers
US10789117B2 (en) Data error detection in computing systems
US7945815B2 (en) System and method for managing memory errors in an information handling system
US7702971B2 (en) System and method for predictive failure detection
US10146651B2 (en) Member replacement in an array of information storage devices
US20070088988A1 (en) System and method for logging recoverable errors
US20090150721A1 (en) Utilizing A Potentially Unreliable Memory Module For Memory Mirroring In A Computing System
US8812915B2 (en) Determining whether a right to use memory modules in a reliability mode has been acquired
US20150074467A1 (en) Method and System for Predicting Storage Device Failures
US20140188829A1 (en) Technologies for providing deferred error records to an error handler
US20190026239A1 (en) System and Method to Correlate Corrected Machine Check Error Storm Events to Specific Machine Check Banks
US10936411B2 (en) Memory scrub system
US20180246775A1 (en) System and Method for Providing Predictive Failure Detection on DDR5 DIMMs Using On-Die ECC
US20080244302A1 (en) System and method to enable an event timer in a multiple event timer operating environment
US8122176B2 (en) System and method for logging system management interrupts
US8122291B2 (en) Method and system of error logging
US10613953B2 (en) Start test method, system, and recording medium
US10268598B2 (en) Primary memory module with record of usage history
US11593209B2 (en) Targeted repair of hardware components in a computing device
US11080124B2 (en) System and method for targeted efficient logging of memory failures
KR101001071B1 (en) Method and apparatus of reporting memory bit correction
US11126502B2 (en) Systems and methods for proactively preventing and predicting storage media failures

Legal Events

Date Code Title Description
AS Assignment

Owner name: INTERNATIONAL BUSINESS MACHINES CORPORATION, NEW Y

Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:DANG, TU T.;ELLIS, MICHAEL C.;HERNANDEZ, JUAN Q.;AND OTHERS;SIGNING DATES FROM 20101027 TO 20101102;REEL/FRAME:025672/0741

STCB Information on status: application discontinuation

Free format text: ABANDONED -- FAILURE TO PAY ISSUE FEE