US20080162890A1 - Computer processing system employing an instruction reorder buffer - Google Patents

Computer processing system employing an instruction reorder buffer

Info

Publication number: US20080162890A1
Authority: US (United States)
Prior art keywords: superrob, processors, instruction, instructions, ilp
Legal status: Granted
Application number: US11/531,042
Other versions: US7395416B1 (en)
Inventor: Sumedh W. Sathaye
Original assignee: International Business Machines Corp
Current assignee: International Business Machines Corp

Events:
    • Application filed by International Business Machines Corp
    • Priority to US11/531,042 (published as US7395416B1)
    • Assigned to International Business Machines Corporation (assignor: Sathaye, Sumedh W.)
    • Priority to US12/127,844 (published as US20090055633A1)
    • Priority to US12/127,845 (published as US20080229077A1)
    • Application granted
    • Publication of US7395416B1
    • Publication of US20080162890A1
    • Status: Expired - Fee Related


Classifications

    • G: Physics
    • G06: Computing; Calculating or Counting
    • G06F: Electric Digital Data Processing
    • G06F9/00: Arrangements for program control, e.g. control units
    • G06F9/06: Arrangements for program control using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30: Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/38: Concurrent instruction execution, e.g. pipeline, look ahead
    • G06F9/3806: Instruction prefetching for branches using address prediction, e.g. return stack, branch history buffer
    • G06F9/3836: Instruction issuing, e.g. dynamic instruction scheduling or out-of-order instruction execution
    • G06F9/3838: Dependency mechanisms, e.g. register scoreboarding
    • G06F9/384: Register renaming
    • G06F9/3851: Instruction issuing from multiple instruction streams, e.g. multistreaming
    • G06F9/3856: Reordering of instructions, e.g. using queues or age tags
    • G06F9/3858: Result writeback, i.e. updating the architectural state or memory

Abstract

A method and a system for operating a plurality of processors that each includes an execution pipeline for processing dependence chains, the method comprising: configuring the plurality of processors to execute the dependence chains on execution pipelines; implementing a Super Re-Order Buffer (SuperROB) in which received instructions are re-ordered after out-of-order execution when at least one of the plurality of processors is in an Instruction Level Parallelism (ILP) mode and at least one of the plurality of processors has a Thread Level Parallelism (TLP) core; detecting an imbalance in a dispatch of instructions of a first dependence chain compared to a dispatch of instructions of a second dependence chain with respect to dependence chain priority; determining a source of the imbalance; and activating the ILP mode when the source of the imbalance has been determined.

Description

    GOVERNMENT INTEREST
  • This invention was made with Government support under Contract No. NBCH3039004 awarded by the Defense Advanced Research Projects Agency (DARPA). The Government has certain rights in this invention.
  • TRADEMARKS
  • IBM® is a registered trademark of International Business Machines Corporation, Armonk, N.Y., U.S.A. Other names used herein may be registered trademarks, trademarks or product names of International Business Machines Corporation or other companies.
  • BACKGROUND OF THE INVENTION
  • 1. Field of the Invention
  • This invention relates to employing an instruction reorder buffer, and particularly to a technique that takes at least two processors that are optimized to execute dependence chains and co-locates them with a superstructure called a SuperROB (Super Re-Order Buffer).
  • 2. Description of Background
  • Many processors designed today are optimized for execution of tight dependence chains. A dependence chain is a sequence of instructions in a program in which a temporally sequential instruction is data-dependent on a temporally previous instruction. Examples of key data dependence paths that processors optimize are: load-compare-branch, load-load, load-compute, and compute-compute latencies. Examples of such processors are: the PPE (Power Processing Element) core on the Sony-Toshiba-IBM Broadband Engine, the IBM Power3 core, Itanium cores from Intel®, and almost all of the modern cores implementing z/Architecture technologies.
  • Current research in processor technology and computer architecture is motivated primarily by the desire for greater performance. Greater performance may be achieved by increasing parallelism in execution. There are two kinds of parallelism in typical program workloads. These are Instruction Level Parallelism (ILP) and Thread Level Parallelism (TLP). Some modern computer processors are specifically designed to capture ILP in programs (for example, IBM Power4 & 5, Intel Pentium), while multiprocessor systems are designed to capture TLP across threads or processes. Processor cores that are optimized to execute dependence chains are often also expected to execute ILP workloads. ILP workloads have more than one concurrent dependence chain, and overlapped execution of the chains is typically possible, provided the ILP between the chains has been exposed and exploited by the machine.
  • The evolution of microprocessor design has led to processors with higher clock frequencies to improve single-thread performance. These processors exploit ILP to speed up single-threaded applications. ILP attempts to increase performance by determining, at run time, instructions that can be executed in parallel. The trade-off is that ILP extraction requires highly complex microprocessors that consume a significant amount of power.
  • Thus, it is well known that different processor technologies utilize the ILP and TLP workloads differently to achieve greater processor performance. However, in existing ILP and TLP system architectures it is difficult to optimize the processor for both high-throughput TLP-oriented and ILP-oriented applications. It is very cumbersome to map ILP applications on one or more TLP cores. Thus, alternative processor architectures are necessary for providing ILP extraction on demand, for allowing global communication, for allowing efficient ILP exposition, extraction, and exploitation, and for efficiently operating across a plurality of TLP cores.
  • SUMMARY OF THE INVENTION
  • The shortcomings of the prior art are overcome and additional advantages are provided through the provision of a method for operating a plurality of processors that each includes an execution pipeline for processing dependence chains, the method comprising: configuring the plurality of processors to execute the dependence chains on execution pipelines; implementing a Super Re-Order Buffer (SuperROB) in which received instructions are re-ordered for out-of-order execution when at least one of the plurality of processors is in an Instruction Level Parallelism (ILP) mode and at least one of the plurality of processors has a Thread Level Parallelism (TLP) core; detecting an imbalance in a dispatch of instructions of a first dependence chain compared to a dispatch of instructions of a second dependence chain with respect to dependence chain priority; determining a source of the imbalance; and activating the ILP mode when the source of the imbalance has been determined.
  • The shortcomings of the prior art are overcome and additional advantages are provided through the provision of a system for operating a plurality of processors that each includes an execution pipeline for processing dependence chains, the system comprising: a network; and a host system in communication with the network, the host system including software to implement a method comprising: configuring the plurality of processors to execute the dependence chains on execution pipelines; implementing a Super Re-Order Buffer (SuperROB) in which received instructions are re-ordered for out-of-order execution when at least one of the plurality of processors is in an Instruction Level Parallelism (ILP) mode and at least one of the plurality of processors has a Thread Level Parallelism (TLP) core; detecting an imbalance in a dispatch of instructions of a first dependence chain compared to a dispatch of instructions of a second dependence chain with respect to dependence chain priority; determining a source of the imbalance; and activating the ILP mode when the source of the imbalance has been determined.
  • Additional features and advantages are realized through the techniques of the present invention. Other embodiments and aspects of the invention are described in detail herein and are considered a part of the claimed invention. For a better understanding of the invention with advantages and features, refer to the description and the drawings.
  • TECHNICAL EFFECTS
  • As a result of the summarized invention, we have technically achieved a solution that takes at least two processors that are optimized to execute dependence chains and co-locates them with a superstructure called a SuperROB (Super Re-Order Buffer).
  • BRIEF DESCRIPTION OF THE DRAWINGS
  • The subject matter, which is regarded as the invention, is particularly pointed out and distinctly claimed in the claims at the conclusion of the specification. The foregoing and other objects, features, and advantages of the invention are apparent from the following detailed description taken in conjunction with the accompanying drawings in which:
  • FIG. 1 illustrates one example of an Instruction Level Parallelism (ILP) workload;
  • FIG. 2 illustrates one example of a Thread Level Parallelism (TLP) workload;
  • FIG. 3 illustrates one example of a Single Instruction, Multiple Data (SIMD) vector workload;
  • FIG. 4 illustrates one example of a TLP chip and a TLP & ILP Chip including a SuperROB;
  • FIG. 5 illustrates one example of an in-order core for the TLP workload;
  • FIG. 6 illustrates one example of a Super Re-Order Buffer (SuperROB);
  • FIG. 7 illustrates one example of a SuperROB operated in the TLP workload mode;
  • FIG. 8 illustrates one example of a SuperROB operated in the ILP workload mode;
  • FIG. 9 illustrates one example of a SuperROB per entry diagram;
  • FIG. 10 illustrates one example of a manner in which two cores are connected to each other by a SuperROB structure;
  • FIG. 11 illustrates one example of a SuperROB in ILP mode having an Ifetch working with a single trace cache line; and
  • FIG. 12 illustrates one example of a SuperROB shown as a series of queues.
  • DETAILED DESCRIPTION OF THE INVENTION
  • One aspect of the exemplary embodiments is a superstructure called SuperROB (Super Re-Order Buffer) that operates across a plurality of TLP cores. Another aspect of the exemplary embodiments is a method of mapping ILP applications on a TLP core by providing for ILP extraction on demand.
  • For a long time, the secret to more performance was to execute more instructions per cycle, otherwise known as ILP, or to decrease the effective latency of instructions. To execute more instructions each cycle, more functional units (e.g., integer, floating point, load/store units, etc.) had to be added. In order to more consistently execute multiple instructions, a processing paradigm called out-of-order processing (OOP) may be used. FIG. 1 illustrates one example of an ILP workload using such a processing paradigm.
  • In FIG. 1, there are three semi-independent chains of dependences that contain load instructions. Key data dependence paths that the processor optimizes are compute-compute latencies. Furthermore, high-accuracy branch prediction is usually a necessary condition for improving the performance of high-ILP workloads. In order to achieve high execution performance in a program area having high instruction-level parallelism, the processor contains large computational resources. Conversely, in a program area having low instruction-level parallelism, even a processor containing small computational resources can achieve sufficient performance.
  • Furthermore, concerning FIG. 1, the ILP program contains multiple chains of instructions such that the instructions in each chain are clearly data-dependent upon each other, but the chains themselves are mostly data-independent of each other. As shown, there are three data-dependence chains in the program, and the first 10 and the third 14 chains of dependences are dependent on the last operation in the middle 12 chain. Chain 10, in turn, is dependent on the last operation in the rightmost chain, chain 14. Across the three chains 10, 12, 14, there is opportunity to overlap the execution of computation instructions with that of other computation instructions, and the execution of long-latency memory accesses with that of computations. It is usually necessary to provide highly accurate branch prediction hardware so as to be able to continue the supply of non-speculative instructions to the main pipeline. This nature of ILP programs can be exploited by processor hardware that allows multiple issue of data-independent instructions. Examples of processor hardware that fall in this category are the IBM Power4 and Power5 processors, the AMD Opteron processor, and the Intel Pentium 4 processor. A toy simulation of this overlap appears below.
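  • The following is a minimal sketch, not taken from the patent, of why mostly independent chains expose ILP: a toy C++ dataflow simulator in which an instruction issues once its producer has completed. The chain shapes and the one-cycle latency are invented for illustration; the point is that total time tracks the longest chain, not the instruction count.

      // Toy dataflow issue model: an instruction issues in the first cycle
      // after its producer (dep) has completed; independent chains overlap.
      #include <cstdio>
      #include <vector>

      struct Insn { int dep; int doneCycle = -1; };  // dep == -1: no producer

      int main() {
          std::vector<Insn> prog = {
              {-1}, {0}, {1},   // chain A: three dependent instructions
              {-1}, {3},        // chain B: two dependent instructions
              {-1}, {5},        // chain C: two dependent instructions
          };
          int retired = 0, cycle = 0;
          while (retired < static_cast<int>(prog.size())) {
              for (auto& in : prog) {
                  if (in.doneCycle >= 0) continue;          // already executed
                  bool ready = in.dep < 0 ||
                      (prog[in.dep].doneCycle >= 0 &&
                       prog[in.dep].doneCycle < cycle);     // producer done earlier
                  if (ready) { in.doneCycle = cycle; ++retired; }
              }
              ++cycle;
          }
          // Seven instructions finish in three cycles: the longest chain length.
          std::printf("%zu instructions finished in %d cycles\n",
                      prog.size(), cycle);
          return 0;
      }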
  • FIG. 2 illustrates one example of a TLP workload. In FIG. 2, there is one semi-independent chain of dependence that contains load instructions. The key data dependence path that the processor optimizes is a compute-compute latency. TLP is the parallelism inherent in an application that runs multiple threads at once. This type of parallelism is found largely in applications written for commercial servers, such as databases. By running many threads at once, these applications are able to tolerate the high amounts of I/O and memory system latency their workloads can incur. As a result, while one thread is delayed waiting for a memory or disk access, other threads can do ‘useful’ work in order to keep the processor running efficiently.
  • Furthermore, concerning FIG. 2, the program in the center of the figure is a pure data-dependence chain 16. Each instruction in the program is data-dependent on the immediately previous instruction. Thus, the execution of an instruction cannot begin until the result datum or the outcome of the previous instruction is available. The hardware complexity of processor hardware with multiple, independent instruction issue capability proves to be an unnecessary burden when executing a data-dependence chain program. In addition, thread-level parallelism in a multiprocessor architecture depends considerably on how efficient the parallel algorithms are, as well as how efficient the multiprocessor architecture itself is. Scalability of the parallel algorithms is a significant characteristic, since running large algorithms in the multiprocessor architecture is essential.
  • FIG. 3 illustrates a SIMD workload. In computing, SIMD (Single Instruction, Multiple Data) is a set of operations for efficiently handling large quantities of data in parallel, as in a vector processor or array processor. First popularized in large-scale supercomputers (as opposed to MIMD (Multiple Instruction, Multiple Data) parallelization), smaller-scale SIMD operations have now become widespread in personal computer hardware. Today the term is associated almost entirely with these smaller units. An advantage is that SIMD systems typically include only those instructions that can be applied to all of the data in one operation. In other words, if the SIMD system works by loading up eight data points at once, the “add” operation being applied to the data occurs to all eight values at the same time. Although the same is true for any superscalar processor design, the level of parallelism in a SIMD system is typically much higher.
  • SIMD architectures are essential in the parallel world of computers. The ability of SIMD to manipulate large vectors and matrices in minimal time has created a phenomenal demand for these architectures. The power behind this type of architecture can be realized when the number of processor elements is equivalent to the size of the vector. In this situation, component-wise addition and multiplication of vector elements can be done simultaneously. Even when the size of the vector is larger than the number of processor elements available, the speedup is immense. There are two types of SIMD architectures: the first is the True SIMD and the second is the Pipelined SIMD.
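  • As a concrete illustration of the eight-wide “add” described above, the following minimal C++ sketch (with invented values) applies one logical addition across eight lanes; a vectorizing compiler maps such a loop onto a single SIMD instruction where the hardware supports it.

      // One logical "add" applied to eight data points at once.
      #include <array>
      #include <cstdio>

      int main() {
          std::array<float, 8> a{1, 2, 3, 4, 5, 6, 7, 8};
          std::array<float, 8> b{8, 7, 6, 5, 4, 3, 2, 1};
          std::array<float, 8> c{};
          for (int i = 0; i < 8; ++i)   // eight lanes, one logical operation
              c[i] = a[i] + b[i];
          for (float v : c) std::printf("%.0f ", v);  // prints: 9 9 9 9 9 9 9 9
          std::printf("\n");
          return 0;
      }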
  • Furthermore, concerning FIG. 3, the program is a data-parallel program, and is shown in the rightmost program representation. The instructions in a data-parallel program operate on data structures that are vectors, rather than scalars. Data-parallel programs can either be of the ILP nature or take the form of a data-dependence chain.
  • The exemplary embodiments of the present invention provide a mechanism to “morph” a computer processor complex, each element of which is designed and optimized to perform work of one kind, into a complex which can, with relatively high efficiency, perform another kind of work. In doing so, the processor complex transforms itself, on demand, into a single processing structure. Each pair of cores on the TLP chip is connected using a SuperROB (super-instruction re-order buffer). The concept of the SuperROB is an extension of the re-order buffer (ROB) used in modern ILP processors.
  • The SuperROB is shown as a queue 44 in FIG. 4. The top portion of FIG. 4 is a TLP chip 40 and the bottom portion of FIG. 4 is a TLP & ILP chip 42 configuration. The basic idea is that when presented with an ILP program, the two cores transform themselves into behaving as one. Therefore, instructions are supplied to the two cores by means of the SuperROB and the state of each instruction is captured in a single entry in the SuperROB. Also, the architected state of the program is captured in the register file of one of the two cores. The SuperROB thus is a mechanism of global communication of program values, and a mechanism to expose, explore, and exploit the instruction-level parallelism inherent in an ILP program. The plurality of cores supplied for the purposes of TLP are combined in an innovative fashion to also target ILP programs.
  • FIG. 5 illustrates an in-order core for TLP workloads. FIG. 5 depicts an instruction memory 50, instruction data 52, stored data 54, “data memory” data 56, and a data memory 58. In FIG. 5, there are several semi-independent chains of dependences that contain load instructions. Key data dependence paths that the processor optimizes are compute-compute, load-to-use, and compare-to-branch latencies. Furthermore, the in-order processor comprises multiple execution pipelines; there is no register renaming in the processor pipeline, and no mechanism to enforce orderly completion of instructions to maintain the sanctity of the architectural state. Thus, the instructions are not issued out of order.
  • The out-of-order instruction processing in OOP necessitates a mechanism to store the instructions in the original program order. If a temporally later instruction causes an exception before a temporally earlier instruction, then the exception must be withheld from recognition until the temporally earlier instruction has completed execution and updated the architected state as appropriate. To address this problem, a large number of instructions are stored in program order in a buffer called the re-order buffer to allow precise exception handling. While precise exception handling is the primary motivation behind having a reorder buffer, it has also been used to find more instructions that are not dependent upon each other. The size of reorder buffers has been growing in most modern commercial computer architectures, with some processors able to keep as many as 126 instructions in flight. The reason for increasing the size of the reorder buffer is that spatially related code also tends to be temporally related in terms of execution (with the possible exclusion of arrays of complex structures and linked lists). These instructions also have a tendency to depend upon the outcome of prior instructions. As the amount of code a CPU must execute keeps growing, the only current way to find and accommodate the execution of more independent instructions has been to increase the size of the reorder buffer. However, this technique is showing diminishing returns: it now takes more and more transistors to achieve the same rate of performance increase. Instead of focusing intently upon uniprocessor ILP extraction, it is desired to focus on super re-order buffers that may co-locate a plurality of buffers within a superstructure.
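  • To make the exception-withholding behavior concrete, here is a minimal sketch; the PCs and completion order are invented, and this is a simplification rather than the patent's design. Results and faults are recorded out of order, but the buffer is drained strictly from the head, so a fault raised early by a younger instruction is not recognized until every older instruction has retired.

      // In-order retirement from a re-order buffer gives precise exceptions.
      #include <cstdio>
      #include <deque>

      struct RobEntry { int pc; bool done = false; bool faulted = false; };

      int main() {
          std::deque<RobEntry> rob = {{0x100}, {0x104}, {0x108}};
          rob[2].done = true; rob[2].faulted = true;  // youngest faults first
          rob[0].done = true;                         // older instructions
          rob[1].done = true;                         // complete afterwards
          while (!rob.empty() && rob.front().done) {  // retire in program order
              if (rob.front().faulted) {
                  std::printf("precise exception at pc=0x%x\n", rob.front().pc);
                  rob.clear();                        // squash younger work
                  break;
              }
              std::printf("retired pc=0x%x\n", rob.front().pc);
              rob.pop_front();
          }
          return 0;
      }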
  • FIG. 6 illustrates one example of a Super Re-Order Buffer (SuperROB). FIG. 6 depicts a first instruction memory 60, a first TLP core 62, a first data memory 64, a SuperROB 66, a second instruction memory 68, a second TLP core 70, and a second data memory 72. The SuperROB architecture provides for ILP extraction on demand, it operates across a plurality of TLP cores, it allows for global communication, and it allows for efficient ILP exposition, extraction, and exploitation. FIG. 6 shows two TLP cores that are separated by a buffer (SuperROB). The SuperROB acts as the communication mechanism between the two TLP cores. When the processor is in TLP mode, then the SuperROB is turned off. When the processor is in ILP mode, then the SuperROB is turned on.
  • All contemporary dynamically scheduled processors support register renaming to cope with false data dependences. One of the ways to implement register renaming is to use the slots within the Reorder Buffer (ROB) as physical registers. In such designs, the ROB is a large multi-ported structure that occupies a significant portion of the die area and dissipates a sizable fraction of the total chip power. The heavily ported ROB is also likely to have a large delay that can limit the processor clock rate. However, by utilizing a SuperROB these delays may be minimized.
  • The method of using a reorder buffer for committing (retiring) instructions in sequence in an out-of-order processor has been fundamental to out-of-order processor design. In the case of a complex instruction set computer (CISC) architecture, complex instructions are cracked (mapped) into sequences of primitive instructions. Nullification in case of an exception is a problem for these instructions, because the exception may occur late in the sequence of primitive instructions.
  • FIG. 7 illustrates one example of a SuperROB operated in the TLP workload mode and FIG. 8 illustrates one example of a SuperROB operated in the ILP workload mode. As noted above, in the TLP mode, the SuperROB is turned off. However, in the ILP mode, the SuperROB is turned on in order to facilitate instruction management. Also, instructions are received by at least two of the plurality of processors from a single input source. In other words, renaming based on a SuperROB uses a physical register file that is the same size as the architectural register file, together with a set of registers arranged as a queue data structure. This facilitates faster processing. Moreover, the cache may be accessed every alternate fetch cycle, thus providing even greater processing performance. The ICache is shared, and one of the cores (which one is a matter of convention) places requests for the two subsequent cache lines to fetch instructions from. “Next line A” is sent to the first core, and “next-next line B” is sent to the other core. The fetch logic for each of the two cores places their instructions in the SuperROB in the original program order. After that point in time, the available instructions in the SuperROB can be picked up and worked on by either of the two cores.
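  • A minimal sketch of this fetch arrangement follows; the line size and addresses are assumptions of mine, not values from the patent. One core requests two sequential cache lines per fetch cycle: line A feeds the first core and the next-next line B feeds the second.

      // Shared-ICache fetch: sequential lines alternate between two cores.
      #include <cstdio>

      constexpr unsigned kLineBytes = 32;  // assumed cache-line size

      int main() {
          unsigned fetchAddr = 0x1000;     // assumed starting fetch address
          for (int cycle = 0; cycle < 3; ++cycle) {
              unsigned lineA = fetchAddr;               // "next line A"  -> core 0
              unsigned lineB = fetchAddr + kLineBytes;  // "next-next B"  -> core 1
              std::printf("fetch cycle %d: core0 gets 0x%x, core1 gets 0x%x\n",
                          cycle, lineA, lineB);
              fetchAddr += 2 * kLineBytes;  // both lines consumed per fetch cycle
          }
          return 0;
      }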
  • In FIG. 8, as instructions are issued, they are assigned entries for any results they may generate at the tail of the SuperROB. That is, a place is reserved in the queue. The logical order of instructions within this buffer is maintained, so that if four instructions are issued at once, e.g., i to i+3, i is put in the reorder buffer first, followed by i+1, i+2 and i+3. As instruction execution proceeds, the assigned entry is ultimately filled in by a value representing the result of the instruction. When entries reach the head of the SuperROB, provided they have been filled in with their actual intended result, they are removed, and each value is written to its intended architectural register. If the value is not yet available, retirement must wait until the value becomes available. Because instructions take variable times to execute, and because they may be executed out of program order, it may be found that the SuperROB entry at the head of the queue is still waiting to be filled while later entries are ready. In this case, all entries behind the unfilled slot must stay in the SuperROB until the head instruction completes its operations.
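  • The head/tail discipline just described can be sketched as follows (a simplification with invented values, not the patent's implementation): entries are allocated at the tail in program order, filled out of order, and drained from the head only while the head entry holds its result, so finished younger entries wait behind an unfilled head.

      // Allocate at tail in program order; commit from head only when filled.
      #include <cstddef>
      #include <cstdio>
      #include <optional>
      #include <vector>

      struct Entry { int insnId; std::optional<int> result; };

      int main() {
          std::vector<Entry> superRob;
          for (int i = 0; i < 4; ++i)            // issue i..i+3 in order
              superRob.push_back({i, std::nullopt});

          superRob[0].result = 11;               // oldest completes
          superRob[2].result = 42;               // a younger one completes early

          std::size_t head = 0;
          while (head < superRob.size() && superRob[head].result) {
              std::printf("commit insn %d -> %d\n",
                          superRob[head].insnId, *superRob[head].result);
              ++head;                            // insn 0 commits
          }
          // Head (insn 1) is unfilled, so finished insn 2 must keep waiting.
          std::printf("stalled at head insn %d; %zu entries still in flight\n",
                      superRob[head].insnId, superRob.size() - head);
          return 0;
      }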
  • FIG. 9 shows the structure of each entry in the SuperROB. Each entry has a back or front pointer field, which is used by the ROB management hardware as a circular queue of ROB entries. That is followed by a set of status flags per entry, which indicate if the entry is being worked on by a core, or is available to be worked on. Next are two fields used exclusively to hold the prediction and the outcome of branch instructions. Next is a series of three fields, two for source register operands in the instruction, and one for the target register operand. Each source register field holds the id or number of the ROB entry that produced the value, which is useful in determining if the instruction is ready for execution. The target register field holds the architected register name into which the target register value must be committed when the instruction is retired. The value of the operand is also held along with each register field. For a store instruction which has no target register operand, the target register value is used to hold the datum to be stored in memory. More fields could be added on a per-instruction basis, and managed as needed.
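  • Rendered as a data structure, the entry described above might look like the following C++ sketch. The field names and widths are my guesses; the patent names the fields but not their encodings.

      // One SuperROB entry, following the field list in the text above.
      #include <cstdint>
      #include <cstdio>

      struct SuperRobEntry {
          int16_t link;              // back/front pointer: circular-queue linkage
          bool inProgress;           // status: being worked on by a core
          bool available;            // status: free to be picked up by a core
          bool predictedTaken;       // branch prediction field
          bool resolvedTaken;        // branch outcome field
          struct Src {
              int16_t producerEntry; // ROB entry that produces this operand
              int64_t value;         // operand value, held once available
              bool ready;
          } src[2];                  // two source register operands
          struct Dst {
              uint8_t archReg;       // architected register to commit into
              int64_t value;         // result value; for stores, the store datum
              bool ready;
          } dst;                     // target register operand
      };

      int main() {
          std::printf("one entry occupies %zu bytes in this sketch\n",
                      sizeof(SuperRobEntry));
          return 0;
      }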
  • Therefore, the processor, via the SuperROB, becomes a pure dataflow micro-architecture, where each entry in the SuperROB holds all the data pertaining to a single instruction in flight. The data contained may be source register values (as and when available), target register values (as values are produced), memory store values (for store instructions), and branch outcome values (predicates). The instructions are fetched in program order by using a protocol followed by the two TLP front-ends, as illustrated in FIG. 9. One SuperROB entry is allocated for each decoded instruction. Also, each fetched instruction could come from separate ICaches, a trace cache, or other cache types. As further shown in FIG. 9, the decode logic of each pipeline operates independently of the other. Thus, both pipelines of cores A and B of FIG. 8 monitor the SuperROB and pick up and perform work when it is available. The results of the work are written back to the appropriate SuperROB entry.
  • Moreover, independently decoupled state machines operate in a purely dataflow fashion. In other words, a state machine decodes an instruction to rename its source operands (either to the numbers of temporally preceding SuperROB entries, or to values fetched from architected registers). The state machine also fetches values from SuperROB entries and updates the sources of the waiting instructions. The state machine also marks the instructions that are ready to be executed and dispatches instructions to the execution backend. The backend logic updates the appropriate SuperROB entry upon completion. As a result, there are no separate bypasses between the two independent execution backends, and all the communication between the two pipelines is carried out via the SuperROB.
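  • The renaming step performed by these state machines can be sketched as below (my interpretation, with invented register numbers): each source operand is renamed to the nearest temporally preceding SuperROB entry that writes the same architected register, falling back to the architected register file when no such entry exists.

      // Rename a source to the youngest older in-flight producer, if any.
      #include <cstdio>
      #include <vector>

      struct Entry { int destReg; };  // architected register each entry writes

      // Returns the SuperROB index of the producer, or -1 meaning
      // "read the architected register file".
      int renameSource(const std::vector<Entry>& rob, int myIndex, int reg) {
          for (int i = myIndex - 1; i >= 0; --i)
              if (rob[i].destReg == reg) return i;
          return -1;
      }

      int main() {
          // Entries 0..2 stand for: r3 = ...; r5 = ...; r7 = r3 + r5
          std::vector<Entry> rob = {{3}, {5}, {7}};
          std::printf("src r3 of entry 2 -> entry %d\n", renameSource(rob, 2, 3));
          std::printf("src r5 of entry 2 -> entry %d\n", renameSource(rob, 2, 5));
          std::printf("src r9 of entry 2 -> %d (architected file)\n",
                      renameSource(rob, 2, 9));
          return 0;
      }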
  • In addition, the exemplary embodiments of the present application are not limited to the structures in FIGS. 1-9. In other words, more than two cores could be connected to ‘morph’ the processor. Also, it is possible to hold actual values in a separate future/history file (with or without a separate architected register file). The state machine may also fetch instructions every alternate cycle from the ICaches or from an IFetch buffer. Therefore, there may be variations based on pre-decode information that is available from the ICaches. Also, a split of the SuperROB is possible. The split may be for a register data-flow and for a memory data-flow (separate load/store associative lookup queue). Furthermore, variations on the contents of SuperROB entries are allowed, variations based on the basic nature of the TLP core are allowed, and variations based on simultaneous multithreading (SMT) or non-SMT designs are allowed.
  • Referring to FIG. 10, a manner is shown in which two cores, individually designed for efficient execution of data-dependence chain code, are connected to each other by means of the SuperROB structure. The SuperROB is a queue of instructions, with each entry also holding other information about the instruction. The computer system operates in either TLP (thread-level parallel) mode or ILP mode. When in TLP mode, it is understood that the programs to be executed on the system are data-dependence chain programs. When in ILP mode, the programs to be executed on the system are ILP programs. The SuperROB is disabled when the computer is in TLP mode, and it is enabled when the computer is in ILP mode. The change of mode could be carried out in a variety of ways: for example, under explicit control of the programmer, under implicit control of the OS or the hypervisor, or under pure hardware control, with the processor having monitoring hardware that watches the degree of data dependence among instructions over time and switches the mode from TLP to ILP or vice versa.
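  • As one example of the pure-hardware policy, the sketch below monitors a window of recent instructions and enables ILP mode when enough of them are independent of their predecessor. The window size and threshold are invented for illustration; the patent does not specify them.

      // Hardware-style mode selection from a dependence-monitoring window.
      #include <cstdio>

      enum class Mode { TLP, ILP };

      Mode pickMode(const bool dependsOnPrev[], int window) {
          int independent = 0;
          for (int i = 0; i < window; ++i)
              if (!dependsOnPrev[i]) ++independent;
          // Assumed threshold: more than 25% independent instructions => ILP.
          return (independent * 4 > window) ? Mode::ILP : Mode::TLP;
      }

      int main() {
          // Alternating dependence pattern: half the instructions independent.
          bool trace[8] = {true, false, true, false, true, false, true, false};
          Mode m = pickMode(trace, 8);
          std::printf("monitor selects %s mode\n",
                      m == Mode::ILP ? "ILP" : "TLP");
          return 0;
      }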
  • Referring to FIG. 11, in the ILP mode the instruction fetch logic is shown working with a single trace cache line A (the prediction for which is supplied by one of the two cores). The trace cache now holds a single ILP program (which is unified rather than shared, as in the TLP mode). Part of the trace line is placed in the SuperROB by one core, and the remaining part is placed by the other core.
  • Referring to FIG. 12, the SuperROB is shown as a series of queues, each queue feeding the next, as a physical implementation of a logically single SuperROB structure. This could work with either a regular ICache or a trace cache; a sketch of such a chained arrangement follows.
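Under assumed segment sizes, each segment drains into the next so that the chain behaves as one logical, program-ordered SuperROB:

    #define SEG_SIZE 32  /* assumed per-segment capacity */
    #define NUM_SEGS  4  /* assumed number of segments   */

    typedef struct {
        superrob_entry_t slot[SEG_SIZE];
        int head, tail, count;  /* circular-buffer bookkeeping */
    } rob_segment_t;

    /* Each cycle, every segment hands its oldest entry to the next
     * segment when there is room, preserving overall program order. */
    static void advance_segments(rob_segment_t seg[]) {
        for (int s = NUM_SEGS - 2; s >= 0; s--) {
            if (seg[s].count > 0 && seg[s + 1].count < SEG_SIZE) {
                rob_segment_t *src = &seg[s], *dst = &seg[s + 1];
                dst->slot[dst->tail] = src->slot[src->head];
                dst->tail = (dst->tail + 1) % SEG_SIZE;
                dst->count++;
                src->head = (src->head + 1) % SEG_SIZE;
                src->count--;
            }
        }
    }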
  • Moreover, instructions are placed in the SuperROB, in program order, by one or both of the IFetch logic stages connected to it. Once instructions are placed in the SuperROB, the Decode logic stages from both cores carry out instruction decode and update the status of the instructions. The Issue logic stages from the two cores pick up decoded instructions and issue them to their respective execution back-ends. One of the two register files, chosen by convention, is used to hold the architected state of the program; the other is not used. When an instruction completes execution on either the Execute logic stages or the Access logic stages, the instruction's status is updated in the SuperROB. This general manner of execution continues as long as the mode of the machine remains the ILP mode. It is to be generally understood that the ICache shown in the figure holds a single program for execution when in ILP mode. A toy sketch of this per-cycle flow follows.
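Tying the stages together, a per-cycle scan over the shared structure might look as follows; the stand-in execute() and the single flat array are simplifications of the two-core arrangement described above, not the patented logic:

    /* Stand-in ALU; a real backend would dispatch on the opcode. */
    static uint64_t execute(const superrob_entry_t *e) {
        return e->src_value[0] + e->src_value[1];
    }

    /* One cycle of the shared-SuperROB flow: wake up decoded entries,
     * issue ready ones, and complete issued ones in place. In the
     * scheme above, both cores' Decode/Issue stages share this work;
     * here a single loop stands in for both. */
    static void ilp_cycle(superrob_entry_t rob[], int count) {
        for (int i = 0; i < count; i++) {
            switch (rob[i].status) {
            case DECODED:
                if (rob[i].src_ready[0] && rob[i].src_ready[1])
                    rob[i].status = READY;            /* wakeup    */
                break;
            case READY:
                rob[i].status = ISSUED;               /* dispatch  */
                break;
            case ISSUED:
                rob[i].tgt_value = execute(&rob[i]);  /* backend   */
                rob[i].tgt_ready = true;
                rob[i].status = DONE;                 /* writeback */
                break;
            default:
                break;
            }
        }
    }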
  • The capabilities of the present invention can be implemented in software, firmware, hardware or some combination thereof.
  • As one example, one or more aspects of the present invention can be included in an article of manufacture (e.g., one or more computer program products) having, for instance, computer usable media. The media has embodied therein, for instance, computer readable program code means for providing and facilitating the capabilities of the present invention. The article of manufacture can be included as a part of a computer system or sold separately.
  • Additionally, at least one program storage device readable by a machine, tangibly embodying at least one program of instructions executable by the machine to perform the capabilities of the present invention can be provided.
  • The flow diagrams depicted herein are just examples. There may be many variations to these diagrams or the steps (or operations) described therein without departing from the spirit of the invention. For instance, the steps may be performed in a differing order, or steps may be added, deleted or modified. All of these variations are considered a part of the claimed invention.
  • While the preferred embodiment of the invention has been described, it will be understood that those skilled in the art, both now and in the future, may make various improvements and enhancements which fall within the scope of the claims which follow.
  • These claims should be construed to maintain the proper protection for the invention first described.

Claims (20)

1. A method for operating a plurality of processors that each includes an execution pipeline for processing dependence chains, the method comprising:
configuring the plurality of processors to execute the dependence chains on execution pipelines;
implementing a Super Re-Order Buffer (SuperROB) in which received instructions are re-ordered for out-of-order execution when at least one of the plurality of processors is in an Instruction Level Parallelism (ILP) mode and at least one of the plurality of processors has a Thread Level Parallelism (TLP) core;
detecting an imbalance in a dispatch of instructions of a first dependence chain compared to a dispatch of instructions of a second dependence chain with respect to dependence chain priority;
determining a source of the imbalance; and
activating the ILP mode when the source of the imbalance has been determined.
2. The method of claim 1, wherein the plurality of processors are configured for load-to-use, compute-to-compute, compute-to-compare, and load-to-compare-to-branch latencies.
3. The method of claim 1, wherein the plurality of processors are configured for high-throughput TLP-oriented applications.
4. The method of claim 1, wherein the plurality of processors are configured for ILP extraction on demand.
5. The method of claim 1, wherein each of the plurality of processors has a plurality of execution pipelines.
6. The method of claim 1, wherein the SuperROB operates across a plurality of TLP cores.
7. The method of claim 1, wherein the SuperROB allows for global communication.
8. The method of claim 1, wherein the SuperROB allows for ILP exposition, extraction, and exploitation.
9. The method of claim 1, wherein the SuperROB is deactivated whenever each of the plurality of processors is in TLP mode.
10. The method of claim 1, wherein entries in the SuperROB are in a non-architected state.
11. The method of claim 10, wherein the entries in the SuperROB are source register values, target register values, memory store values, and branch outcome values.
12. The method of claim 10, wherein each of the entries in the SuperROB is allocated for each decoded instruction.
13. The method of claim 1, wherein each of the received instructions is fetched from a separate cache.
14. The method of claim 1, wherein each of the received instructions is fetched from an instruction cache or from a portion of a trace cache line or a normal cache line and is placed into the SuperROB by one of a plurality of instruction fetch logic elements of the plurality of processors.
15. The method of claim 1, wherein the execution pipelines of each of the plurality of processors monitor the SuperROB.
16. The method of claim 1, wherein the SuperROB is split into a first region for register data-flow and a second region for memory data-flow.
17. The method of claim 1, wherein the SuperROB is split into a first region for instruction fetch, a second region for instruction decode and dispatch, and a third region for instruction issue to execution units and instruction execution.
18. The method of claim 1, wherein the received instructions are received from at least two of the plurality of processors from a single input source.
19. (canceled)
20. (canceled)
US11/531,042 2006-09-12 2006-09-12 Computer processing system employing an instruction reorder buffer Expired - Fee Related US7395416B1 (en)

Priority Applications (3)

Application Number Priority Date Filing Date Title
US11/531,042 US7395416B1 (en) 2006-09-12 2006-09-12 Computer processing system employing an instruction reorder buffer
US12/127,844 US20090055633A1 (en) 2006-09-12 2008-05-28 Computer processing system employing an instruction reorder buffer
US12/127,845 US20080229077A1 (en) 2006-09-12 2008-05-28 Computer processing system employing an instruction reorder buffer

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
US11/531,042 US7395416B1 (en) 2006-09-12 2006-09-12 Computer processing system employing an instruction reorder buffer

Related Child Applications (2)

Application Number Title Priority Date Filing Date
US12/127,844 Continuation US20090055633A1 (en) 2006-09-12 2008-05-28 Computer processing system employing an instruction reorder buffer
US12/127,845 Continuation US20080229077A1 (en) 2006-09-12 2008-05-28 Computer processing system employing an instruction reorder buffer

Publications (2)

Publication Number Publication Date
US7395416B1 US7395416B1 (en) 2008-07-01
US20080162890A1 true US20080162890A1 (en) 2008-07-03

Family

ID=39561219

Family Applications (3)

Application Number Title Priority Date Filing Date
US11/531,042 Expired - Fee Related US7395416B1 (en) 2006-09-12 2006-09-12 Computer processing system employing an instruction reorder buffer
US12/127,844 Abandoned US20090055633A1 (en) 2006-09-12 2008-05-28 Computer processing system employing an instruction reorder buffer
US12/127,845 Abandoned US20080229077A1 (en) 2006-09-12 2008-05-28 Computer processing system employing an instruction reorder buffer

Family Applications After (2)

Application Number Title Priority Date Filing Date
US12/127,844 Abandoned US20090055633A1 (en) 2006-09-12 2008-05-28 Computer processing system employing an instruction reorder buffer
US12/127,845 Abandoned US20080229077A1 (en) 2006-09-12 2008-05-28 Computer processing system employing an instruction reorder buffer

Country Status (1)

Country Link
US (3) US7395416B1 (en)


Families Citing this family (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8386712B2 (en) * 2006-10-04 2013-02-26 International Business Machines Corporation Structure for supporting simultaneous storage of trace and standard cache lines
US7644233B2 (en) * 2006-10-04 2010-01-05 International Business Machines Corporation Apparatus and method for supporting simultaneous storage of trace and standard cache lines
JP5125659B2 (en) * 2008-03-24 2013-01-23 富士通株式会社 Information processing apparatus, information processing method, and computer program
US8310494B2 (en) * 2008-09-30 2012-11-13 Apple Inc. Method for reducing graphics rendering failures
US9104399B2 (en) * 2009-12-23 2015-08-11 International Business Machines Corporation Dual issuing of complex instruction set instructions
WO2013101138A1 (en) * 2011-12-30 2013-07-04 Intel Corporation Identifying and prioritizing critical instructions within processor circuitry
CN104424158A (en) * 2013-08-19 2015-03-18 上海芯豪微电子有限公司 General unit-based high-performance processor system and method
US9495159B2 (en) * 2013-09-27 2016-11-15 Intel Corporation Two level re-order buffer
GB2539041B (en) * 2015-06-05 2019-10-02 Advanced Risc Mach Ltd Mode switching in dependence upon a number of active threads
US10282205B2 (en) * 2015-10-14 2019-05-07 International Business Machines Corporation Method and apparatus for execution of threads on processing slices using a history buffer for restoring architected register data via issued instructions
US10073699B2 (en) * 2015-10-14 2018-09-11 International Business Machines Corporation Processing instructions in parallel with waw hazards and via a distributed history buffer in a microprocessor having a multi-execution slice architecture
US10289415B2 (en) * 2015-10-14 2019-05-14 International Business Machines Corporation Method and apparatus for execution of threads on processing slices using a history buffer for recording architected register data
US9870039B2 (en) 2015-12-15 2018-01-16 International Business Machines Corporation Reducing power consumption in a multi-slice computer processor
CN111783954B (en) * 2020-06-30 2023-05-02 安徽寒武纪信息科技有限公司 Method, electronic device and storage medium for determining performance of neural network
CN112559054B (en) * 2020-12-22 2022-02-01 上海壁仞智能科技有限公司 Method and computing system for synchronizing instructions


Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6112019A (en) * 1995-06-12 2000-08-29 Georgia Tech Research Corp. Distributed instruction queue
US6311261B1 (en) * 1995-06-12 2001-10-30 Georgia Tech Research Corporation Apparatus and method for improving superscalar processors
US7155600B2 (en) * 2003-04-24 2006-12-26 International Business Machines Corporation Method and logical apparatus for switching between single-threaded and multi-threaded execution states in a simultaneous multi-threaded (SMT) processor
US20050050303A1 (en) * 2003-06-30 2005-03-03 Roni Rosner Hierarchical reorder buffers for controlling speculative execution in a multi-cluster system
US20050071613A1 (en) * 2003-09-30 2005-03-31 Desylva Chuck Instruction mix monitor

Cited By (21)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8768898B1 (en) * 2007-04-26 2014-07-01 Netapp, Inc. Performing direct data manipulation on a storage device
US20090210666A1 (en) * 2008-02-19 2009-08-20 Luick David A System and Method for Resolving Issue Conflicts of Load Instructions
US8095779B2 (en) 2008-02-19 2012-01-10 International Business Machines Corporation System and method for optimization within a group priority issue schema for a cascaded pipeline
US20090210669A1 (en) * 2008-02-19 2009-08-20 Luick David A System and Method for Prioritizing Floating-Point Instructions
US20090210671A1 (en) * 2008-02-19 2009-08-20 Luick David A System and Method for Prioritizing Store Instructions
US20090210677A1 (en) * 2008-02-19 2009-08-20 Luick David A System and Method for Optimization Within a Group Priority Issue Schema for a Cascaded Pipeline
US20090210665A1 (en) * 2008-02-19 2009-08-20 Bradford Jeffrey P System and Method for a Group Priority Issue Schema for a Cascaded Pipeline
US20090210672A1 (en) * 2008-02-19 2009-08-20 Luick David A System and Method for Resolving Issue Conflicts of Load Instructions
US20090210673A1 (en) * 2008-02-19 2009-08-20 Luick David A System and Method for Prioritizing Compare Instructions
US20090210667A1 (en) * 2008-02-19 2009-08-20 Luick David A System and Method for Optimization Within a Group Priority Issue Schema for a Cascaded Pipeline
US20090210670A1 (en) * 2008-02-19 2009-08-20 Luick David A System and Method for Prioritizing Arithmetic Instructions
US20090210668A1 (en) * 2008-02-19 2009-08-20 Luick David A System and Method for Optimization Within a Group Priority Issue Schema for a Cascaded Pipeline
US7865700B2 (en) 2008-02-19 2011-01-04 International Business Machines Corporation System and method for prioritizing store instructions
US7870368B2 (en) 2008-02-19 2011-01-11 International Business Machines Corporation System and method for prioritizing branch instructions
US7877579B2 (en) 2008-02-19 2011-01-25 International Business Machines Corporation System and method for prioritizing compare instructions
US7882335B2 (en) 2008-02-19 2011-02-01 International Business Machines Corporation System and method for the scheduling of load instructions within a group priority issue schema for a cascaded pipeline
US7984270B2 (en) * 2008-02-19 2011-07-19 International Business Machines Corporation System and method for prioritizing arithmetic instructions
US7996654B2 (en) * 2008-02-19 2011-08-09 International Business Machines Corporation System and method for optimization within a group priority issue schema for a cascaded pipeline
US20090210676A1 (en) * 2008-02-19 2009-08-20 Luick David A System and Method for the Scheduling of Load Instructions Within a Group Priority Issue Schema for a Cascaded Pipeline
US8108654B2 (en) 2008-02-19 2012-01-31 International Business Machines Corporation System and method for a group priority issue schema for a cascaded pipeline
US20090210674A1 (en) * 2008-02-19 2009-08-20 Luick David A System and Method for Prioritizing Branch Instructions

Also Published As

Publication number Publication date
US7395416B1 (en) 2008-07-01
US20090055633A1 (en) 2009-02-26
US20080229077A1 (en) 2008-09-18

Similar Documents

Publication Publication Date Title
US7395416B1 (en) Computer processing system employing an instruction reorder buffer
CN108027807B (en) Block-based processor core topology register
CN108027771B (en) Block-based processor core composition register
US7055021B2 (en) Out-of-order processor that reduces mis-speculation using a replay scoreboard
US8335911B2 (en) Dynamic allocation of resources in a threaded, heterogeneous processor
US20170371660A1 (en) Load-store queue for multiple processor cores
US20030149865A1 (en) Processor that eliminates mis-steering instruction fetch resulting from incorrect resolution of mis-speculated branch instructions
US20170371659A1 (en) Load-store queue for block-based processor
US11726912B2 (en) Coupling wide memory interface to wide write back paths
US11829762B2 (en) Time-resource matrix for a microprocessor with time counter for statically dispatching instructions
US20230393852A1 (en) Vector coprocessor with time counter for statically dispatching instructions
US20230350679A1 (en) Microprocessor with odd and even register sets
US20230350680A1 (en) Microprocessor with baseline and extended register sets
US11829187B2 (en) Microprocessor with time counter for statically dispatching instructions
US11954491B2 (en) Multi-threading microprocessor with a time counter for statically dispatching instructions
US20240020120A1 (en) Vector processor with vector data buffer
US20230273796A1 (en) Microprocessor with time counter for statically dispatching instructions with phantom registers
US20230350685A1 (en) Method and apparatus for a scalable microprocessor with time counter
US20230342153A1 (en) Microprocessor with a time counter for statically dispatching extended instructions
US20230342148A1 (en) Microprocessor with non-cacheable memory load prediction
US20230315474A1 (en) Microprocessor with apparatus and method for replaying instructions
Lee et al. Parallel in-order execution architecture for low-power processor
Dutta-Roy Instructional Level Parallelism
Nemirovsky et al. Implementations of Multithreaded Processors
Nemirovsky et al. Fine-Grain Multithreading

Legal Events

Date Code Title Description
AS Assignment

Owner name: INTERNATIONAL BUSINESS MACHINES CORPORATION, NEW YORK

Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNOR:SATHAYE, SUMEDH W.;REEL/FRAME:018235/0123

Effective date: 20060911

FEPP Fee payment procedure

Free format text: PAYOR NUMBER ASSIGNED (ORIGINAL EVENT CODE: ASPN); ENTITY STATUS OF PATENT OWNER: LARGE ENTITY

FEPP Fee payment procedure

Free format text: PAYER NUMBER DE-ASSIGNED (ORIGINAL EVENT CODE: RMPN); ENTITY STATUS OF PATENT OWNER: LARGE ENTITY

Free format text: PAYOR NUMBER ASSIGNED (ORIGINAL EVENT CODE: ASPN); ENTITY STATUS OF PATENT OWNER: LARGE ENTITY

REMI Maintenance fee reminder mailed
FPAY Fee payment

Year of fee payment: 4

SULP Surcharge for late payment
REMI Maintenance fee reminder mailed
LAPS Lapse for failure to pay maintenance fees
STCH Information on status: patent discontinuation

Free format text: PATENT EXPIRED DUE TO NONPAYMENT OF MAINTENANCE FEES UNDER 37 CFR 1.362

FP Lapsed due to failure to pay maintenance fee

Effective date: 20160701