COMPUTER PRODUCT FOR PRECISE
ARCHITECTURAL UPDATE IN AN OUT-OF-
CROSS-REFERENCES TO RELATED APPLICATIONS
The subject matter of the present application is related to that of co-pending U.S. patent application Ser. No. 08/881,958 identified as Docket No. P2345/37178.830071.000 for AN APPARATUS FOR HANDLING ALIASED FLOATING-POINT REGISTERS IN AN OUT-OF-ORDER PROCESSOR filed concurrently herewith by Ramesh Panwar; Ser. No. 08/881,729 identified as Docket No. P2346/37178.830072.000 for APPARATUS FOR PRECISE ARCHITECTURAL UPDATE IN AN OUT-OF-ORDER PROCESSOR filed concurrently herewith by Ramesh Panwar and Arjun Prabhu; Ser. No. 08/881,726 identified as Docket No. P2348/37178.830073.000 for AN APPARATUS FOR NON-INTRUSIVE CACHE FILLS AND HANDLING OF LOAD MISSES filed concurrently herewith by Ramesh Panwar and Ricky C. Hetherington; Ser. No. 08/882,908 identified as Docket No. P2349/37178.830074.000 for AN APPARATUS FOR HANDLING COMPLEX INSTRUCTIONS IN AN OUT-OF-ORDER PROCESSOR filed concurrently herewith by Ramesh Panwar and Dani Y. Dakhil; Ser. No. 08/881,173 identified as Docket No. P2350/37178.830075.000 for AN APPARATUS FOR ENFORCING TRUE DEPENDENCIES IN AN OUT-OF-ORDER PROCESSOR filed concurrently herewith by Ramesh Panwar and Dani Y. Dakhil; Ser. No. 08/881,145 identified as Docket No. P2351/37178.830076.000 for APPARATUS FOR DYNAMICALLY RECONFIGURING A PROCESSOR filed concurrently herewith by Ramesh Panwar and Ricky C. Hetherington; Ser. No. 08/881,732 identified as Docket No. P2353/37178.830077.000 for APPARATUS FOR ENSURING FAIRNESS OF SHARED EXECUTION RESOURCES AMONGST MULTIPLE PROCESSES EXECUTING ON A SINGLE PROCESSOR filed concurrently herewith by Ramesh Panwar and Joseph I. Chamdani; Ser. No. 08/882,175 identified as Docket No. P2355/37178.830078.000 for SYSTEM FOR EFFICIENT IMPLEMENTATION OF MULTI-PORTED LOGIC FIFO STRUCTURES IN A PROCESSOR filed concurrently herewith by Ramesh Panwar; Ser. No. 08/882,311 identified as Docket No. P2365/37178.830080.000 for AN APPARATUS FOR MAINTAINING PROGRAM CORRECTNESS WHILE ALLOWING LOADS TO BE BOOSTED PAST STORES IN AN OUT-OF-ORDER MACHINE filed concurrently herewith by Ramesh Panwar, P. K. Chidambaran and Ricky C. Hetherington; Ser. No. 08/881,731 identified as Docket No. P2369/37178.830081.000 for APPARATUS FOR TRACKING PIPELINE RESOURCES IN A SUPERSCALAR PROCESSOR filed concurrently herewith by Ramesh Panwar; Ser. No. 08/882,525 identified as Docket No. P2370/37178.830082.000 for AN APPARATUS FOR RESTRAINING OVER-EAGER LOAD BOOSTING IN AN OUT-OF-ORDER MACHINE filed concurrently herewith by Ramesh Panwar and Ricky C. Hetherington; Ser. No. 08/882,220 identified as Docket No. P2371/37178.830083.000 for AN APPARATUS FOR HANDLING REGISTER WINDOWS IN AN OUT-OF-ORDER PROCESSOR filed concurrently herewith by Ramesh Panwar and Dani Y. Dakhil; Ser. No. 08/881,847 identified as Docket No. P2372/37178.830084.000 for AN APPARATUS FOR DELIVERING PRECISE TRAPS AND INTERRUPTS IN AN OUT-OF-ORDER PROCESSOR filed concurrently herewith by Ramesh Panwar; Ser. No. 08/881,728 identified as Docket No. P2398/37178.830085.000 for NON-BLOCKING HIERARCHICAL CACHE THROTTLE filed concurrently herewith by Ricky C. Hetherington and Thomas M. Wicki; Ser. No. 08/881,727 identified as Docket No. P2406/37178.830086.000 for NONTHRASHABLE NON-BLOCKING HIERARCHICAL CACHE filed concurrently herewith by Ricky C. Hetherington, Sharad Mehrotra and Ramesh Panwar; Ser. No. 08/881,065 identified as Docket No. P2408/37178.830087.000 for IN-LINE BANK CONFLICT DETECTION AND RESOLUTION IN A MULTI-PORTED NON-BLOCKING CACHE filed concurrently herewith by Ricky C. Hetherington, Sharad Mehrotra and Ramesh Panwar; and Ser. No. 08/882,613 identified as Docket No. P2434/37178.830088.000 for SYSTEM FOR THERMAL OVERLOAD DETECTION AND PREVENTION FOR AN INTEGRATED CIRCUIT PROCESSOR filed concurrently herewith by Ricky C. Hetherington and Ramesh Panwar, the disclosures of which applications are herein incorporated by this reference.
BACKGROUND OF THE INVENTION
1. Field of the Invention
The present invention relates in general to microprocessors and, more particularly, to a system, method, and microprocessor architecture providing precise state updates in an out-of-order machine.
2. Relevant Background
Early computer processors (also called microprocessors) included a central processing unit or instruction execution unit that executed only one instruction at a time. As used herein the term processor includes complex instruction set computers (CISC), reduced instruction set computers (RISC) and hybrids. In response to the need for improved performance several techniques have been used to extend the capabilities of these early processors including pipelining, superpipelining, superscaling, speculative instruction execution, and out-of-order instruction execution.
Pipelined architectures break the execution of instructions into a number of stages where each stage corresponds to one step in the execution of the instruction. Pipelined designs increase the rate at which instructions can be executed by allowing a new instruction to begin execution before a previous instruction is finished executing. Pipelined architectures have been extended to "superpipelined" or "extended pipeline" architectures where each execution pipeline is broken down into even smaller stages (i.e., microinstruction granularity is increased). Superpipelining increases the number of instructions that can be executed in the pipeline at any given time.
"Superscalar" processors generally refer to a class of microprocessor architectures that include multiple pipelines that process instructions in parallel. Superscalar processors typically execute more than one instruction per clock cycle, on average. Superscalar processors allow parallel instruction execution in two or more instruction execution pipelines. The number of instructions that may be processed is increased due to parallel execution. Each of the execution pipelines may have a differing number of stages. Some of the pipelines may be optimized for specialized functions such as integer operations or floating point operations, and in some cases execution pipelines are optimized for processing graphic, multimedia, or complex math instructions.
The goal of superscalar and superpipeline processors is to execute multiple instructions per cycle (IPC). Instruction-level parallelism (ILP) available in programs can be exploited to realize this goal; however, this potential parallelism requires that instructions be dispatched for execution at a sufficient rate. Conditional branching instructions create a problem for instruction fetching because the instruction fetch unit (IFU) cannot know with certainty which instructions to fetch until the conditional branch instruction is resolved. Also, when a branch is detected, the target address of the instructions following the branch must be predicted to supply those instructions for execution.
Recent processor architectures use a branch prediction unit to predict the outcome of branch instructions allowing the fetch unit to fetch subsequent instructions according to the predicted outcome. Branch prediction techniques are known that can predict branch outcomes with greater than 95% accuracy. These instructions are "speculatively executed" to allow the processor to make forward progress during the time the branch instruction is resolved. When the prediction is correct, the results of the speculative execution can be used as correct results, greatly improving processor speed and efficiency. When the prediction is incorrect, the completely or partially executed instructions must be flushed from the processor and execution of the correct branch initiated.
Early processors executed instructions in an order determined by the compiled machine-language program running on the processor and so are referred to as "in-order" or "sequential" processors. In superscalar processors multiple pipelines can simultaneously process instructions only when there are no data dependencies between the instructions in each pipeline. Data dependencies cause one or more pipelines to "stall" waiting for the dependent data to become available. This is further complicated in superpipelined processors where, because many instructions exist simultaneously in each pipeline, the potential quantity of data dependencies is large. Hence, greater parallelism and higher performance are achieved by "out-of-order" processors that include multiple pipelines in which instructions are processed in parallel in any efficient order that takes advantage of opportunities for parallel processing that may be provided by the instruction code.
Although out-of-order processing greatly improves throughput, it also increases complexity as compared to simple sequential processors. One area of increased complexity relates to state recovery following an unpredicted change of instruction flow. At any time during execution many instructions may be in the execution stage, some awaiting scheduling, some being executed, and some having completed execution but awaiting retirement. In the event that a change of instruction flow is detected during execution of an instruction, the instructions preceding that instruction must proceed to retirement, but the instructions following should be discarded. In other words, the state of the processor at the time of the change in instruction flow must be recovered in order for execution to continue properly. State recovery restores the pipeline to the state that would have existed had the mispredicted instructions not been processed. Hence, one particular problem with superscalar processors is state recovery following an unexpected change of instruction flow caused by internal or external events such as interrupts, exceptions, and branch instructions.
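By way of illustration only, the split between instructions that proceed to retirement and instructions that are discarded can be sketched in software as follows. This is a minimal sketch and not the mechanism of the invention: the names Instr, seq, and recover are hypothetical, and the sketch assumes that the instruction detecting the change of flow (for example, a resolved branch) is itself kept.

```python
from dataclasses import dataclass


@dataclass
class Instr:
    seq: int    # hypothetical program-order sequence number
    name: str   # mnemonic, for display only


def recover(window, redirect_seq):
    """Split the in-flight window at the instruction that changed flow.

    Assumption: the redirecting instruction itself is kept, as for a
    resolved branch. Older instructions proceed to retirement; younger
    instructions are discarded.
    """
    kept = [i for i in window if i.seq <= redirect_seq]
    flushed = [i for i in window if i.seq > redirect_seq]
    return kept, flushed


# Four live instructions; the branch at seq 1 turns out to be mispredicted.
window = [Instr(0, "add"), Instr(1, "branch"), Instr(2, "load"), Instr(3, "mul")]
kept, flushed = recover(window, redirect_seq=1)
print([i.name for i in kept])
print([i.name for i in flushed])
```

In this sketch the comparison on seq stands in for whatever age-tracking hardware a real processor would use.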
Out-of-order execution can result in conflicts between instructions attempting to use the same registers even though these instructions are otherwise independent. Instructions may produce two general types of actions when executed: storing results that are directed to an architectural register location and setting condition codes (CCs) that are directed
to one or more architectural condition code registers (CCRs). The results and CCs for any instruction that is speculatively executed cannot be stored in the architectural registers until all conditions prior to the instruction are resolved. To overcome this problem in prior processors, new register locations called "rename registers" are allocated for every new result produced (i.e., for every instruction that loads data into a register) in a process called "register renaming". A similar technique is used to store the CC set by a speculatively executed instruction. One difficulty with this technique is that because the speculative CC is stored separately from the speculative result, the bookkeeping logic necessary to handle the results and CC sets with precision is cumbersome and can slow processor throughput.
Using register renaming, an instruction identifying the original register for the purpose of reading its value obtains instead the value of the newly allocated rename register. Thus, the hardware renames the original register identifier in the instruction to identify the new register and the correct stored value. The same register identifier in several different instructions may access different hardware registers depending on the locations of the register references with respect to the register assignments. Although widely used, register renaming requires use of a tracking table having entries for each register in the processor indicating, among other things, the instruction identification and the particular instruction assigned to that register. This method of register renaming becomes unwieldy for larger designs with hundreds or thousands of registers. Also, because tracking tables become slower to access as they increase in size, large tracking tables may become a clock frequency limitation.
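The renaming behavior described above can be sketched, by way of example only, as a small mapping table. This is a simplified illustration under stated assumptions: the class RenameTable, its fields, and the register counts are hypothetical, the table tracks only the current mapping of each architectural register, and no reclamation of rename registers is modeled.

```python
class RenameTable:
    """Toy rename table (hypothetical; not the prior-art hardware itself)."""

    def __init__(self, num_arch_regs, num_phys_regs):
        # Assumption: architectural register r starts out mapped to
        # physical register r; the remaining registers are free.
        self.map = list(range(num_arch_regs))
        self.free = list(range(num_arch_regs, num_phys_regs))

    def rename(self, dest, srcs):
        # Sources are looked up first so that an instruction reading and
        # writing the same register sees the older value.
        phys_srcs = [self.map[s] for s in srcs]
        if not self.free:
            raise RuntimeError("no free rename register")
        phys_dest = self.free.pop(0)
        self.map[dest] = phys_dest  # later readers of dest see the new register
        return phys_dest, phys_srcs


table = RenameTable(num_arch_regs=4, num_phys_regs=8)
first = table.rename(dest=1, srcs=[2, 3])   # r1 = r2 op r3
second = table.rename(dest=2, srcs=[1, 1])  # r2 = r1 op r1, reads the renamed r1
third = table.rename(dest=1, srcs=[3, 3])   # r1 = r3 op r3, gets a fresh register
print(first, second, third)
```

The identifier r1 refers to a different hardware register in the first and third instructions, which is the effect the paragraph above describes.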
When an error occurs in the execution of a microinstruction an "exception" is generated. Typical exceptions include "faults", "traps" and "interrupts". These events cause updates of the macroarchitectural or microarchitectural state of the processor in response to the condition detected by invoking software or hardware instruction routines called "exception handlers". Exception handling is complicated in a multiple pipeline machine. Exceptions may be handled in either a precise or imprecise manner. Precise exception handling allows the programmer to know exactly where an error occurred and continue processing without having to abort the program because the appearance of sequential execution of instructions is preserved. In contrast, imprecise exception handling provides minimal information to the programmer, none of which is guaranteed to be correct, and may require aborting execution of the program. Thus, in most applications, precise exception handling is preferred.
SUMMARY OF THE INVENTION
The present invention involves a processor including at least one execution unit generating out-of-order results and out-of-order condition codes. Precise architectural state of the processor is maintained by providing a results buffer having a number of slots and providing a condition code buffer having the same number of slots as the results buffer, each slot in the condition code buffer in one-to-one correspondence with a slot in the results buffer. Each live instruction in the processor is assigned a slot in the results buffer and the condition code buffer. Each speculative result produced by the execution units is stored in the assigned slot in the results buffer. When an instruction is retired, the results for that instruction are transferred to an architectural result register and any condition codes generated by that instruction are transferred to an architectural condition code register.
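The arrangement summarized above, in which each slot of the condition code buffer corresponds one-to-one with a slot of the results buffer, may be modeled in software as follows. This is a hedged sketch for illustration, not the claimed apparatus: the class PairedBuffers, its method names, the free list, and the use of a dictionary for architectural registers are all assumptions introduced here.

```python
class PairedBuffers:
    """Toy model of a results buffer with a matching condition code buffer."""

    def __init__(self, num_slots):
        self.results = [None] * num_slots  # speculative results
        self.cc = [None] * num_slots       # speculative CCs; cc[i] pairs with results[i]
        self.dest = [None] * num_slots     # hypothetical: destination register per slot
        self.free = list(range(num_slots))
        self.arch_regs = {}                # architectural result registers
        self.arch_ccr = None               # architectural condition code register

    def allocate(self, dest_reg):
        # One index names the slot in both buffers for a live instruction.
        slot = self.free.pop(0)
        self.dest[slot] = dest_reg
        return slot

    def complete(self, slot, result, cc=None):
        # Execution may finish in any order; architectural state is untouched.
        self.results[slot] = result
        self.cc[slot] = cc

    def retire(self, slot):
        # At retirement the result, and any CC the instruction generated,
        # are transferred to architectural state and the slot is released.
        self.arch_regs[self.dest[slot]] = self.results[slot]
        if self.cc[slot] is not None:
            self.arch_ccr = self.cc[slot]
        self.results[slot] = self.cc[slot] = self.dest[slot] = None
        self.free.append(slot)


bufs = PairedBuffers(num_slots=4)
s0 = bufs.allocate("r1")          # older instruction, sets a condition code
s1 = bufs.allocate("r2")          # younger instruction, sets no condition code
bufs.complete(s1, 7)              # the younger instruction finishes first
bufs.complete(s0, 0, cc="Z")
before_retire = dict(bufs.arch_regs)
bufs.retire(s0)                   # retirement proceeds in program order
bufs.retire(s1)
print(before_retire, bufs.arch_regs, bufs.arch_ccr)
```

Because a single slot index addresses both buffers in this sketch, no separate bookkeeping is needed to associate a speculative condition code with its speculative result.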
The present invention involves a system and apparatus for maintaining precise architectural state primarily through