US6981129B1 - Breaking replay dependency loops in a processor using a rescheduled replay queue - Google Patents

Breaking replay dependency loops in a processor using a rescheduled replay queue

Info

Publication number
US6981129B1
Authority
US
United States
Prior art keywords
instruction
instructions
processor
coupled
replay
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related, expires
Application number
US09/705,668
Inventor
Darrell D. Boggs
Douglas M. Carmean
Per H. Hammarlund
Francis X. McKeen
David J. Sager
Ronak Singhal
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Intel Corp
Original Assignee
Intel Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Intel Corp
Priority to US09/705,668 priority Critical patent/US6981129B1/en
Assigned to INTEL CORPORATION reassignment INTEL CORPORATION ASSIGNMENT OF ASSIGNORS INTEREST (SEE DOCUMENT FOR DETAILS). Assignors: SINGHAL, RONAK, BOGGS, DARRELL D., CARMEAN, DOUGLAS M., HAMMARLUND, PER H., MCKEEN, FRANCIS X., SAGER, DAVID J.
Priority to EP01986210A priority patent/EP1334426A2/en
Priority to PCT/US2001/050735 priority patent/WO2002039269A2/en
Priority to CNB018198961A priority patent/CN1294484C/en
Priority to AU2002236668A priority patent/AU2002236668A1/en
Application granted granted Critical
Publication of US6981129B1 publication Critical patent/US6981129B1/en
Adjusted expiration legal-status Critical
Expired - Fee Related legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30 Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/38 Concurrent instruction execution, e.g. pipeline, look ahead
    • G06F9/3836 Instruction issuing, e.g. dynamic instruction scheduling or out of order instruction execution
    • G06F9/3842 Speculative instruction execution
    • G06F9/3861 Recovery, e.g. branch miss-prediction, exception handling

Definitions

  • the invention generally relates to processors, and in particular to processors having a replay system for data speculation.
  • the second instruction will not be executed until the first instruction has properly executed.
  • the second instruction may not be dispatched to the processor until a signal is received stating that the first instruction has completed execution.
  • dependent instructions may be selectively chosen and speculatively executed in an effort to anticipate the results needed to increase the throughput of the processor and decrease overall execution time.
  • Data speculation may involve speculating that data retrieved from a cache memory is valid. Processing proceeds on the assumption that data retrieved from the cache is good. However, when the data in the cache is invalid, the results of the execution of the speculatively executed instructions are disregarded, and the processor backs up to re-execute the instruction that was executed. Stated another way, data speculation assumes that data in a cache memory are correct, that is, that the cache memory contains the result from those instructions on which the present instruction is dependent.
  • Data speculation may involve speculating that data from the execution of an instruction on which the present instruction is dependent will be stored in a location in cache memory such that the data in the cache memory will be valid by the time the instruction attempts to access the location in cache memory.
  • the dependent instruction is dependent on the result of a load of the result of the instruction on which it is dependent. When the load misses the cache, the dependent instruction must be re-executed.
  • instruction 2 may be executed in parallel such that instruction 2 speculatively accesses the value stored in a location in cache memory where the result of instruction 1 will be stored. In this way, instruction 2 executes assuming a cache hit. If the value of the contents of the cache memory is valid, then the execution of instruction 1 has completed. If instruction 2 is successfully speculatively executed in advance of the completion of instruction 1, then, rather than execute instruction 2 at the completion of instruction 1, a simple and quick check may be made to confirm that the speculative execution was successful. In this way, processors increase their execution speed and throughput by speculatively executing instructions in advance, based on the assumption that needed data will be available in cache memory.
  • FIG. 1 is a diagram illustrating a portion of a processor according to an embodiment of the present invention.
  • FIG. 2 is a block diagram illustrating a processor including an embodiment of the present invention.
  • FIG. 3 is a flow chart illustrating a method of instruction processing according to an embodiment of the present invention.
  • a processor that speculatively schedules instructions for execution and allows for replay of unsuccessfully executed instructions. Speculative scheduling allows the scheduling latency for instructions to be reduced.
  • the replay system re-executes instructions that were not successfully executed when they were originally dispatched to an execution unit. An instruction is considered not successfully executed when it is executed with bad input data, or when its output is bad due to a cache miss, etc. For example, a memory load instruction may not execute properly if there is a cache miss during execution, thereby requiring the instruction to be re-executed. In addition, all instructions dependent thereon must also be replayed.
  • a challenging aspect of such a replay system is the possibility for long latency instructions to circulate through the replay system and re-execute many times before executing successfully.
  • long latency instructions in certain circumstances, several conditions occur which must be resolved serially. When this occurs, each condition results in extra replays for all dependent instructions. This results in several instructions incurring several sequential cache misses.
  • This condition occurs when a chain of dependent instructions is replaying. In the chain, several instructions may each take a cache miss. This results in a cascading set of replays. Each instruction must replay an extra time for each miss in the dependency path. For example, a long latency instruction and instructions dependent on the long latency instruction may be re-executed multiple times until a cache hit occurs.
  • An example of a long latency instruction is a memory load instruction in which there is an L0 cache miss and an L1 cache miss (i.e., on-chip cache misses) on a first attempt at executing an instruction.
  • the execution unit may then retrieve the data from an external memory device across an external bus. This retrieval may be very time consuming, requiring several hundred clock cycles. Any unnecessary and repeated re-execution of this long latency load instruction before its source data has become available wastes valuable execution resources, prevents other instructions from executing, and increases overall processor latency.
  • a replay queue for storing the instructions. After unsuccessful execution of an instruction, the instruction, and instructions dependent thereon that have also unsuccessfully executed, are stored in the replay queue until the data the instruction requires returns, that is, the cache memory location for the data is valid. At this time, the instruction is considered ready for execution. For example, when the data for a memory load instruction returns from external memory, the memory load instruction may then be scheduled for execution, and any dependent instructions may then be scheduled after execution of the memory load instruction has completed.
  • FIG. 1 is a block diagram illustrating a portion of a processor according to an embodiment of the present invention.
  • Allocator/renamer 10 receives instructions from a front end (not shown). After allocating system resources for the instructions, the instructions are passed to replay queue 20 .
  • Replay queue 20 stores instructions in program order. Replay queue 20 may mark instructions as safe or unsafe based on whether the data required by the instruction is available. Instructions are then sent to scheduler 30 . Although only one scheduler 30 is depicted so as to simplify the description of the invention, multiple schedulers may be coupled to replay queue 20 .
  • Scheduler 30 may re-order the instructions to execute the instructions out of program order to achieve efficiencies inherent with data speculation.
  • scheduler 30 may include counter 32 .
  • Counter 32 may be used to maintain a counter for each instruction representing the number of times the instruction has been executed or replayed.
  • a counter may be included with replay queue 20 .
  • a counter may be paired with and travel with the instruction.
  • an on-chip memory device may be dedicated to keep track of the replay status of all pending instructions. In this embodiment, the on-chip counter may be accessed by replay queue 20 and/or scheduler 30 and/or checker 60 .
  • scheduler 30 uses the data dependencies of instructions and the latencies of instructions in determining the order of execution of instructions. That is, based on a combination of the data dependencies of various instructions and the anticipated execution time of instructions, that is, the latency of the instructions, the scheduler determines the order of execution of instructions. Scheduler 30 may refer to counter 32 to determine whether a maximum number of executions or replays has already occurred. In this way, excessive replays of instructions that are unsafe for replaying can be avoided, thus freeing up system resources for the execution of other instructions. Scheduler 30 passes instructions to execution unit 40 . Although only one execution unit 40 is depicted so as to simplify the description of the invention, multiple execution units may be coupled to multiple schedulers.
  • the scheduler may stagger instructions among a plurality of execution units such that execution of the instructions is performed in parallel, and out of order, in an attempt to match the data dependencies of instructions with the expected completion of other instructions, executing in parallel, on which the instruction is dependent.
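  • As an illustration of scheduling by data dependencies and expected latencies, the following is a minimal sketch, not the patented scheduler. The function name, the tuple layout, and the assumption of unlimited execution units are all hypothetical. Each instruction is dispatched at the cycle when its producers are expected to complete, without checking that their data is actually valid.

```python
def earliest_dispatch_cycles(instructions):
    """Return the earliest speculative dispatch cycle for each instruction.

    `instructions` is a list of (name, expected_latency, producers) tuples in
    program order, where `producers` names earlier instructions whose results
    are needed. This layout is an assumption made for illustration only.
    """
    result_ready = {}    # cycle at which each result is *expected* to be ready
    dispatch_cycle = {}
    for name, latency, producers in instructions:
        # Dispatch as soon as every producer is expected to have completed.
        start = max((result_ready[p] for p in producers), default=0)
        dispatch_cycle[name] = start
        result_ready[name] = start + latency
    return dispatch_cycle


program = [
    ("load", 3, []),
    ("add", 1, ["load"]),
    ("mul", 4, []),
    ("sub", 1, ["add", "mul"]),
]
cycles = earliest_dispatch_cycles(program)
```

  • In this sketch "mul" is dispatched alongside "load" even though it appears later in program order. If "load" takes longer than expected, for example because of a cache miss, "add" and "sub" would execute with bad data and need to be replayed.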
  • a staging queue 50 may also be used to send instruction information in parallel with execution unit 40 in an unexecuted, delayed form. Although a staging queue is depicted, some form of staging is required and any other staging method may be used.
  • checker 60 determines whether execution was successful. In one embodiment, this may be achieved by analyzing the data dependency of the instruction, and whether a cache hit or cache miss occurred at the cache location of the needed data. More specifically, the checker checks whether replay is necessary by checking the input registers of the instruction to determine whether the data contained therein was valid. To determine whether replay is necessary, the checker may check the condition of the result of the execution to determine whether a cache hit occurred. Other conditions may also generate replays. For example, two instructions needing the same hardware resource at the same time may cause one of the instructions to replay, as two instructions may not access the resource simultaneously.
  • Retire unit 70 retires the instruction.
  • Retire unit 70 may be coupled to and communicate with allocator/renamer 10 to de-allocate the resources used by the instruction.
  • Retire unit 70 may also be coupled to and communicate with replay queue 20 to signal that the instruction has been successfully executed and should be removed from the replay queue. If the instruction did not execute successfully, checker 60 sends the instruction back to replay queue 20, and the execution cycle, also referred to herein as the replay loop, begins again for the instruction.
  • replay queue 20 returns instructions to scheduler 30 in program order.
  • replay queue 20 may be thought of as a rescheduled replay queue, as the instructions are passed back to scheduler 30 for rescheduling and re-execution.
  • replay queue 20 maintains the instructions in program order.
  • replay queue 20 may maintain a set of bits for each instruction that is not retired by checker 60 .
  • replay queue 20 may maintain a replay safe bit, a valid bit, an in-flight bit and a ready bit.
  • the replay queue sets the replay safe bit to one (1) or true when the instruction passes the checker, such that the instruction has completed execution successfully.
  • the replay queue sets the valid bit to one (1) or true when the instruction has been loaded and is in the replay queue.
  • the replay queue sets the in-flight bit to one (1) or true when the instruction is in an execution unit, that is, when the instruction is being executed.
  • the replay queue sets the ready bit to one (1) or true when the inputs or sources needed for the instruction to execute are known to be ready.
  • the ready bit may also be referred to as a source valid bit because it is set to one (1) or true when the sources for the instruction are valid.
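  • The four status bits described above can be pictured as a small per-instruction record. The following is a minimal sketch only. The class name, field names, `seq_num` field, and helper functions are hypothetical, and the transitions shown are one plausible reading of the text rather than the actual hardware logic.

```python
from dataclasses import dataclass


@dataclass
class ReplayQueueEntry:
    """Hypothetical per-instruction status bits kept by the replay queue."""
    seq_num: int               # assumed instruction identifier
    valid: bool = False        # instruction is loaded in the replay queue
    in_flight: bool = False    # instruction is in an execution unit
    ready: bool = False        # sources are known to be ready ("source valid")
    replay_safe: bool = False  # instruction passed the checker


def load_entry(entry):
    """Instruction enters the replay queue."""
    entry.valid = True


def dispatch_entry(entry):
    """Instruction is sent on for execution."""
    entry.in_flight = True


def checker_result(entry, passed):
    """Checker reports back: clear in-flight and record the outcome."""
    entry.in_flight = False
    entry.replay_safe = passed


entry = ReplayQueueEntry(seq_num=7)
load_entry(entry)
dispatch_entry(entry)
state_while_executing = (entry.valid, entry.in_flight, entry.replay_safe)
checker_result(entry, passed=True)
state_after_check = (entry.valid, entry.in_flight, entry.replay_safe)
```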
  • FIG. 2 is a block diagram illustrating a processor including an embodiment of the present invention.
  • Processor 2 includes front end 4 which may include several units, such as an instruction fetch unit, an instruction decoder for decoding instructions, that is, for decoding complex instructions into one or more micro-operations or μops, and an instruction queue (IQ) for temporarily storing instructions.
  • the instructions stored in the instruction queue may be μops.
  • other types of instructions may be used.
  • the instructions provided by the front end may originate as assembly language or machine language instructions which may, in some embodiments, be decoded from macro-operations into μops. These μops may be thought of as a machine language of the micro-architecture of a processor.
  • μops that, in one embodiment, are passed by the front end.
  • RISC reduced instruction set computer
  • no decoding will be required as there is a one-to-one correspondence between the RISC assembly language and the μops of the processor.
  • instructions refers to any macro or micro operations, μops, assembly language instructions, machine language instructions, and the like.
  • allocator/renamer 10 may receive instructions in the form of μops from front end 4.
  • each instruction may include the instruction and up to two logical sources and one logical destination.
  • the sources and destination are logical registers (not shown) within processor 2 .
  • a register alias table (RAT) may be used to map logical registers to physical registers for the sources and the destination.
  • Physical sources are the actual internal memory addresses of memory on the chip dedicated to serve as registers.
  • Logical registers are the registers defined by the processor architecture that may be recognized by persons writing assembly language code. For example, according to the Intel Architecture known as IA-32, logical registers include EAX and EBX. Such logical registers may be mapped to physical registers.
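  • The mapping of logical registers to physical registers through a register alias table can be sketched as follows. This is an illustrative model only. The class name, the free-list policy, and the physical register numbering are assumptions, and de-allocation at retirement is omitted.

```python
class RegisterAliasTable:
    """Hypothetical RAT: maps logical register names to physical indices."""

    def __init__(self, logical_regs, num_physical):
        if num_physical < len(logical_regs):
            raise ValueError("need at least one physical register per logical register")
        self.free = list(range(num_physical))
        # Give every logical register an initial physical register.
        self.mapping = {name: self.free.pop(0) for name in logical_regs}

    def rename(self, sources, dest):
        """Rename one instruction with up to two logical sources and one destination."""
        if len(sources) > 2:
            raise ValueError("at most two logical sources per instruction")
        if not self.free:
            raise RuntimeError("no free physical register")
        # Sources read the current mapping; the destination gets a fresh register.
        phys_sources = [self.mapping[s] for s in sources]
        phys_dest = self.free.pop(0)
        self.mapping[dest] = phys_dest
        return phys_sources, phys_dest


rat = RegisterAliasTable(["EAX", "EBX"], num_physical=8)
# Models an instruction that reads EAX and EBX and writes EAX.
renamed = rat.rename(["EAX", "EBX"], "EAX")
```

  • Allocating a fresh physical register for each destination means a later write to EAX does not overwrite the value that earlier, still pending instructions expect to read.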
  • Allocator/renamer 10 is coupled to replay queue 20 .
  • Replay queue 20 is coupled to scheduler 30 .
  • Scheduler 30 dispatches instructions received from the replay queue 20 to be executed. Instructions may be dispatched when the resources, namely physical registers, are marked valid to execute the instructions, and when instructions are determined to be good candidates to execute speculatively. That is, scheduler 30 may dispatch an instruction without first determining whether data needed by the instruction is valid or available. More specifically, the scheduler dispatches speculatively based on the assumption that needed data is available in the cache memory.
  • Scheduler 30 outputs instructions to execution unit 40 .
  • execution unit 40 executes received instructions.
  • Execution unit 40 may be comprised of an arithmetic logic unit (ALU), a floating point unit (FPU), a memory unit for performing memory loads (memory data reads) and stores (memory data writes), etc.
  • Execution unit 40 may be coupled to multiple levels of memory devices from which data may be retrieved and to which data may be stored.
  • execution unit 40 is coupled to L0 cache system 44 , and L1 cache system 46 , and external memory devices via memory request controller 42 .
  • the term cache system includes all cache related components, including cache memory and hit/miss logic that determines whether requested data is found in the cache memory.
  • L0 cache system 44 is the fastest memory device and may be located on the same semiconductor die as execution unit 40 . As such, data can be retrieved from and written to L0 cache very quickly.
  • L0 cache system 44 and L1 cache system 46 are located on the die of processor 2, while L2 cache system 84 is located off the die of processor 2.
  • an L2 cache system may be included on the die adjacent to the L1 cache system and coupled to the execution unit via a memory request controller. In such an embodiment, an L3 cache system may be located off the die of the processor.
  • execution unit 40 may attempt to retrieve needed data from additional levels of memory devices. Such requests may be made through memory request controller 42 . After L0 cache system 44 is checked, the next level of memory devices is L1 cache system 46 . If the data needed is not found in L1 cache system 46 , execution unit 40 may be forced to retrieve the needed data from the next level of memory devices, which, in one embodiment, may be external memory devices coupled to processor 2 via external bus 82 . An external bus interface 48 may be coupled to memory request controller 42 and external bus 82 .
  • external memory devices may include some, all of, and/or multiple instances of L2 cache system 84 , main memory 86 , disk memory 88 , and other storage devices, all of which may be coupled to external bus 82 .
  • main memory 86 comprises dynamic random access memory (DRAM).
  • Disk memory 88, main memory 86 and L2 cache system 84 are considered external memory devices because they are external to the processor and are coupled to the processor via an external bus. Access to main memory 86 and disk memory 88 is substantially slower than access to L2 cache system 84. Access to all external memory devices is much slower than access to the on-die cache memory systems.
  • a computer system may include a first external bus dedicated to an L2 cache system and a second external bus used by all other external memory devices.
  • processor 2 may include one, two, three or more levels of on-die cache memory systems.
  • execution unit 40 may attempt to load the data from each of the memory devices from fastest to slowest.
  • the fastest level of memory devices, L0 cache system 44, is checked first, followed by L1 cache system 46, L2 cache system 84, main memory 86, and disk memory 88.
  • the time to load memory increases as each additional memory level is accessed.
  • the data retrieved by execution unit 40 is stored in the fastest available memory device to allow for future access. In one embodiment, this may be L0 cache system 44.
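  • The fastest-to-slowest search and the fill of the fastest level can be sketched as below. This is a simplified model under stated assumptions. The function name, the tuple layout, and the latency figures are hypothetical, and real hardware would overlap or pipeline these lookups rather than simply sum their delays.

```python
def load_from_hierarchy(address, levels):
    """Search memory levels from fastest to slowest.

    `levels` is a list of (name, latency_cycles, contents) tuples ordered from
    fastest to slowest, where `contents` is a dict from address to value.
    Returns (value, level_name, total_latency). Latencies are illustrative.
    """
    total_latency = 0
    for name, latency, contents in levels:
        total_latency += latency      # each level checked adds to the load time
        if address in contents:
            value = contents[address]
            levels[0][2][address] = value  # keep a copy in the fastest level
            return value, name, total_latency
    raise KeyError(f"address {address:#x} not found in any memory level")


hierarchy = [
    ("L0 cache", 2, {}),
    ("L1 cache", 8, {}),
    ("main memory", 200, {0x40: 99}),
]
first_access = load_from_hierarchy(0x40, hierarchy)
second_access = load_from_hierarchy(0x40, hierarchy)
```

  • The second access hits in the fastest level, which is why a replayed load can succeed quickly once its data has returned.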
  • Processor 2 further includes a replay mechanism implemented via checker 60 and replay queue 20 .
  • Checker 60 is coupled to receive input from execution unit 40 and is coupled to provide output to replay queue 20 .
  • This replay mechanism provides that instructions that were not executed successfully may be re-executed or replayed.
  • staging queue 50 may be coupled between scheduler 30 and checker 60 , in parallel with execution unit 40 .
  • staging queue 50 may delay instructions for a fixed number of clock cycles so that the instruction in the execution unit and its corresponding result in the staging queue may enter the checker at the same moment in time.
  • the number of stages in staging queue 50 may vary based on the amount of staging or delay desired in each execution channel.
  • a copy of each dispatched instruction may be staged through staging queue 50 in parallel to being executed through execution unit 40. In this manner, a copy of the instruction maintained in staging queue 50 is provided to checker 60. This copy of the instruction may then be routed back to replay queue 20 by checker 60 for re-execution if the instruction did not execute successfully.
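  • The fixed delay provided by the staging queue can be modeled as a simple shift register. The sketch below is illustrative only. The class name, the `tick` method, and the choice of three stages are assumptions.

```python
from collections import deque


class StagingQueue:
    """Hypothetical fixed-delay queue: an entry emerges `stages` ticks after entering."""

    def __init__(self, stages):
        if stages < 1:
            raise ValueError("at least one stage is required")
        # A bounded deque drops its oldest slot whenever a new one is appended.
        self.slots = deque([None] * stages, maxlen=stages)

    def tick(self, instruction=None):
        """Advance one clock cycle and return the entry leaving the queue, if any."""
        leaving = self.slots[0]
        self.slots.append(instruction)
        return leaving


queue = StagingQueue(stages=3)
outputs = [queue.tick("A"), queue.tick("B"), queue.tick(), queue.tick(), queue.tick()]
```

  • Choosing the number of stages to match the execution latency lets the unexecuted copy and the execution result reach the checker in the same cycle.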
  • Checker 60 receives instructions output from staging queue 50 and execution unit 40, and determines which instructions have executed successfully and which have not. If an instruction has executed successfully, checker 60 marks the instruction as completed. Completed instructions are forwarded to retire unit 70, which is coupled to checker 60. Retire unit 70 restores the instructions to their original program order and retires them. In addition, retire unit 70 is coupled to allocator/renamer 10. When an instruction is retired, retire unit 70 instructs allocator/renamer 10 to de-allocate the resources that were used by the retired instruction. In addition, retire unit 70 may be coupled to and communicate with replay queue 20 so that upon retirement, a signal is sent from retire unit 70 to replay queue 20 such that all data maintained by replay queue 20 for the instruction are de-allocated.
  • Execution of an instruction may be considered unsuccessful for multiple reasons. The most common reasons are an unfulfilled source dependency and an external replay condition.
  • An unfulfilled source dependency can occur when a source of a current instruction is dependent on the result of another instruction which has not yet completed successfully. This data dependency may cause the current instruction to execute unsuccessfully if the correct data for the source is not available at execution time, that is, the result of a predecessor instruction is not available as source data at execution time resulting in a cache miss.
  • checker 60 may maintain a table known as scoreboard 62 to track the readiness of sources such as registers. Scoreboard 62 may be used by checker 60 to keep track of whether the source data was valid or correct prior to execution of an instruction. After an instruction has been executed, checker 60 may use scoreboard 62 to determine whether data sources for the instruction were valid. If the sources were not valid at execution time, this may indicate to checker 60 that the instruction did not execute successfully due to an unfulfilled data dependency, and the instruction should therefore be replayed.
  • External replay conditions may include a cache miss (e.g., source data was not found in the L0 cache system at time of execution), incorrect forwarding of data (e.g., from a store buffer to a load), hidden memory dependencies, a write back conflict, an unknown data address, serializing instructions, etc.
  • L0 cache system 44 and L1 cache system 46 may be coupled to checker 60 .
  • L0 cache system 44 may generate an L0 cache miss signal to checker 60 when there is a cache miss at the L0 cache system. Such a cache miss indicates that the source data for the instruction was not found in L0 cache system 44 .
  • similar information and/or signals may be similarly generated to checker 60 to indicate the occurrence of a cache miss at L1 cache system 46 and external replay conditions, such as from any external memory devices, including L2 cache, main memory, disk memory, etc. In this way, checker 60 may determine whether each instruction has executed successfully.
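  • The checker's decision can be summarized as a small predicate over the scoreboard and the replay signals. This is a minimal sketch. The function name, the dictionary used as a scoreboard, and the two boolean signals are hypothetical simplifications of the conditions described above.

```python
def needs_replay(sources, scoreboard, l0_miss=False, other_replay_condition=False):
    """Return True if the instruction did not execute successfully.

    `scoreboard` maps a source register name to True when its data was valid
    at execution time. A source missing from the scoreboard is treated as not
    valid, which is an assumption of this sketch.
    """
    sources_valid = all(scoreboard.get(reg, False) for reg in sources)
    return (not sources_valid) or l0_miss or other_replay_condition


scoreboard = {"p1": True, "p2": False}
clean = needs_replay(["p1"], scoreboard)
bad_source = needs_replay(["p1", "p2"], scoreboard)
cache_miss = needs_replay(["p1"], scoreboard, l0_miss=True)
```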
  • When checker 60 determines that the instruction has not executed successfully, checker 60 signals replay queue 20 that the instruction must be replayed, that is, re-executed. More specifically, checker 60 sends a replay needed signal to replay queue 20 that also identifies the instruction by an instruction sequence number. Instructions that do not execute successfully may be signaled to replay queue 20 such that all instructions, regardless of the type of instruction or the specific circumstances under which the instruction failed to execute successfully, will be signaled to replay queue 20 and replayed. Such unconditional replaying works well for instructions with short latencies which require only one or a small number of passes or replay iterations between checker 60 and replay queue 20.
  • checker 60 determines that the instruction is safe for replay and signals replay queue 20 that the instruction needs to be replayed. That is, because the source data was not available or correct at the time of execution, but that the data sources are now available and ready, checker 60 determines that the instruction is safe for replay and signals replay queue 20 that the instruction is replay safe.
  • the replay queue may mark a replay safe bit as true and clear an in-flight bit for the instruction.
  • checker 60 may signal replay queue 20 with a replay safe bit paired with an instruction identifier.
  • the signals may be replaced with the checker sending actual instructions and accompanying information to the replay queue. If, at the time execution of the instruction is completed, the sources were not and still are not available, the replay safe bit and the in-flight bit may be cleared, that is, set to false.
  • checker 60 may determine whether the instruction requires a relatively long period of time to execute (i.e., a long latency instruction), requiring several replays before executing properly.
  • There are many examples of long latency instructions.
  • One example is a divide instruction which may require many clock cycles to execute.
  • Another example of a long latency instruction is a memory load or store instruction involving multiple levels of cache system misses, such as an L0 cache system miss and an L1 cache system miss.
  • an external bus request may be required to retrieve the data for the instruction. If access across an external bus is required to retrieve the desired data, the access delay is substantially increased.
  • a memory request controller may be required to arbitrate for ownership of an external bus, issue a bus transaction (memory read) to the external bus, and then await return of the data from one of the external memory devices.
  • an instruction may circulate an inordinate number of times, anywhere from tens to hundreds of iterations, from replay queue 20 to scheduler 30 to execution unit 40 to checker 60 and back again.
  • When a long latency instruction is replayed before the source data has returned, the instruction unnecessarily occupies a slot in the execution pipeline and uses execution resources which could have been allocated to other instructions which are ready to execute and may execute successfully.
  • an instruction for one of the pixels may be a long latency instruction, e.g., requiring a memory access to an external memory device.
  • non-dependent instructions for other pixels may be precluded from execution.
  • execution slots and resources become available, and the instructions for the other pixels may then be executed.
  • long latency instructions and instructions dependent thereon are kept in the replay queue until the sources for the long latency instruction are available. The instructions are then released.
  • the long latency instruction and its dependent instructions may then be sent by replay queue 20 to scheduler 30 for replay.
  • this may be accomplished by setting the valid bit to true and clearing, or setting to false or zero, the other bits, such as the in-flight bit, the replay safe bit and the ready bit, for each of the long latency instruction and instructions dependent thereon.
  • the ready bit for the long latency instruction may be set to true, thus causing the replay queue to release the long-latency instruction and instructions dependent thereon to the scheduler. In this manner, long latency instructions will not unnecessarily delay execution of other non-dependent instructions. Performance is improved, throughput is increased, and power consumption is decreased when the non-dependent instructions execute in parallel while the long latency instruction awaits return of its data.
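  • Holding a long latency instruction and its dependents until the data returns, then releasing them together, can be sketched as follows. This is an illustrative model only. The class, its methods, and the single `depends_on` link per instruction are assumptions, whereas the text above allows up to two sources.

```python
class HoldingReplayQueue:
    """Hypothetical replay queue that parks instructions in program order."""

    def __init__(self):
        self.entries = []  # list of (seq_num, depends_on) in program order

    def hold(self, seq_num, depends_on=None):
        """Park an instruction; `depends_on` names its producer, if any."""
        self.entries.append((seq_num, depends_on))

    def data_returned(self, seq_num):
        """Release the named instruction and everything that depends on it."""
        released = []
        now_ready = {seq_num}
        remaining = []
        for entry_seq, producer in self.entries:
            # Program order guarantees a producer is seen before its dependents.
            if entry_seq == seq_num or producer in now_ready:
                now_ready.add(entry_seq)
                released.append(entry_seq)
            else:
                remaining.append((entry_seq, producer))
        self.entries = remaining
        return released


rq = HoldingReplayQueue()
rq.hold(1)                 # long latency load
rq.hold(2, depends_on=1)   # uses the load result
rq.hold(3, depends_on=2)   # uses the result of instruction 2
rq.hold(4)                 # unrelated instruction still waiting on its own data
released = rq.data_returned(1)
```

  • While instructions 1 through 3 are parked, they consume no execution slots, so non-dependent instructions can use the execution units in the meantime.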
  • a long latency instruction may be identified and loaded into replay queue 20 , and one or more additional instructions, that is, instructions which may be dependent upon the long latency instruction, may also be loaded into replay queue 20 .
  • When the condition causing the long latency instruction to not complete successfully is cleared, such as, for example, when data returns from an external bus after a cache miss or after completion of a division or multiplication operation or completion of another long latency instruction, replay queue 20 transfers the instructions to scheduler 30 so that the long latency instruction and the additional instructions may then be replayed, that is, re-executed.
  • a counter may also be used.
  • a counter may be combined with the instruction and related information as it is passed from replay queue to scheduler to execution unit to checker within the processor.
  • a counter may be included with the replay queue.
  • an on-chip memory device may be dedicated to keep track of the replay status of all pending instructions and may be accessible by the scheduler, the replay queue, and/or the checker.
  • the counter may be used to maintain the number of times an instruction has been executed or replayed.
  • the counter for the instruction may automatically be incremented. The counter is used to break replay dependency loops and to alleviate unnecessary execution of instructions that cannot yet be successfully executed.
  • the scheduler checks whether the counter for the instruction has exceeded a machine specified maximum number of replays. If the counter exceeds the maximum, the instruction is not scheduled to be executed until the data required by the instruction is available. When the data is available, the instruction is deemed safe for execution. According to this method, no instruction can loop through the processor more than the machine specified maximum number of iterations. In this way the replay loop is broken when the machine specified maximum number of iterations has been exceeded.
  • FIG. 3 is a flow chart illustrating a method of instruction processing according to an embodiment of the present invention.
  • a plurality of instructions are received, as shown in block 110 .
  • System resources are then allocated for use with the execution of the instructions, including renaming of resources, as shown in block 112 .
  • the instructions are then placed in a queue, as shown in block 114 .
  • Execution of the instructions is then scheduled based on data dependencies of the instructions and expected latencies of the instructions, as shown in block 116 .
  • a check is then made to determine whether a counter for the instruction is set to zero, signifying that it will be the first time that the instruction will be executed, as shown in block 118 . If the counter is set to zero, the instruction is then executed as shown in block 124 .
  • if the instruction is being executed for the first time, if the instruction is safe to be executed, or if the maximum number of replays of the instruction has not been exceeded, an attempt is made at executing the instruction, as shown in block 124 . In the situations where the instruction is being executed for the first time and when the maximum number of replays has not been exceeded, the instruction is executed whether or not the data for it is available. In this way, the execution may be made speculatively. A check is then made to determine if the execution of the instruction was successful, as shown in block 126 . If execution of the instruction was not successful, the method continues at block 128 where the counter for the instruction is incremented.
  • a signal is then sent to the queue signifying that execution was unsuccessful and the instruction should be rescheduled for re-execution. If execution of the instruction is successful, as shown in block 126 , the instruction is retired, including placing the instructions in program order, de-allocating system resources used by the instruction, and removing the instruction and related information from the queue, as shown in block 130 .
  • the checks for data availability and exceeding the replay counter may be reversed.
  • blocks 120 and 122 may be executed in reverse order.
  • a check is made to learn whether the instruction is safe, and, if it is, the instruction is executed. If the instruction is not safe, a check is made to determine whether the maximum number of replays has been exceeded; if it has, the replay loop is broken, and flow proceeds to block 116 . If the maximum number of replays has not been exceeded, a speculative execution proceeds.
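The decision flow of blocks 118 through 130 can be sketched in software as follows. This is only an illustrative model, not the hardware described: the names `Instr`, `may_dispatch`, `process`, and the value of `MAX_REPLAYS` are assumptions introduced for the example, and the `execute` callback stands in for the execution unit and checker together.

```python
from dataclasses import dataclass

MAX_REPLAYS = 3  # hypothetical machine specified maximum number of replays


@dataclass
class Instr:
    name: str
    replay_count: int = 0     # incremented after each unsuccessful execution
    data_ready: bool = False  # True once the required data is available ("safe")


def may_dispatch(instr, max_replays=MAX_REPLAYS):
    """Decide whether an execution attempt is made (blocks 118 to 122)."""
    if instr.replay_count == 0:
        return True   # first execution: execute speculatively
    if instr.data_ready:
        return True   # the instruction is safe to execute
    # Speculate again only while the counter does not exceed the maximum.
    return instr.replay_count <= max_replays


def process(instr, execute, max_replays=MAX_REPLAYS):
    """One pass through the flow; returns 'retired', 'replay', or 'held'."""
    if not may_dispatch(instr, max_replays):
        return "held"            # replay loop broken; wait for the data
    if execute(instr):           # blocks 124 and 126
        return "retired"         # block 130
    instr.replay_count += 1      # block 128
    return "replay"
```

In this model an instruction whose data never arrives is attempted `MAX_REPLAYS + 1` times and is then held until `data_ready` becomes true, which mirrors the statement that the loop is broken once the maximum has been exceeded.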

Abstract

Breaking replay dependency loops in a processor using a rescheduled replay queue. The processor comprises a replay queue to receive a plurality of instructions, and an execution unit to execute the plurality of instructions. A scheduler is coupled between the replay queue and the execution unit. The scheduler speculatively schedules instructions for execution and increments a counter for each of the plurality of instructions to reflect the number of times each of the plurality of instructions has been executed. The scheduler also dispatches each instruction to the execution unit either when the counter does not exceed a maximum number of replays or, if the counter exceeds the maximum number of replays, when the instruction is safe to execute. A checker is coupled to the execution unit to determine whether each instruction has executed successfully. The checker is also coupled to the replay queue to communicate to the replay queue each instruction that has not executed successfully.

Description

FIELD OF THE INVENTION
The invention generally relates to processors, and in particular to processors having a replay system for data speculation.
BACKGROUND
The primary function of most computer processors is to execute a stream of computer instructions that are retrieved from a storage device. Many processors are designed to fetch an instruction and execute that instruction before fetching the next instruction. With these processors, there is an assurance that any register or memory value that is modified or retrieved by a given instruction will be available to instructions following it. For example, consider the following set of instructions:
    • 1: Load memory-1 into register-X;
    • 2: Add register-X register-Y into register-Z;
    • 3: Add register-Y register-Z into register-W.
      The first instruction loads the content of memory-1 into register-X. The second instruction adds the content of register-X to the content of register-Y and stores the result in register-Z. The third instruction adds the content of register-Y to the content of register-Z and stores the result in register-W. In this set of instructions, instructions 2 and 3 are considered dependent instructions because their execution depends on the result of and prior execution of instruction 1. In other words, if register-X is not loaded with valid data in instruction 1 before instructions 2 and 3 are executed, instructions 2 and 3 will generate improper results.
With traditional fetch and execute processors, the second instruction will not be executed until the first instruction has properly executed. For example, the second instruction may not be dispatched to the processor until a signal is received stating that the first instruction has completed execution. Similarly, because the third instruction is dependent on the second instruction, the third instruction will not be dispatched until an indication that the second instruction has properly executed has been received. Therefore, according to the traditional fetch and execute method, this short sequence of instructions cannot be executed in less time than T=L1+L2+L3, where T represents time and L1, L2 and L3 represent the latency of each of the three instructions.
To increase the speed of processing, later, dependent instructions may be selectively chosen and speculatively executed in an effort to anticipate the results needed to increase the throughput of the processor and decrease overall execution time. Data speculation may involve speculating that data retrieved from a cache memory is valid. Processing proceeds on the assumption that data retrieved from the cache is good. However, when the data in the cache is invalid, the results of the execution of the speculatively executed instructions are disregarded, and the processor backs up to re-execute the instruction that was executed. Stated another way, data speculation assumes that data in a cache memory are correct, that is, that the cache memory contains the result from those instructions on which the present instruction is dependent. Data speculation may involve speculating that data from the execution of an instruction on which the present instruction is dependent will be stored in a location in cache memory such that the data in the cache memory will be valid by the time the instruction attempts to access the location in cache memory. The dependent instruction is dependent on the result of a load of the result of the instruction on which it is dependent. When the load misses the cache, the dependent instruction must be re-executed.
For example, while instruction 1 is executing, instruction 2 may be executed in parallel such that instruction 2 speculatively accesses the value stored in a location in cache memory where the result of instruction 1 will be stored. In this way, instruction 2 executes assuming a cache hit. If the value of the contents of the cache memory is valid, then the execution of instruction 1 has completed. If instruction 2 is successfully speculatively executed in advance of the completion of instruction 1, then, rather than execute instruction 2 at the completion of instruction 1, a simple and quick check may be made to confirm that the speculative execution was successful. In this way, processors increase their execution speed and throughput by speculatively executing instructions in advance, based on the assumption that needed data will be available in cache memory.
However, in some circumstances instructions are not successfully speculatively executed. In these situations, the speculatively executed instructions must be re-executed.
BRIEF DESCRIPTION OF THE DRAWINGS
FIG. 1 is a diagram illustrating a portion of a processor according to an embodiment of the present invention.
FIG. 2 is a block diagram illustrating a processor including an embodiment of the present invention.
FIG. 3 is a flow chart illustrating a method of instruction processing according to an embodiment of the present invention.
DETAILED DESCRIPTION
A. Introduction
According to an embodiment of the present invention, a processor is provided that speculatively schedules instructions for execution and allows for replay of unsuccessfully executed instructions. Speculative scheduling allows the scheduling latency for instructions to be reduced. The replay system re-executes instructions that were not successfully executed when they were originally dispatched to an execution unit. An instruction is considered not successfully executed when it is executed with bad input data, or when its output is bad due to a cache miss, etc. For example, a memory load instruction may not execute properly if there is a cache miss during execution, thereby requiring the instruction to be re-executed. In addition, all instructions dependent thereon must also be replayed.
A challenging aspect of such a replay system is the possibility for long latency instructions to circulate through the replay system and re-execute many times before executing successfully. With long latency instructions, in certain circumstances, several conditions occur which must be resolved serially. When this occurs, each condition results in extra replays for all dependent instructions. This results in several instructions incurring several sequential cache misses. This condition occurs when a chain of dependent instructions are replaying. In the chain, several instructions may each take a cache miss. This results in a cascading set of replays. Each instruction must replay an extra time for each miss in the dependency path. For example, a long latency instruction and instructions dependent on the long latency instruction may be re-executed multiple times until a cache hit occurs.
An example of a long latency instruction is a memory load instruction in which there is an L0 cache miss and an L1 cache miss (i.e., on-chip cache misses) on a first attempt at executing an instruction. As a result, the execution unit may then retrieve the data from an external memory device across an external bus. This retrieval may be very time consuming, requiring several hundred clock cycles. Any unnecessary and repeated re-execution of this long latency load instruction before its source data has become available wastes valuable execution resources, prevents other instructions from executing, and increases overall processor latency.
Therefore, according to an embodiment, a replay queue is provided for storing the instructions. After unsuccessful execution of an instruction, the instruction, and instructions dependent thereon that have also unsuccessfully executed, are stored in the replay queue until the data the instruction requires returns, that is, the cache memory location for the data is valid. At this time, the instruction is considered ready for execution. For example, when the data for a memory load instruction returns from external memory, the memory load instruction may then be scheduled for execution, and any dependent instructions may then be scheduled after execution of the memory load instruction has completed.
B. System Architecture
FIG. 1 is a block diagram illustrating a portion of a processor according to an embodiment of the present invention. Allocator/renamer 10 receives instructions from a front end (not shown). After allocating system resources for the instructions, the instructions are passed to replay queue 20. Replay queue 20 stores instructions in program order. Replay queue 20 may mark instructions as safe or unsafe based on whether the data required by the instruction is available. Instructions are then sent to scheduler 30. Although only one scheduler 30 is depicted so as to simplify the description of the invention, multiple schedulers may be coupled to replay queue 20. Scheduler 30 may re-order the instructions to execute the instructions out of program order to achieve efficiencies inherent with data speculation. To avoid excessive replay looping which decreases processor throughput, increases energy consumption and increases heat emitted, scheduler 30 may include counter 32. Counter 32 may be used to maintain a counter for each instruction representing the number of times the instruction has been executed or replayed. In one embodiment, a counter may be included with replay queue 20. In another embodiment, a counter may be paired with and travel with the instruction. In yet another embodiment, an on-chip memory device may be dedicated to keep track of the replay status of all pending instructions. In this embodiment, the on-chip counter may be accessed by replay queue 20 and/or scheduler 30 and/or checker 60.
In determining the order in which instructions will be sent to be executed, scheduler 30 uses the data dependencies of instructions and the latencies of instructions in determining the order of execution of instructions. That is, based on a combination of the data dependencies of various instructions and the anticipated execution time of instructions, that is, the latency of the instructions, the scheduler determines the order of execution of instructions. Scheduler 30 may refer to counter 32 to determine whether a maximum number of executions or replays has already occurred. In this way, excessive replays of instructions that are unsafe for replaying can be avoided, thus freeing up system resources for the execution of other instructions. Scheduler 30 passes instructions to execution unit 40. Although only one execution unit 40 is depicted so as to simplify the description of the invention, multiple execution units may be coupled to multiple schedulers. In one embodiment, the scheduler may stagger instructions among a plurality of execution units such that execution of the instructions is performed in parallel, and out of order, in an attempt to match the data dependencies of instructions with the expected completion of other parallelly executing instructions on which the instruction is dependent. In one embodiment, a staging queue 50 may also be used to send instruction information in parallel with execution unit 40 in an unexecuted, delayed form. Although a staging queue is depicted, some form of staging is required and any other staging method may be used.
After execution of the instruction, checker 60 determines whether execution was successful. In one embodiment, this may be achieved by analyzing the data dependency of the instruction, and whether a cache hit or cache miss occurred at the cache location of the needed data. More specifically, the checker checks whether replay is necessary by checking the input registers of the instruction to determine whether the data contained therein was valid. To determine whether replay is necessary, the checker may check the condition of the result of the execution to determine whether a cache hit occurred. Other conditions may also generate replays. For example, two instructions needing the same hardware resource at the same time may cause one of the instructions to replay, as two instructions may not access the resource simultaneously.
If the instruction executed successfully, the instruction and its result are sent to retire unit 70 to be retired. Retire unit 70 retires the instruction. Retire unit 70 may be coupled to and communicate with allocator/renamer 10 to de-allocate the resources used by the instruction. Retire unit 70 may also be coupled to and communicate with replay queue 20 to signal that the instruction has been successfully executed and should be removed from the replay queue. If the instruction did not execute successfully, checker 60 sends the instruction back to replay queue 20, and the execution cycle, also referred to herein as the replay loop, begins again for the instruction. In one embodiment, replay queue 20 returns instructions to scheduler 30 in program order. In one embodiment, replay queue 20 may be thought of as a rescheduled replay queue, as the instructions are passed back to scheduler 30 for rescheduling and re-execution. In one embodiment, replay queue 20 maintains the instructions in program order.
To determine when instructions should be passed to the scheduler, in one embodiment, replay queue 20 may maintain a set of bits for each instruction that is not retired by checker 60. In this embodiment, replay queue 20 may maintain a replay safe bit, a valid bit, an in-flight bit and a ready bit. The replay queue sets the replay safe bit to one (1) or true when the instruction passes the checker, such that the instruction has completed execution successfully. The replay queue sets the valid bit to one (1) or true when the instruction has been loaded and is in the replay queue. The replay queue sets the in-flight bit to one (1) or true when the instruction is in an execution unit, that is, when the instruction is being executed. The replay queue sets the ready bit to one (1) or true when the inputs or sources needed for the instruction to execute are known to be ready. The ready bit may also be referred to as a source valid bit because it is set to one (1) or true when the sources for the instruction are valid.
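The four per-instruction bits described above can be pictured as a small record kept for each queue entry. The following is a minimal sketch under stated assumptions: the class and method names (`ReplayQueueEntry`, `ReplayQueue`, `load`, `dispatch`, `mark_sources_ready`, `passed_checker`) and the use of a sequence number as the key are hypothetical and do not come from the description.

```python
from dataclasses import dataclass


@dataclass
class ReplayQueueEntry:
    seq: int                   # hypothetical instruction sequence number
    valid: bool = False        # instruction has been loaded into the replay queue
    in_flight: bool = False    # instruction is in an execution unit
    ready: bool = False        # sources are known to be ready ("source valid")
    replay_safe: bool = False  # instruction has passed the checker


class ReplayQueue:
    def __init__(self):
        # A dict keeps insertion order, which models program order here.
        self.entries = {}

    def load(self, seq):
        self.entries[seq] = ReplayQueueEntry(seq, valid=True)

    def dispatch(self, seq):
        self.entries[seq].in_flight = True

    def mark_sources_ready(self, seq):
        self.entries[seq].ready = True

    def passed_checker(self, seq):
        entry = self.entries[seq]
        entry.replay_safe = True
        entry.in_flight = False  # the instruction has left the execution unit
```

A hardware replay queue would hold these bits alongside the instruction itself; the dictionary is used here only to keep the example short.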
FIG. 2 is a block diagram illustrating a processor including an embodiment of the present invention. Processor 2 includes front end 4 which may include several units, such as an instruction fetch unit, an instruction decoder for decoding instructions, that is, for decoding complex instructions into one or more micro-operations or μops, and an instruction queue (IQ) for temporarily storing instructions. In one embodiment, the instructions stored in the instruction queue may be μops. In other embodiments, other types of instructions may be used. The instructions provided by the front end may originate as assembly language or machine language instructions which may, in some embodiments, be decoded from macro-operations into μops. These μops may be thought of as a machine language of the micro-architecture of a processor. It is these μops that, in one embodiment, are passed by the front end. In another embodiment, particularly in a reduced instruction set computer (RISC) processor chip, no decoding will be required as there is a one-to-one correspondence between the RISC assembly language and the μops of the processor. As set forth herein, the term instructions refers to any macro or micro operations, μops, assembly language instructions, machine language instructions, and the like.
In one embodiment, allocator/renamer 10 may receive instructions in the form of μops from front end 4. In one embodiment, each instruction may include the instruction and up to two logical sources and one logical destination. The sources and destination are logical registers (not shown) within processor 2. In one embodiment, a register alias table (RAT) may be used to map logical registers to physical registers for the sources and the destination. Physical sources are the actual internal memory addresses of memory on the chip dedicated to serve as registers. Logical registers are the registers defined by the processor architecture that may be recognized by persons writing assembly language code. For example, according to the Intel Architecture known as IA-32, logical registers include EAX and EBX. Such logical registers may be mapped to physical registers.
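The mapping of logical registers to physical registers through a register alias table can be illustrated with a short sketch. The class name `RegisterAliasTable`, its `rename` method, and the free-list allocation policy are assumptions made for this example; the description does not specify how physical registers are chosen or recycled.

```python
class RegisterAliasTable:
    """Toy register alias table: maps logical names to physical indices."""

    def __init__(self, num_physical):
        self.free = list(range(num_physical))  # unallocated physical registers
        self.mapping = {}                      # logical name -> physical index

    def _lookup(self, logical):
        # A source read before any write gets a physical register on first use.
        if logical not in self.mapping:
            self.mapping[logical] = self.free.pop(0)
        return self.mapping[logical]

    def rename(self, dest, sources):
        """Rename one instruction with up to two sources and one destination."""
        if len(sources) > 2:
            raise ValueError("at most two logical sources per instruction")
        physical_sources = [self._lookup(s) for s in sources]
        # The destination always receives a fresh physical register.
        physical_dest = self.free.pop(0)
        self.mapping[dest] = physical_dest
        return physical_dest, physical_sources
```

For instance, renaming a write to EAX, then an instruction that reads EAX and writes EBX, then one that reads both and writes EAX again yields three distinct physical destinations. Returning physical registers to the free list at retirement is omitted from the sketch.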
Allocator/renamer 10 is coupled to replay queue 20. Replay queue 20 is coupled to scheduler 30. Although only one scheduler is depicted so as to simplify the description of the invention, multiple schedulers may be coupled to the replay queue. Scheduler 30 dispatches instructions received from the replay queue 20 to be executed. Instructions may be dispatched when the resources, namely physical registers, are marked valid to execute the instructions, and when instructions are determined to be good candidates to execute speculatively. That is, scheduler 30 may dispatch an instruction without first determining whether data needed by the instruction is valid or available. More specifically, the scheduler dispatches speculatively based on the assumption that needed data is available in the cache memory. That is, the scheduler dispatches instructions based on latencies, assuming that the cache location holding needed input to an instruction will result in a cache hit when the instruction requests needed data from the cache memory during execution. Scheduler 30 outputs instructions to execution unit 40. Although only one execution unit is depicted so as to simplify the description of the invention, multiple execution units may be coupled to multiple schedulers. Execution unit 40 executes received instructions. Execution unit 40 may be comprised of an arithmetic logic unit (ALU), a floating point unit (FPU), a memory unit for performing memory loads (memory data reads) and stores (memory data writes), etc.
Execution unit 40 may be coupled to multiple levels of memory devices from which data may be retrieved and to which data may be stored. In one embodiment, execution unit 40 is coupled to L0 cache system 44, and L1 cache system 46, and external memory devices via memory request controller 42. As described herein, the term cache system includes all cache related components, including cache memory and hit/miss logic that determines whether requested data is found in the cache memory. L0 cache system 44 is the fastest memory device and may be located on the same semiconductor die as execution unit 40. As such, data can be retrieved from and written to L0 cache very quickly. In one embodiment, L0 cache system 44 and L1 cache system 46 are located on the die of processor 2, while L2 cache system 84 is located off the die of processor 2. In another embodiment, an L2 cache system may be included on the die adjacent to the L1 cache system and coupled to the execution unit via a memory request controller. In such an embodiment, an L3 cache system may be located off the die of the processor.
If data requested by execution unit 40 is not found in L0 cache system 44, execution unit 40 may attempt to retrieve needed data from additional levels of memory devices. Such requests may be made through memory request controller 42. After L0 cache system 44 is checked, the next level of memory devices is L1 cache system 46. If the data needed is not found in L1 cache system 46, execution unit 40 may be forced to retrieve the needed data from the next level of memory devices, which, in one embodiment, may be external memory devices coupled to processor 2 via external bus 82. An external bus interface 48 may be coupled to memory request controller 42 and external bus 82. In one embodiment, external memory devices may include some, all of, and/or multiple instances of L2 cache system 84, main memory 86, disk memory 88, and other storage devices, all of which may be coupled to external bus 82. In one embodiment, main memory 86 comprises dynamic random access memory (DRAM). Disk memory 88, main memory 86 and L2 cache system 84 are considered external memory devices because they are external to the processor and are coupled to the processor via an external bus. Access to main memory 86 and disk memory 88 is substantially slower than access to L2 cache system 84. Access to all external memory devices is much slower than access to the on-die cache memory systems.
In one embodiment, a computer system may include a first external bus dedicated to an L2 cache system and a second external bus used by all other external memory devices. In various embodiments, processor 2 may include one, two, three or more levels of on-die cache memory systems.
When attempting to load data to a register from memory, execution unit 40 may attempt to load the data from each of the memory devices from fastest to slowest. In one embodiment, the fastest level of memory devices, L0 cache system 44, is checked first, followed by L1 cache system 46, L2 cache system 84, main memory 86, and disk memory 88. The time to load memory increases as each additional memory level is accessed. When the needed data is eventually found, the data retrieved by execution unit 40 is stored in the fastest available memory device to allow for future access. In one embodiment, this may be L0 cache system 44.
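The fastest-to-slowest search can be sketched as a loop over memory levels. The function name `load`, the tuple layout of each level, and the cycle counts used in the usage note are assumptions chosen for illustration only; the description gives no specific latencies.

```python
def load(address, levels):
    """Search memory levels from fastest to slowest.

    `levels` is a list of (name, latency_cycles, store) tuples ordered from
    fastest to slowest, where `store` is a dict mapping address to value.
    Returns (value, name_of_level_that_hit, total_cycles_spent).
    """
    total_cycles = 0
    for name, latency_cycles, store in levels:
        total_cycles += latency_cycles  # each additional level adds delay
        if address in store:
            value = store[address]
            # Keep a copy in the fastest level to speed up future accesses.
            levels[0][2][address] = value
            return value, name, total_cycles
    raise KeyError(address)
```

With hypothetical levels `[("L0", 2, {}), ("L1", 10, {}), ("main", 300, {0x10: 42})]`, the first load of address `0x10` is served by main memory after 312 cycles, and a second load of the same address hits in L0 after 2 cycles.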
Processor 2 further includes a replay mechanism implemented via checker 60 and replay queue 20. Checker 60 is coupled to receive input from execution unit 40 and is coupled to provide output to replay queue 20. This replay mechanism provides that instructions that were not executed successfully may be re-executed or replayed. In one embodiment, staging queue 50 may be coupled between scheduler 30 and checker 60, in parallel with execution unit 40. In this embodiment, staging queue 50 may delay instructions for a fixed number of clock cycles so that the instruction in the execution unit and its corresponding result in the staging queue may enter the checker at the same moment in time. In various embodiments, the number of stages in staging queue 50 may vary based on the amount of staging or delay desired in each execution channel. A copy of each dispatched instruction may be staged through staging queue 50 in parallel to being executed through execution unit 40. In this manner, a copy of the instruction maintained in staging queue 50 is provided to checker 60. This copy of the instruction may then be routed back to replay queue 20 by checker 60 for re-execution if the instruction did not execute successfully.
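A fixed-delay staging queue behaves like a shift register that advances once per clock. The sketch below models that behavior only; the class name `StagingQueue` and its `clock` method are hypothetical, and the number of stages is a free parameter as in the description.

```python
from collections import deque


class StagingQueue:
    """Delays each instruction copy by a fixed number of clock cycles."""

    def __init__(self, stages):
        if stages < 1:
            raise ValueError("at least one stage is required")
        # Pre-filled with empty slots; maxlen makes append() drop the oldest.
        self.slots = deque([None] * stages, maxlen=stages)

    def clock(self, instr=None):
        """Advance one cycle: emit the oldest slot and accept a new entry."""
        out = self.slots[0]
        self.slots.append(instr)
        return out
```

An instruction copy inserted on one cycle emerges exactly `stages` cycles later, which is the property that lets it reach the checker together with the output of the execution unit.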
Checker 60 receives instructions output from staging queue 50 and execution unit 40, and determines which instructions have executed successfully and which have not. If an instruction has executed successfully, checker 60 marks the instruction as completed. Completed instructions are forwarded to retire unit 70 which is coupled to checker 60. Retire unit 70 un-re-orders instructions, placing the instructions in original, program order and retires the instruction. In addition, retire unit 70 is coupled to allocator/renamer 10. When an instruction is retired, retire unit 70 instructs allocator/renamer 10 to de-allocate the resources that were used by the retired instruction. In addition, retire unit 70 may be coupled to and communicate with replay queue 20 so that upon retirement, a signal is sent from retire unit 70 to replay queue 20 such that all data maintained by replay queue 20 for the instruction are de-allocated.
Execution of an instruction may be considered unsuccessful for multiple reasons. The most common reasons are an unfulfilled source dependency and an external replay condition. An unfulfilled source dependency can occur when a source of a current instruction is dependent on the result of another instruction which has not yet completed successfully. This data dependency may cause the current instruction to execute unsuccessfully if the correct data for the source is not available at execution time, that is, the result of a predecessor instruction is not available as source data at execution time resulting in a cache miss.
In one embodiment, checker 60 may maintain a table known as scoreboard 62 to track the readiness of sources such as registers. Scoreboard 62 may be used by checker 60 to keep track of whether the source data was valid or correct prior to execution of an instruction. After an instruction has been executed, checker 60 may use scoreboard 62 to determine whether data sources for the instruction were valid. If the sources were not valid at execution time, this may indicate to checker 60 that the instruction did not execute successfully due to an unfulfilled data dependency, and the instruction should therefore be replayed.
External replay conditions may include a cache miss (e.g., source data was not found in the L0 cache system at time of execution), incorrect forwarding of data (e.g., from a store buffer to a load), hidden memory dependencies, a write back conflict, an unknown data address, serializing instructions, etc. In one embodiment, L0 cache system 44 and L1 cache system 46 may be coupled to checker 60. In this embodiment L0 cache system 44 may generate an L0 cache miss signal to checker 60 when there is a cache miss at the L0 cache system. Such a cache miss indicates that the source data for the instruction was not found in L0 cache system 44. In another embodiment, similar information and/or signals may be similarly generated to checker 60 to indicate the occurrence of a cache miss at L1 cache system 46 and external replay conditions, such as from any external memory devices, including L2 cache, main memory, disk memory, etc. In this way, checker 60 may determine whether each instruction has executed successfully.
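The two classes of failure, an unfulfilled source dependency tracked by the scoreboard and an external replay condition such as an L0 cache miss signal, can be combined into a single decision. The following is a simplified sketch: the `Checker` class shown here, its `set_valid` and `needs_replay` methods, and the register names in the test are illustrative assumptions rather than the described circuit.

```python
class Checker:
    """Toy checker combining a scoreboard with external replay signals."""

    def __init__(self):
        # Scoreboard: source register -> True if its data is valid.
        self.scoreboard = {}

    def set_valid(self, register, valid=True):
        self.scoreboard[register] = valid

    def needs_replay(self, sources, l0_miss=False, other_external_condition=False):
        """Return True if the instruction did not execute successfully."""
        # Unknown registers are treated as not valid.
        sources_valid = all(self.scoreboard.get(r, False) for r in sources)
        unfulfilled_dependency = not sources_valid
        return unfulfilled_dependency or l0_miss or other_external_condition
```

An instruction is treated as successful only when every source was valid and no external replay condition was signaled.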
If checker 60 determines that the instruction has not executed successfully, checker 60 signals replay queue 20 that the instruction must be replayed, that is, re-executed. More specifically, checker 60 sends a replay needed signal to replay queue 20 that also identifies the instruction by an instruction sequence number. Instructions that do not execute successfully may be signaled to replay queue 20 such that all instructions, regardless of the type of instruction or the specific circumstances under which the instruction failed to execute successfully, will be signaled to replay queue 20 and replayed. Such unconditional replaying works well for instructions with short latencies which require only one or a small number of passes or replay iterations between checker 60 and replay queue 20.
As instructions may be speculatively scheduled for execution (i.e., before actually waiting for the correct source data to be available) on the expectation that the source data will be available, if it turns out that the source data was not available at the time of execution but that the sources are now valid, checker 60 determines that the instruction is safe for replay and signals replay queue 20 that the instruction needs to be replayed. That is, because the source data was not available or correct at the time of execution, but the data sources are now available and ready, checker 60 determines that the instruction is safe for replay and signals replay queue 20 that the instruction is replay safe. In one embodiment, the replay queue may mark a replay safe bit as true and clear an in-flight bit for the instruction. In this situation, in one embodiment, checker 60 may signal replay queue 20 with a replay safe bit paired with an instruction identifier. In other embodiments, the signals may be replaced with the checker sending actual instructions and accompanying information to the replay queue. If at the time execution of the instruction is completed the sources were not and are still not available, the replay safe bit and the in-flight bit may be cleared, that is, may be set to false.
Some long latency instructions may require many iterations through the replay loop before finally executing successfully. If the instruction did not execute successfully on the first attempt, checker 60 may determine whether the instruction requires a relatively long period of time to execute (i.e., a long latency instruction), requiring several replays before executing properly. There are many examples of long latency instructions. One example is a divide instruction which may require many clock cycles to execute.
Another example of a long latency instruction is a memory load or store instruction involving multiple levels of cache system misses, such as an L0 cache system miss and an L1 cache system miss. In such cases, an external bus request may be required to retrieve the data for the instruction. If access across an external bus is required to retrieve the desired data, the access delay is substantially increased. To retrieve data from an external memory, a memory request controller may be required to arbitrate for ownership of an external bus, issue a bus transaction (memory read) to the external bus, and then await return of the data from one of the external memory devices. Many more clock cycles may be required to retrieve data from a memory device on an external bus compared to the time needed to retrieve data from on-chip cache systems such as, for example, L0 cache system or L1 cache system. Thus, due to the need to retrieve data from an external memory device across an external bus, load instructions involving cache misses of on-chip cache systems may be considered to be long latency instructions.
During this relatively long period of time while the long latency instruction is being processed, it is possible that an instruction may circulate an inordinate number of times, anywhere from tens to hundreds of iterations, from replay queue 20 to scheduler 30 to execution unit 40 to checker 60 and back again. On each iteration, a long latency instruction may be replayed before the source data has returned. In that case, the instruction unnecessarily occupies a slot in the execution pipeline and uses execution resources which could have been allocated to other instructions which are ready to execute and may execute successfully. Moreover, there may be many additional instructions which are dependent upon the result of this long latency instruction which will similarly repeatedly circulate without properly executing. These dependent instructions will not execute properly until after the data for the long latency instruction returns from the external memory device, occupying and wasting even more execution resources. The unnecessary and excessive iterations which may occur before the return of needed data may waste execution resources, may waste power, and may increase overall latency. In addition, such iterations may cause a backup of instructions and greatly reduce processor performance in the form of reduced throughput.
For example, where several calculations are being performed for displaying pixels on a display, an instruction for one of the pixels may be a long latency instruction, e.g., requiring a memory access to an external memory device. There may be many non-dependent instructions for other pixels behind this long latency instruction that do not require an external memory access. As a result, by continuously replaying the long latency instruction and its dependent instructions, non-dependent instructions for other pixels may be precluded from execution. Once the long latency instruction has properly executed, execution slots and resources become available, and the instructions for the other pixels may then be executed. To prevent this condition, long latency instructions and instructions dependent thereon are kept in the replay queue until the sources for the long latency instruction are available. The instructions are then released. This is achieved by storing long latency instructions and instructions dependent thereon in replay queue 20. When data for a long latency instruction becomes available, such as when returning from an external memory device, the long latency instruction and its dependent instructions may then be sent by replay queue 20 to scheduler 30 for replay. In one embodiment, this may be accomplished by setting the valid bit to true and clearing, or setting to false or zero, the other bits, such as the in-flight bit, the replay safe bit and the ready bit, for each of the long latency instruction and instructions dependent thereon. In this embodiment, when the replay queue learns that the sources for the long latency instruction are available, the ready bit for the long latency instruction may be set to true, thus causing the replay queue to release the long-latency instruction and instructions dependent thereon to the scheduler. In this manner, long latency instructions will not unnecessarily delay execution of other non-dependent instructions. 
Performance is improved, throughput is increased, and power consumption is decreased when the non-dependent instructions execute in parallel while the long latency instruction awaits return of its data.
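The parking and release behavior described above can be sketched as follows. This is an illustrative model only: the class names `Entry` and `ReplayQueue` and the methods `park` and `sources_available` are assumptions introduced here, and the sketch assumes the dependents of a long latency instruction are already known.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Entry:
    """Hypothetical replay queue entry with the four status bits."""
    seq: int
    valid: bool = False
    in_flight: bool = False
    replay_safe: bool = False
    ready: bool = False


class ReplayQueue:
    """Holds a long latency instruction and its dependents until data returns."""

    def __init__(self) -> None:
        # Maps the sequence number of a long latency instruction to the
        # group [long latency instruction, dependent, dependent, ...].
        self.parked: Dict[int, List[Entry]] = {}

    def park(self, long_latency: Entry, dependents: List[Entry]) -> None:
        group = [long_latency, *dependents]
        for e in group:
            # Valid bit true; in-flight, replay safe and ready bits cleared.
            e.valid = True
            e.in_flight = e.replay_safe = e.ready = False
        self.parked[long_latency.seq] = group

    def sources_available(self, seq: int) -> List[Entry]:
        """Set the ready bit and release the whole group to the scheduler."""
        group = self.parked.pop(seq)
        group[0].ready = True
        return group


queue = ReplayQueue()
queue.park(Entry(seq=10), [Entry(seq=11), Entry(seq=12)])
released = queue.sources_available(10)
print([e.seq for e in released])
```

While the group is parked, none of its entries reach the scheduler, so non-dependent instructions are free to use the execution slots.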
In one embodiment, a long latency instruction may be identified and loaded into replay queue 20, and one or more additional instructions, that is, instructions which may be dependent upon the long latency instruction, may also be loaded into replay queue 20. When the condition causing the long latency instruction to not complete successfully is cleared, such as, for example, when data returns from an external bus after a cache miss or after completion of a division or multiplication operation or completion of another long latency instruction, replay queue 20 then transfers the instructions to scheduler 30 so that the long latency instruction and the additional instructions may then be replayed, that is, re-executed.
However, it may be difficult to identify dependent instructions because there can be hidden memory dependencies, etc. Therefore, in one embodiment, when a long latency instruction is identified and loaded into replay queue 20, all additional instructions which do not execute properly and are programmatically younger may be loaded into the replay queue as well. That is, in one embodiment, younger instructions may have sequence numbers greater than those of older instructions that were provided by the front end earlier in time.
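The conservative selection rule above can be sketched in a few lines. The function name `select_for_replay_queue` is hypothetical, and the sketch assumes sequence numbers increase monotonically without wraparound, which a real implementation would need to handle.

```python
from typing import Iterable, List


def select_for_replay_queue(long_latency_seq: int,
                            failed_seqs: Iterable[int]) -> List[int]:
    """Return the failed instructions that are programmatically younger.

    An instruction is treated as younger when its sequence number is
    greater than that of the long latency instruction. No dependency
    analysis is attempted, which sidesteps hidden memory dependencies.
    """
    return sorted(s for s in failed_seqs if s > long_latency_seq)


# Instructions 3, 6 and 9 failed; instruction 5 is the long latency one.
print(select_for_replay_queue(5, [3, 9, 6]))
```

The trade-off in this sketch is that some younger instructions that are not actually dependent may be held as well, in exchange for not having to identify true dependents.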
To avoid unnecessary replay of instructions, a counter may also be used. In one embodiment, a counter may be combined with the instruction and related information as it is passed from replay queue to scheduler to execution unit to checker within the processor. In another embodiment, a counter may be included with the replay queue. In yet another embodiment, an on-chip memory device may be dedicated to keep track of the replay status of all pending instructions and may be accessible by the scheduler, the replay queue, and/or the checker. In any of these embodiments, the counter may be used to maintain the number of times an instruction has been executed or replayed. In one embodiment, when the scheduler receives an instruction, the counter for the instruction may automatically be incremented. The counter is used to break replay dependency loops and to alleviate unnecessary execution of instructions that cannot yet be successfully executed. In one embodiment, the scheduler checks whether the counter for the instruction has exceeded a machine specified maximum number of replays. If the counter exceeds the maximum, the instruction is not scheduled to be executed until the data required by the instruction is available. When the data is available, the instruction is deemed safe for execution. According to this method, no instruction can loop through the processor more than the machine specified maximum number of iterations. In this way the replay loop is broken when the machine specified maximum number of iterations has been exceeded.
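A minimal sketch of the embodiment in which the scheduler increments the counter on receipt follows. The `Scheduler` class, its `receive` method, and the choice to keep the counters in a dictionary inside the scheduler are assumptions made for illustration; the specification also allows the counter to travel with the instruction or to live in the replay queue or a dedicated on-chip memory.

```python
from typing import Dict


class Scheduler:
    """Hypothetical scheduler that bounds speculative replays per instruction."""

    def __init__(self, max_replays: int) -> None:
        self.max_replays = max_replays      # machine specified maximum
        self.counters: Dict[int, int] = {}  # sequence number -> receipt count

    def receive(self, seq: int, data_available: bool) -> bool:
        """Return True if the instruction may be dispatched for execution."""
        count = self.counters.get(seq, 0) + 1  # incremented on every receipt
        self.counters[seq] = count
        if count > self.max_replays:
            # Past the limit, dispatch only once the data is available.
            return data_available
        return True  # within the limit, dispatch speculatively


sched = Scheduler(max_replays=2)
print([sched.receive(seq=1, data_available=False) for _ in range(3)])
```

In this sketch the first two receipts are dispatched speculatively and the third is held, since the counter then exceeds the limit and the data is still unavailable.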
C. A Method of Instruction Processing
FIG. 3 is a flow chart illustrating a method of instruction processing according to an embodiment of the present invention. A plurality of instructions are received, as shown in block 110. System resources are then allocated for use with the execution of the instructions, including renaming of resources, as shown in block 112. The instructions are then placed in a queue, as shown in block 114. Execution of the instructions is then scheduled based on data dependencies of the instructions and expected latencies of the instructions, as shown in block 116. A check is then made to determine whether a counter for the instruction is set to zero, signifying that it will be the first time that the instruction will be executed, as shown in block 118. If the counter is set to zero, the instruction is then executed as shown in block 124. If the counter is not zero, a check is then made to determine whether the counter for the instruction exceeds a system specified maximum, as shown in block 120. That is, a check is made to determine whether the maximum number of replay iterations has been exceeded. If the counter does not exceed a system specified maximum, the instruction is executed, as shown in block 124. If the counter exceeds the system specified maximum, a check is made to determine whether the data sources for the instruction are ready and available, as shown in block 122. That is, a check is made to determine whether the instruction is safe to be executed. If the instruction is not safe to be executed and if the maximum number of replays has been exceeded, the replay loop is broken, and flow continues at block 116, as shown in blocks 120 and 122. That is, whenever the instruction is not safe for execution and the maximum number of replays or executions is exceeded, further replays are prevented until the data for the instruction is available.
If the instruction is being executed for the first time, if the instruction is safe to be executed, or if the maximum number of replays of the instruction has not been exceeded, an attempt is made at executing the instruction, as shown in block 124. In the situations where the instruction is being executed for the first time and when the maximum number of replays has not been exceeded, the instruction is executed whether or not the data for it is available. In this way, the execution may be made speculatively. A check is then made to determine if the execution of the instruction was successful, as shown in block 126. If execution of the instruction was not successful, the method continues at block 128 where the counter for the instruction is incremented. A signal is then sent to the queue signifying that execution was unsuccessful and the instruction should be rescheduled for re-execution. If execution of the instruction is successful, as shown in block 126, the instruction is retired, including placing the instructions in program order, de-allocating system resources used by the instruction, and removing the instruction and related information from the queue, as shown in block 130.
In another embodiment, the checks for data availability and exceeding the replay counter may be reversed. In this embodiment, blocks 120 and 122 may be executed in reverse order. In this embodiment, a check is made to learn whether the instruction is safe, and, if it is, the instruction is executed. If the instruction is not safe, a check is made to determine whether the maximum number of replays has been exceeded; if it has, the replay loop is broken, and flow proceeds to block 116. If the maximum number of replays has not been exceeded, a speculative execution proceeds.
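The two orderings of the checks can be compared with a short sketch. Both function names are hypothetical, and the sketch reads "exceeded" as a strict comparison against the maximum. Under that reading, the two orderings reach the same dispatch decision for every input and differ only in which check is evaluated first.

```python
from itertools import product


def dispatch_counter_first(counter: int, safe: bool, max_replays: int) -> bool:
    """Ordering of FIG. 3: counter checks (blocks 118, 120), then safety (122)."""
    if counter == 0:
        return True
    if counter <= max_replays:
        return True
    return safe


def dispatch_safe_first(counter: int, safe: bool, max_replays: int) -> bool:
    """Reversed ordering: safety check first, then the replay limit."""
    if safe:
        return True
    return counter <= max_replays  # speculative execution if within the limit


# Exhaustively compare the two orderings over a small range of inputs.
same = all(
    dispatch_counter_first(c, s, 3) == dispatch_safe_first(c, s, 3)
    for c, s in product(range(8), (False, True))
)
print(same)
```

A possible reason to prefer one ordering over the other is which check is cheaper to evaluate in a given design; the specification does not state a preference.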
In the foregoing specification, the invention has been described with reference to specific embodiments thereof. It will, however, be evident that various modifications and changes can be made thereto without departing from the broader spirit and scope of the invention as set forth in the appended claims. The specification and drawings are, accordingly, to be regarded in an illustrative rather than a restrictive sense.

Claims (23)

1. A processor comprising:
a replay queue to receive a plurality of instructions;
an execution unit to execute the plurality of instructions;
a scheduler coupled between the replay queue and the execution unit to speculatively schedule instructions for execution, to increment a counter for each of the plurality of instructions to reflect the number of times each of the plurality of instructions has been executed, and to dispatch each instruction of the plurality of instructions to the execution unit either when the counter does not exceed a maximum number of replays or, if the counter for the instruction exceeds the maximum number of replays, when the instruction is safe to execute; and
a checker coupled to the execution unit to determine whether each instruction has executed successfully, and coupled to the replay queue to communicate to the replay queue each instruction that has not executed successfully.
2. The processor of claim 1 further comprising:
an allocator/renamer coupled to the replay queue to allocate and rename those of a plurality of resources needed by the instruction.
3. The processor of claim 2 further comprising:
a front end coupled to the allocator/renamer to provide the plurality of instructions to the allocator/renamer.
4. The processor of claim 2 further comprising:
a retire unit to retire the plurality of instructions, coupled to the checker to receive those of the plurality of instructions that have executed successfully, and coupled to the allocator/renamer to communicate a de-allocate signal to the allocator/renamer.
5. The processor of claim 4 wherein the retire unit is further coupled to the replay queue to communicate a retire signal when one of the plurality of instructions is retired.
6. The processor of claim 1 further comprising:
at least one cache system on a die of the processor;
a plurality of external memory devices; and
a memory request controller coupled to the execution unit to obtain data from the at least one cache system and the plurality of external memory devices.
7. The processor of claim 6 wherein the at least one cache system comprises a first level cache system and a second level cache system.
8. The processor of claim 6 wherein the external memory devices comprise at least one of a third level cache system, a main memory, and a disk memory.
9. The processor of claim 1 further comprising:
a staging queue coupled between the checker and the scheduler.
10. The processor of claim 1 wherein the counter is one of a plurality of counters such that each counter of the plurality of counters is paired with one of the plurality of instructions.
11. The processor of claim 1 wherein the checker comprises a scoreboard to maintain a status of a plurality of resources.
12. A processor comprising:
a replay queue to receive a plurality of instructions;
at least two execution units to execute the plurality of instructions;
at least two schedulers coupled between the replay queue and the execution units to schedule instructions for execution, to increment a counter for each of the plurality of instructions to reflect the number of times each of the plurality of instructions has been executed, and to communicate each instruction of the plurality of instructions to the execution units when the counter does not exceed a maximum number or, if the counter for the instruction exceeds the maximum number of replays, when a data required by the instruction is available; and
a checker coupled to the execution units to determine whether each instruction has executed successfully, and coupled to the replay queue to communicate each instruction that has not executed successfully.
13. The processor of claim 12 further comprising:
a plurality of memory devices coupled to the execution units such that the checker determines whether the instruction has executed successfully based on a plurality of information provided by the memory devices.
14. The processor of claim 12 further comprising:
an allocator/renamer coupled to the replay queue to allocate and rename those of a plurality of resources needed by the plurality of instructions.
15. The processor of claim 14 further comprising:
a front end coupled to the allocator/renamer to provide the plurality of instructions to the allocator/renamer.
16. The processor of claim 14 further comprising:
a retire unit to retire the plurality of instructions, coupled to the checker to receive those of the plurality of instructions that have executed successfully, and coupled to the allocator/renamer to communicate a de-allocate signal to the allocator/renamer.
17. The processor of claim 16 wherein the retire unit is further coupled to the replay queue to communicate a retire signal when one of the plurality of instructions is retired.
18. A method comprising:
receiving an instruction of a plurality of instructions;
placing the instruction in a queue with other instructions of the plurality of instructions;
speculatively re-ordering those of the plurality of instructions in a scheduler based on data dependencies and the instruction latencies;
dispatching one of the plurality of instructions to an execution unit to be executed either when a counter reflecting the number of times the instruction has been executed does not exceed a maximum number of replays or, if the counter for the instruction exceeds the maximum number of replays, when a required data for the instruction is available;
executing the instruction;
determining whether the instruction executed successfully;
routing the instruction back to the queue if the instruction did not execute successfully; and
retiring the instruction if the instruction executed successfully.
19. The method of claim 18 further comprising:
allocating those of a plurality of system resources needed by the instruction.
20. The method of claim 19 wherein retiring comprises:
de-allocating those of the plurality of system resources used by the instruction being retired;
removing the instruction and a plurality of related data from the queue.
21. The method of claim 18, wherein the counter is one of a plurality of counters, one for each of the plurality of instructions in the scheduler, further comprising:
maintaining the plurality of counters, such that the counters reflect the number of times the corresponding instruction has been executed.
22. The method of claim 21 wherein each of the plurality of counters for the instruction is paired with each of the plurality of the instructions.
23. The method of claim 21 wherein the plurality of counters is stored in the scheduler.
US09/705,668 2000-11-02 2000-11-02 Breaking replay dependency loops in a processor using a rescheduled replay queue Expired - Fee Related US6981129B1 (en)

Priority Applications (5)

Application Number Priority Date Filing Date Title
US09/705,668 US6981129B1 (en) 2000-11-02 2000-11-02 Breaking replay dependency loops in a processor using a rescheduled replay queue
EP01986210A EP1334426A2 (en) 2000-11-02 2001-10-18 Apparatus and method to reschedule instructions
PCT/US2001/050735 WO2002039269A2 (en) 2000-11-02 2001-10-18 Apparatus and method to reschedule instructions
CNB018198961A CN1294484C (en) 2000-11-02 2001-10-18 Breaking replay dependency loops in processor using rescheduled replay queue
AU2002236668A AU2002236668A1 (en) 2000-11-02 2001-10-18 Apparatus and method to reschedule instructions

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
US09/705,668 US6981129B1 (en) 2000-11-02 2000-11-02 Breaking replay dependency loops in a processor using a rescheduled replay queue

Publications (1)

Publication Number Publication Date
US6981129B1 true US6981129B1 (en) 2005-12-27

Family

ID=24834458

Family Applications (1)

Application Number Title Priority Date Filing Date
US09/705,668 Expired - Fee Related US6981129B1 (en) 2000-11-02 2000-11-02 Breaking replay dependency loops in a processor using a rescheduled replay queue

Country Status (5)

Country Link
US (1) US6981129B1 (en)
EP (1) EP1334426A2 (en)
CN (1) CN1294484C (en)
AU (1) AU2002236668A1 (en)
WO (1) WO2002039269A2 (en)



Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US3603934A (en) * 1968-07-15 1971-09-07 Ibm Data processing system capable of operation despite a malfunction
US5784587A (en) 1996-06-13 1998-07-21 Hewlett-Packard Company Method and system for recovering from cache misses
WO1999031589A1 (en) 1997-12-16 1999-06-24 Intel Corporation Out-of-pipeline trace buffer for instruction replay following misspeculation
US5944818A (en) * 1996-06-28 1999-08-31 Intel Corporation Method and apparatus for accelerated instruction restart in a microprocessor
US5966544A (en) 1996-11-13 1999-10-12 Intel Corporation Data speculatable processor having reply architecture
WO2000041070A1 (en) 1998-12-30 2000-07-13 Intel Corporation A computer processor having a replay unit
US6496925B1 (en) * 1999-12-09 2002-12-17 Intel Corporation Method and apparatus for processing an event occurrence within a multithreaded processor
US6535905B1 (en) * 1999-04-29 2003-03-18 Intel Corporation Method and apparatus for thread switching within a multithreaded processor
US6542921B1 (en) * 1999-07-08 2003-04-01 Intel Corporation Method and apparatus for controlling the processing priority between multiple threads in a multithreaded processor

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1206145A (en) * 1997-06-30 1999-01-27 索尼公司 Signal processor having pipeline processing circuit and method of the same

Patent Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US3603934A (en) * 1968-07-15 1971-09-07 Ibm Data processing system capable of operation despite a malfunction
US5784587A (en) 1996-06-13 1998-07-21 Hewlett-Packard Company Method and system for recovering from cache misses
US5944818A (en) * 1996-06-28 1999-08-31 Intel Corporation Method and apparatus for accelerated instruction restart in a microprocessor
US5966544A (en) 1996-11-13 1999-10-12 Intel Corporation Data speculatable processor having reply architecture
US6212626B1 (en) * 1996-11-13 2001-04-03 Intel Corporation Computer processor having a checker
WO1999031589A1 (en) 1997-12-16 1999-06-24 Intel Corporation Out-of-pipeline trace buffer for instruction replay following misspeculation
WO2000041070A1 (en) 1998-12-30 2000-07-13 Intel Corporation A computer processor having a replay unit
US6535905B1 (en) * 1999-04-29 2003-03-18 Intel Corporation Method and apparatus for thread switching within a multithreaded processor
US6542921B1 (en) * 1999-07-08 2003-04-01 Intel Corporation Method and apparatus for controlling the processing priority between multiple threads in a multithreaded processor
US6496925B1 (en) * 1999-12-09 2002-12-17 Intel Corporation Method and apparatus for processing an event occurrence within a multithreaded processor

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
Johnson, M.: "Out-of-Order Issue" Chapter 7, Superscalar Microprocessor Design, Englewood Cliffs, NJ, U.S., pp. 127-146 XP002111569.
Johnson, Mike. Superscalar Microprocessor Design. Englewood Cliffs, NJ: Prentice Hall, Inc., ©1991. pp. 133-134. *
PCT International Search Report dated Jun. 21, 2002.
PCT International Search Report dated Sep. 18, 2002.

Cited By (77)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8028314B1 (en) 2000-05-26 2011-09-27 Sharp Laboratories Of America, Inc. Audiovisual information management system
US8020183B2 (en) 2000-09-14 2011-09-13 Sharp Laboratories Of America, Inc. Audiovisual management system
US8606782B2 (en) 2001-02-15 2013-12-10 Sharp Laboratories Of America, Inc. Segmentation description scheme for audio-visual content
US7904814B2 (en) 2001-04-19 2011-03-08 Sharp Laboratories Of America, Inc. System for presenting audio-video content
US8018491B2 (en) 2001-08-20 2011-09-13 Sharp Laboratories Of America, Inc. Summarization of football video content
US7474698B2 (en) * 2001-10-19 2009-01-06 Sharp Laboratories Of America, Inc. Identification of replay segments
US7653131B2 (en) 2001-10-19 2010-01-26 Sharp Laboratories Of America, Inc. Identification of replay segments
US8028234B2 (en) 2002-01-28 2011-09-27 Sharp Laboratories Of America, Inc. Summarization of sumo video content
US7793205B2 (en) 2002-03-19 2010-09-07 Sharp Laboratories Of America, Inc. Synchronization of video and data
US8214741B2 (en) 2002-03-19 2012-07-03 Sharp Laboratories Of America, Inc. Synchronization of video and data
US7853865B2 (en) 2002-03-19 2010-12-14 Sharp Laboratories Of America, Inc. Synchronization of video and data
US7657836B2 (en) 2002-07-25 2010-02-02 Sharp Laboratories Of America, Inc. Summarization of soccer video content
US7657907B2 (en) 2002-09-30 2010-02-02 Sharp Laboratories Of America, Inc. Automatic user profiling
US20070028048A1 (en) * 2003-09-30 2007-02-01 Belliappa Kuttanna Early data return indication mechanism
US7451295B2 (en) 2003-09-30 2008-11-11 Intel Corporation Early data return indication mechanism for data cache to detect readiness of data via an early data ready indication by scheduling, rescheduling, and replaying of requests in request queues
US20050071563A1 (en) * 2003-09-30 2005-03-31 Belliappa Kuttanna Early data return indication mechanism
US7111153B2 (en) * 2003-09-30 2006-09-19 Intel Corporation Early data return indication mechanism
US20090043965A1 (en) * 2003-09-30 2009-02-12 Belliappa Kuttanna Early data return indication mechanism
US8776142B2 (en) 2004-03-04 2014-07-08 Sharp Laboratories Of America, Inc. Networked video devices
US8356317B2 (en) 2004-03-04 2013-01-15 Sharp Laboratories Of America, Inc. Presence based technology
US8949899B2 (en) 2005-03-04 2015-02-03 Sharp Laboratories Of America, Inc. Collaborative recommendation system
US8689253B2 (en) 2006-03-03 2014-04-01 Sharp Laboratories Of America, Inc. Method and system for configuring media-playing sets
US20080028193A1 (en) * 2006-07-31 2008-01-31 Advanced Micro Devices, Inc. Transitive suppression of instruction replay
US7502914B2 (en) * 2006-07-31 2009-03-10 Advanced Micro Devices, Inc. Transitive suppression of instruction replay
US20090024838A1 (en) * 2007-07-20 2009-01-22 Dhodapkar Ashutosh S Mechanism for suppressing instruction replay in a processor
US7861066B2 (en) 2007-07-20 2010-12-28 Advanced Micro Devices, Inc. Mechanism for predicting and suppressing instruction replay in a processor
US9547556B1 (en) * 2008-04-02 2017-01-17 Marvell International Ltd. Restart operation with logical blocks in queued commands
US20100082953A1 (en) * 2008-09-30 2010-04-01 Faraday Technology Corp. Recovery apparatus for solving branch mis-prediction and method and central processing unit thereof
US7945767B2 (en) * 2008-09-30 2011-05-17 Faraday Technology Corp. Recovery apparatus for solving branch mis-prediction and method and central processing unit thereof
US9256428B2 (en) 2013-02-06 2016-02-09 International Business Machines Corporation Load latency speculation in an out-of-order computer processor
US9262160B2 (en) 2013-02-06 2016-02-16 International Business Machines Corporation Load latency speculation in an out-of-order computer processor
US9400653B2 (en) 2013-03-14 2016-07-26 Samsung Electronics Co., Ltd. System and method to clear and rebuild dependencies
US10552157B2 (en) 2013-03-14 2020-02-04 Samsung Electronics Co., Ltd. System and method to clear and rebuild dependencies
US9606806B2 (en) 2013-06-25 2017-03-28 Advanced Micro Devices, Inc. Dependence-based replay suppression
US9715389B2 (en) 2013-06-25 2017-07-25 Advanced Micro Devices, Inc. Dependent instruction suppression
US9483273B2 (en) * 2013-07-16 2016-11-01 Advanced Micro Devices, Inc. Dependent instruction suppression in a load-operation instruction
US9489206B2 (en) 2013-07-16 2016-11-08 Advanced Micro Devices, Inc. Dependent instruction suppression
US20150026686A1 (en) * 2013-07-16 2015-01-22 Advanced Micro Devices, Inc. Dependent instruction suppression in a load-operation instruction
US20160092233A1 (en) * 2014-09-30 2016-03-31 International Business Machines Corporation Dynamic issue masks for processor hang prevention
US20160092212A1 (en) * 2014-09-30 2016-03-31 International Business Machines Corporation Dynamic issue masks for processor hang prevention
US10108426B2 (en) * 2014-09-30 2018-10-23 International Business Machines Corporation Dynamic issue masks for processor hang prevention
US10102002B2 (en) * 2014-09-30 2018-10-16 International Business Machines Corporation Dynamic issue masks for processor hang prevention
US9915998B2 (en) 2014-12-14 2018-03-13 Via Alliance Semiconductor Co., Ltd Power saving mechanism to reduce load replays in out-of-order processor
US10120689B2 (en) 2014-12-14 2018-11-06 Via Alliance Semiconductor Co., Ltd Mechanism to preclude load replays dependent on off-die control element access in an out-of-order processor
US9740271B2 (en) * 2014-12-14 2017-08-22 Via Alliance Semiconductor Co., Ltd. Apparatus and method to preclude X86 special bus cycle load replays in an out-of-order processor
US10083038B2 (en) 2014-12-14 2018-09-25 Via Alliance Semiconductor Co., Ltd Mechanism to preclude load replays dependent on page walks in an out-of-order processor
US10089112B2 (en) 2014-12-14 2018-10-02 Via Alliance Semiconductor Co., Ltd Mechanism to preclude load replays dependent on fuse array access in an out-of-order processor
US10088881B2 (en) 2014-12-14 2018-10-02 Via Alliance Semiconductor Co., Ltd Mechanism to preclude I/O-dependent load replays in an out-of-order processor
US10095514B2 (en) 2014-12-14 2018-10-09 Via Alliance Semiconductor Co., Ltd Mechanism to preclude I/O-dependent load replays in an out-of-order processor
US9703359B2 (en) 2014-12-14 2017-07-11 Via Alliance Semiconductor Co., Ltd. Power saving mechanism to reduce load replays in out-of-order processor
US10108430B2 (en) 2014-12-14 2018-10-23 Via Alliance Semiconductor Co., Ltd Mechanism to preclude load replays dependent on off-die control element access in an out-of-order processor
US10108420B2 (en) 2014-12-14 2018-10-23 Via Alliance Semiconductor Co., Ltd Mechanism to preclude load replays dependent on long load cycles in an out-of-order processor
US10108428B2 (en) 2014-12-14 2018-10-23 Via Alliance Semiconductor Co., Ltd Mechanism to preclude load replays dependent on long load cycles in an out-of-order processor
US10108421B2 (en) 2014-12-14 2018-10-23 Via Alliance Semiconductor Co., Ltd Mechanism to preclude shared ram-dependent load replays in an out-of-order processor
US9645827B2 (en) * 2014-12-14 2017-05-09 Via Alliance Semiconductor Co., Ltd. Mechanism to preclude load replays dependent on page walks in an out-of-order processor
US10108429B2 (en) 2014-12-14 2018-10-23 Via Alliance Semiconductor Co., Ltd Mechanism to preclude shared RAM-dependent load replays in an out-of-order processor
US10108427B2 (en) 2014-12-14 2018-10-23 Via Alliance Semiconductor Co., Ltd Mechanism to preclude load replays dependent on fuse array access in an out-of-order processor
US10114794B2 (en) 2014-12-14 2018-10-30 Via Alliance Semiconductor Co., Ltd Programmable load replay precluding mechanism
US10114646B2 (en) 2014-12-14 2018-10-30 Via Alliance Semiconductor Co., Ltd Programmable load replay precluding mechanism
US9804845B2 (en) 2014-12-14 2017-10-31 Via Alliance Semiconductor Co., Ltd. Apparatus and method to preclude X86 special bus cycle load replays in an out-of-order processor
US10127046B2 (en) 2014-12-14 2018-11-13 Via Alliance Semiconductor Co., Ltd. Mechanism to preclude uncacheable-dependent load replays in out-of-order processor
US10133580B2 (en) 2014-12-14 2018-11-20 Via Alliance Semiconductor Co., Ltd Apparatus and method to preclude load replays dependent on write combining memory space access in an out-of-order processor
US10133579B2 (en) 2014-12-14 2018-11-20 Via Alliance Semiconductor Co., Ltd. Mechanism to preclude uncacheable-dependent load replays in out-of-order processor
US10146539B2 (en) 2014-12-14 2018-12-04 Via Alliance Semiconductor Co., Ltd. Load replay precluding mechanism
US10146547B2 (en) 2014-12-14 2018-12-04 Via Alliance Semiconductor Co., Ltd. Apparatus and method to preclude non-core cache-dependent load replays in an out-of-order processor
US10146540B2 (en) 2014-12-14 2018-12-04 Via Alliance Semiconductor Co., Ltd Apparatus and method to preclude load replays dependent on write combining memory space access in an out-of-order processor
US10146546B2 (en) 2014-12-14 2018-12-04 Via Alliance Semiconductor Co., Ltd Load replay precluding mechanism
US20160350126A1 (en) * 2014-12-14 2016-12-01 VIA Alliance Semiconductor Co., Ltd. Mechanism to preclude load replays dependent on page walks in an out-of-order processor
US10175984B2 (en) 2014-12-14 2019-01-08 Via Alliance Semiconductor Co., Ltd Apparatus and method to preclude non-core cache-dependent load replays in an out-of-order processor
US10209996B2 (en) 2014-12-14 2019-02-19 Via Alliance Semiconductor Co., Ltd. Apparatus and method for programmable load replay preclusion
US10228944B2 (en) 2014-12-14 2019-03-12 Via Alliance Semiconductor Co., Ltd. Apparatus and method for programmable load replay preclusion
US10678542B2 (en) * 2015-07-24 2020-06-09 Apple Inc. Non-shifting reservation station
US20190004804A1 (en) * 2017-06-29 2019-01-03 Intel Corporation Methods and apparatus for handling runtime memory dependencies
US11379242B2 (en) * 2017-06-29 2022-07-05 Intel Corporation Methods and apparatus for using load and store addresses to resolve memory dependencies
US10452434B1 (en) 2017-09-11 2019-10-22 Apple Inc. Hierarchical reservation station
US20200272751A1 (en) * 2019-02-21 2020-08-27 Coremedia Ag Method and apparatus for managing data in a content management system
CN117707995A (en) * 2024-02-02 2024-03-15 北京惠朗时代科技有限公司 Optimization device for data pre-reading and operation method

Also Published As

Publication number Publication date
CN1294484C (en) 2007-01-10
AU2002236668A1 (en) 2002-05-21
CN1478228A (en) 2004-02-25
WO2002039269A2 (en) 2002-05-16
EP1334426A2 (en) 2003-08-13
WO2002039269A3 (en) 2003-01-23

Similar Documents

Publication Publication Date Title
US6981129B1 (en) Breaking replay dependency loops in a processor using a rescheduled replay queue
US6877086B1 (en) Method and apparatus for rescheduling multiple micro-operations in a processor using a replay queue and a counter
US7219349B2 (en) Multi-threading techniques for a processor utilizing a replay queue
US6163838A (en) Computer processor with a replay system
US7200737B1 (en) Processor with a replay system that includes a replay queue for improved throughput
US6912648B2 (en) Stick and spoke replay with selectable delays
US6493820B2 (en) Processor having multiple program counters and trace buffers outside an execution pipeline
US5778245A (en) Method and apparatus for dynamic allocation of multiple buffers in a processor
US6772324B2 (en) Processor having multiple program counters and trace buffers outside an execution pipeline
US6240509B1 (en) Out-of-pipeline trace buffer for holding instructions that may be re-executed following misspeculation
US7502912B2 (en) Method and apparatus for rescheduling operations in a processor
US6094717A (en) Computer processor with a replay system having a plurality of checkers
US7159154B2 (en) Technique for synchronizing faults in a processor having a replay system
US6212626B1 (en) Computer processor having a checker
US6393550B1 (en) Method and apparatus for pipeline streamlining where resources are immediate or certainly retired
US20040128484A1 (en) Method and apparatus for transparent delayed write-back
US11829767B2 (en) Register scoreboard for a microprocessor with a time counter for statically dispatching instructions
US20050147036A1 (en) Method and apparatus for enabling an adaptive replay loop in a processor
US9858075B2 (en) Run-time code parallelization with independent speculative committing of instructions per segment
US20230315474A1 (en) Microprocessor with apparatus and method for replaying instructions
US11829762B2 (en) Time-resource matrix for a microprocessor with time counter for statically dispatching instructions
US11954491B2 (en) Multi-threading microprocessor with a time counter for statically dispatching instructions
US11829187B2 (en) Microprocessor with time counter for statically dispatching instructions
US20230350680A1 (en) Microprocessor with baseline and extended register sets
US20230273796A1 (en) Microprocessor with time counter for statically dispatching instructions with phantom registers

Legal Events

Date Code Title Description
AS Assignment

Owner name: INTEL CORPORATION, CALIFORNIA

Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:BOGGS, DARRELL D.;CARMEAN, DOUGLAS M.;HAMMARLUND, PER H.;AND OTHERS;REEL/FRAME:011269/0951;SIGNING DATES FROM 20001030 TO 20001031

FPAY Fee payment

Year of fee payment: 4

FPAY Fee payment

Year of fee payment: 8

REMI Maintenance fee reminder mailed
LAPS Lapse for failure to pay maintenance fees

Free format text: PATENT EXPIRED FOR FAILURE TO PAY MAINTENANCE FEES (ORIGINAL EVENT CODE: EXP.)

STCH Information on status: patent discontinuation

Free format text: PATENT EXPIRED DUE TO NONPAYMENT OF MAINTENANCE FEES UNDER 37 CFR 1.362

FP Lapsed due to failure to pay maintenance fee

Effective date: 20171227