WO2009156920A1 - Method, register file system, and processing unit device enabling substantially direct cache memory access
- Publication number: WO2009156920A1 (PCT/IB2009/052610)
- Authority: WIPO (PCT)
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F13/00—Interconnection of, or transfer of information or other signals between, memories, input/output devices or central processing units
- G06F13/14—Handling requests for interconnection or transfer
- G06F13/20—Handling requests for interconnection or transfer for access to input/output bus
- G06F13/28—Handling requests for interconnection or transfer for access to input/output bus using burst mode transfer, e.g. direct memory access DMA, cycle steal
Definitions
- the present invention relates to processing units. More particularly, the present invention relates to a method, register file system, and processing unit device configured to enable substantially direct access to cache memory means that are, at least partially, provided within said register file system (and within said processing unit device).
- Cache memory is a temporary storage memory, where relatively frequently accessed data can be stored to achieve more rapid data access. Once the data (a copy of the data) is stored in the cache memory, future use can be made by accessing said data in the cache memory rather than refetching or recomputing the original data, so that the average access time to said data is relatively shorter.
- Cache Line: a block of memory that is transferred between a memory means (e.g., program/system memory, such as SRAM (Static Random Access Memory)) and a cache memory.
- the cache line is usually fixed in size, ranging for example from 16 to 256 bytes.
- the effectiveness of the cache line size depends on the application, and cache circuits may be configurable to a different cache line size. Also, there are several conventional algorithms for dynamically adjusting a cache line size.
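The relation between a byte address, its containing cache line, and the offset within that line can be sketched as follows (an illustrative example only, not part of the disclosure; the 64-byte line size is an assumed value within the 16 to 256 byte range mentioned above):

```python
# Illustrative sketch: how a fixed cache line size splits a byte address
# into the base address of the containing line and an offset within it.
# The 64-byte line size is an assumed example value.

LINE_SIZE = 64  # bytes per cache line (assumed; the text cites 16-256 bytes)

def split_address(addr: int) -> tuple[int, int]:
    """Return (line_address, offset_within_line) for a byte address."""
    offset = addr % LINE_SIZE
    line_address = addr - offset  # first byte of the containing cache line
    return line_address, offset

line, off = split_address(0x1234)
print(hex(line), off)  # 0x1200 52
```

Two addresses sharing the same line address are brought into the cache together by a single line fill.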
- Cacheable Memory: memory that can be cached, i.e. memory whose data can be stored within the cache memory.
- Instruction Register: a register that stores the current instruction to be executed.
- the instruction register is provided within a processing unit, and is located in physical proximity to processing means, such as ALU (Arithmetic Logic Unit).
- Opcode: an opcode (operation code) is the portion of an instruction that specifies the operation to be performed (e.g., addition, subtraction, and the like).
- Operand: an instruction operand is a data value or a pointer (address) to the data, on which (or by means of which) an operation/processing (e.g., addition, subtraction, and the like) has to be performed.
- Register File: a storage unit/system located within the processing unit, such as the CPU (Central Processing Unit). Generally, the register file is a combination of registers and combinatorial logic.
- a conventional central processing unit operates in four steps: a) fetching; b) decoding (which involves reading data from the CPU register file); c) instruction executing; and d) writing back the result of said executing.
- the first step, fetching, involves retrieving an instruction from the program memory (e.g., RAM (Random Access Memory)). The instruction location in the program memory is determined by a program counter, which keeps track of the CPU's position in the current program.
- the program counter is incremented by the length of the instruction word in terms of memory units; also, for example, when a conventional JUMP or BRANCH command is received, the program counter value is changed accordingly.
- the instruction to be fetched must be retrieved from relatively slow memory (e.g., secondary memory) by means of a conventional Input/Output control unit, causing the CPU to stall while waiting for the instruction to be returned to said CPU.
- the instruction that the CPU fetches from the memory is used to determine what the CPU has to do; thus the CPU cannot proceed with processing until said instruction is fetched.
- the instruction is broken up into several portions to be processed by other CPU units (e.g., ALU).
- ISA: CPU instruction set architecture
- the remaining numbers in the instruction usually provide information required for that instruction (e.g., operands for the addition/subtraction operation).
- operands may be given as a constant value (called an immediate value).
- operands may be provided as addresses of corresponding values stored in a register file (that comprises a plurality of registers, e.g., 32 or 64 registers).
- the executing step is performed.
- the CPU performs the desired operation. If, for example, an addition operation is requested, the numbers to be added are provided to inputs of the Arithmetic Logic Unit (ALU), and the result (the final sum) will be provided at the ALU outputs.
- the ALU comprises a circuitry to perform simple arithmetic and logical operations on the inputs, such as addition/subtraction operations.
- the results of the executing step are "written back" to the register file or to CPU registers. After accomplishing the instruction execution and writing back the resulting data, the entire process repeats with the next instruction cycle, normally fetching the next-in-sequence instruction due to the incremented value in the program counter.
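The four steps above can be sketched as the following toy model (illustrative only, not from the disclosure; the instruction format and the 32-register file size are assumptions):

```python
# Toy model of the four-step cycle: fetch the instruction at the program
# counter, decode its register fields, execute on a simple ALU, and write
# the result back to the register file. Format and sizes are assumed.

registers = [0] * 32               # a limited local register file
program = [
    ("ADD", 2, 0, 1),              # r2 = r0 + r1
    ("SUB", 3, 2, 0),              # r3 = r2 - r0
]

def run(program, registers):
    pc = 0                                     # program counter
    while pc < len(program):
        opcode, dst, src1, src2 = program[pc]  # fetch + decode
        if opcode == "ADD":                    # execute
            result = registers[src1] + registers[src2]
        elif opcode == "SUB":
            result = registers[src1] - registers[src2]
        registers[dst] = result                # write back
        pc += 1                                # next-in-sequence instruction

registers[0], registers[1] = 5, 7
run(program, registers)
print(registers[2], registers[3])  # 12 7
```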
- a conventional processing unit (e.g., a CPU) uses a limited set of registers (e.g., 32 or 64 registers).
- the register file can be implemented in hardware by means of a plurality of electronic elements, such as latches, flip-flops, memory arrays, multi-port SRAM (Static Random Access Memory) and the like.
- this register file is a portion of the CPU, and it is located in physical proximity to the ALU (Arithmetic Logic Unit) of said CPU.
- ALU: Arithmetic Logic Unit
- Such a register file can be named a "local register file system".
- One of the reasons for having a limited local register file system is the limited size of the CPU program word, which usually contains pointers to 3 registers: one register (accessed via the "source 1" input of the local register file system) storing the first value to be processed by the ALU, another register (accessed via the "source 2" input of the local register file system) storing the second value to be processed by said ALU, and the last register (the destination register, accessed via the "destination" input of the local register file system) storing the result value of the ALU processing (e.g., the sum of the values stored within said "sources 1 and 2"). Since the CPU program word is limited (in terms of data bits), the number of bits allowed for each of the above registers is low.
- VLIW: Very Long Instruction Word
- the CPU register file relatively rarely reaches a capacity of 256 registers.
- a conventional system that implements CPU also usually contains a memory controller (that comprises a MMU (Memory Management Unit)), various memory means (e.g., cache, SRAM, etc.), and different peripherals, such as cache controllers, interrupt controllers, timers, hardware accelerators, DMA (Direct Memory Access) engine, communication controllers (e.g., a USB controller), and the like.
- the memory controller controls the CPU access to a wide range of registers/memory means, such as internal CPU memories (program and data), on-chip memories (including, for example, cache memory), on-chip peripheral memories, and off-chip (device) memories.
- the CPU local register file system is significantly limited in its size (e.g., contains only 32 registers), and the CPU memory mapped registers (to be accessed, for example, by CPU internal units, such as the ALU) are physically located outside the CPU local register file system (e.g., cache, secondary memory, etc.).
- the CPU needs to generate LOAD commands for loading data by means of the memory controller from each of said memory mapped registers (e.g., located off-CPU-chip (outside the CPU chip)) into registers of the CPU local register file system.
- the CPU can manipulate said data (e.g., to perform data addition or data subtraction operations by means of its ALU unit). Then, the result is first stored in another register of the CPU local register file system, and after that said result is conveyed to the corresponding memory mapped register (for example, non-local device/peripheral (located off-CPU-chip)) for updating it with a new data value - the result of ALU processing. For that, the CPU needs to generate at least one STORE command for storing said result within said non-local device/peripheral.
- a single ALU command (e.g., addition, subtraction, etc.) is related to processing of data located within at least two registers.
- the CPU needs to generate at least two separate LOAD commands (each in a single CPU clock cycle) for loading the data required for processing.
- a multi-LOAD request can be generated in a single CPU clock cycle and then, in an additional clock cycle, one or more (destination) registers within the local register file system can be updated with new data values (with the results of ALU processing).
- LOAD or multi-LOAD command for loading data from an external register/memory means (device register, cache memory, etc.);
- ALU data processing command for executing various operations (e.g., addition, subtraction);
- STORE command for writing back the result of ALU processing into the corresponding non-local register/memory means. Even when working in a pipeline and avoiding data hazards, such data processing takes at least three CPU clock cycles.
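The LOAD/ALU/STORE sequence above can be sketched as follows (a toy model, not from the disclosure; the addresses and data values are invented):

```python
# Toy model of the conventional sequence: LOAD two memory-mapped values
# into the local register file, ADD them with the ALU, STORE the result
# back -- counting one clock cycle per instruction in the best case.

memory = {0x4000: 10, 0x4004: 32, 0x4008: 0}  # memory-mapped registers (invented)
regs = {}
cycles = 0

def load(reg, addr):
    global cycles
    regs[reg] = memory[addr]; cycles += 1

def alu_add(dst, a, b):
    global cycles
    regs[dst] = regs[a] + regs[b]; cycles += 1

def store(addr, reg):
    global cycles
    memory[addr] = regs[reg]; cycles += 1

load("r1", 0x4000)         # these two LOADs could be merged into a
load("r2", 0x4004)         # single multi-LOAD request
alu_add("r3", "r1", "r2")  # ALU data processing command
store(0x4008, "r3")        # write the result back to the peripheral
print(memory[0x4008], cycles)  # 42 4
```

With the multi-LOAD merged in, the count drops to three cycles, which is the lower bound the text gives for the conventional approach.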
- DMA: Direct Memory Access
- DMA operations can be conducted in parallel with CPU operations.
- dedicated hardware is required and the DMA engines need to be configured and enabled by the CPU; further, it is applicable only when no data processing (or substantially negligible data processing) is required.
- the cache memory means are divided into cache lines.
- a CPU generates a LOAD/STORE command (containing a memory address) for loading data from said memory address or storing data into said address.
- the cache memory (system) determines whether said memory address corresponds to it, and if so, whether the data related to said memory address is stored within one of its cache lines.
- when the memory address is related to the cache memory and the corresponding data is stored within one of the cache lines, it is called a "hit".
- when the memory address is not related to the cache memory, or when it is related to the cache memory but the corresponding data is not stored within one of the cache lines, it is called a "miss".
- a CPU is stalled in case of a cache miss, waiting for a cache hit.
- another thread can be executed while waiting for the cache hit.
- the corresponding external memory data (to which said memory address is related) needs to be cached for filling a cache line.
- a cache write back is executed, in order to free memory space for the new data. For this, the data stored within a cache line (inside the cache memory) is copied into the external memory (e.g., SDRAM (Synchronous Dynamic Random Access Memory)).
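The hit/miss decision and the write back described above can be sketched as follows (a minimal model under assumed parameters; it uses a naive oldest-first eviction rather than a real replacement policy, and set associativity is omitted):

```python
# Minimal sketch of hit/miss plus write back: an access is a hit when the
# containing line is already cached; on a miss with a full cache, an old
# line is copied back to external memory before the new line is filled.
# Line size, line count, and the eviction order are assumptions.

LINE_SIZE, NUM_LINES = 64, 4
external = {}   # stands in for external memory (e.g., SDRAM)
cache = {}      # line base address -> line data; at most NUM_LINES entries

def access(addr):
    base = addr - addr % LINE_SIZE            # base of the containing line
    if base in cache:
        return "hit"
    if len(cache) >= NUM_LINES:               # no free line: write back first
        victim = next(iter(cache))            # naive oldest-first victim
        external[victim] = cache.pop(victim)  # copy the line back out
    cache[base] = external.get(base, bytes(LINE_SIZE))  # fill the line
    return "miss"

print(access(0x100), access(0x104))  # miss hit -- same 64-byte line
```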
- US 6,178,482 discloses a system embedded with a processor, containing sets of cache lines for accessing cache memories, which are dynamically operated as different register sets for supplying source operands and in turn, accepting destination operands for instruction execution.
- the different register sets may be of the same or of different virtual register files, and if the different register sets are of different virtual register files, the different virtual register files may be of the same or of different architectures.
- the cache memories may be directly accessed by using cache addresses.
- one advantage of the present invention is that it significantly reduces the number of instructions and CPU clock cycles required for accessing and manipulating/processing (e.g., performing addition, subtraction, data moving, data shifting operations and the like) memory mapped data by providing a substantially direct cache memory access for one or more CPU execution units (for processing the data).
- the number of instructions and corresponding CPU clock cycles for processing the data that is stored within the cache memory can be reduced, for example, to a single instruction that takes a single CPU clock cycle, enabling generation of multiple requests (e.g., up to three requests) for accessing said cache memory.
- Another advantage of the present invention is that it provides a method, register file system, and processing unit (device), in which, for reducing the number of instructions and CPU clock cycles required for manipulating/processing memory mapped data, there is substantially no need to change the structure of the conventional CPU program word (conventional CPU instruction).
- Still another advantage of the present invention is that it eliminates the need to use conventional DMA engines.
- a further advantage of the present invention is that it provides a method and processing unit (device), in which CPU stalls (delays) are substantially prevented.
- the present invention relates to a method, register file system, and processing unit device configured to enable substantially direct access to cache memory means that are, at least partially, provided within said register file system (and within said processing unit device), thereby enabling substantially direct connectivity (communication) between said cache memory means and one or more execution units (such as ALUs).
- the register file system comprises: a) at least one cache memory system comprising a plurality of cache memory units assigned with memory addresses, said cache memory system configured to: a.1. receive at least one memory address and determine whether said at least one memory address is related to a cacheable memory comprising at least said plurality of cache memory units; and a.1.1. perform one or more of the following, if said at least one memory address is related to said cacheable memory: a.1.1.1. read data from one or more cache memory units that correspond to said at least one memory address, giving rise to the read data; and a.1.1.2. write data into the one or more cache memory units that correspond to said at least one memory address; and a.1.2.
- the outputted cache memory data is provided to at least one execution unit that is configured to receive said outputted cache memory data and process it, and further configured to write back data into the one or more cache memory units.
- the cacheable memory comprises one or more external cacheable memory units.
- the register file system further comprises a memory controller for enabling conveying data from the one or more external cacheable memory units into the one or more cache memory units.
- the register file system further comprises a memory controller for enabling writing back data from the one or more cache memory units into the one or more external cacheable memory units.
- the memory addresses that are received by means of the cache memory system are processing unit (PU) mapped addresses.
- PU processing unit
- the received PU mapped addresses are decoded or converted by means of one or more of the following: (a) the register file system; and (b) the cache memory system.
- the cache memory system further comprises a hit/miss resolve unit for determining whether the received one or more addresses relate to the cacheable memory and whether there is a cache hit request or cache miss request.
- the cache memory system further comprises a cache controller unit for controlling the operation of the cache memory system.
- the cache memory system further comprises one or more cache request handler units for determining for each cache miss request whether it is a cache miss on reading the data or on writing the data.
- the cache memory system further comprises a memory interface for enabling said cache memory system to interact with the one or more external cacheable memory units.
- the cache memory system further comprises a data cache memory device provided with the plurality of data cache memory units for storing data.
- the cache memory system further comprises a Least Recently Used (LRU) control unit for determining whether there is available storage space within the data cache memory device.
- LRU Least Recently Used
- the LRU control unit allocates the required storage space within the data cache memory device.
- the LRU control unit enables to perform a write back of data from the data cache memory device into the one or more external cacheable memory units.
- the cache memory system further comprises a task ready queue unit for keeping a status of current cache miss requests being handled by the cache memory system.
- the cache memory system further comprises a cache request queue unit configured to maintain a queue of cache miss requests.
- the processing unit (PU) device comprises: a) a register file system provided with: a.1. at least one cache memory system comprising a plurality of cache memory units having respective memory addresses, said cache memory system configured to: a.1.1. receive at least one memory address and determine whether said at least one memory address is related to a cacheable memory comprising at least said plurality of cache memory units; and a.1.1.1. perform one or more of the following, if said at least one memory address is related to said cacheable memory:
- the method of operating with a cache memory system comprises: a) receiving at least one memory address; b) determining whether said at least one memory address is related to a cacheable memory comprising at least a plurality of cache memory units; and b.1. performing one or more of the following, if said at least one memory address is related to said cacheable memory: b.1.1. reading and outputting data from one or more cache memory units that correspond to said at least one memory address, giving rise to the outputted cache memory data, and then receiving said outputted cache memory data and processing it; and b.1.2.
- Fig. 1A is a schematic illustration of a spread register file system, according to an embodiment of the present invention.
- Fig. 1B is a schematic illustration of a spread register file system, according to another embodiment of the present invention.
- Fig. 2 is a schematic illustration of a Data Cache Memory system (e.g., a peripheral/memory means), incorporating converters (decoders) for converting (decoding) one or more inputted CPU mapped addresses, according to an embodiment of the present invention.
- Fig. 3 is a sample flow chart of a Data Cache Memory system operation, according to an embodiment of the present invention.
- the term "spread register file” system or “SRF” system refers to the expanded (spread) register file according to the present invention, which can be related to the entire CPU memory map, thereby enabling substantially direct memory/peripheral access for one or more CPU execution units (for processing the data).
- the term "local register file system” or “LRF system” refers to the conventional CPU local register file system.
- the entire (complete) CPU memory map can comprise local registers (e.g., local CPU register files), cache memories, tightly coupled memories, on-chip/off-chip peripherals/memories (or registers) and any other conventional memory means.
- CPU central processing unit
- the term "processing" (or a similar term) refers to any data operation, such as data manipulation, data transfer, addition or subtraction of data, and the like.
- co-pending US provisional patent application no. 61/071,584, titled “Register File System and Method Thereof for Enabling a Substantially Direct Memory Access” discloses a processing unit (e.g., CPU, microprocessor and the like) that implements a SRF system and enables substantially direct memory access for one or more (CPU) execution units (for processing the data).
- "Improved Processing Unit Implementing both a Local Register File System and Spread Register File System" teaches implementing both a conventional local register file system (LRF system) and a spread register file system (SRF system) to be used by one or more processing units.
- the present invention presents a method, register file system, and processing unit (device) configured to enable substantially direct access to a cache memory (system) that is, at least partially, provided within said register file system (and within said processing unit (device)), thereby enabling substantially direct connectivity (communication) between said cache memory (system) and one or more execution units (such as ALUs).
- Fig. 1A is a schematic illustration of spread register file system 206, according to an embodiment of the present invention.
- Spread register file (SRF) system 206 receives at its input ports (not shown) one or more of the following addresses: the CPU mapped "source 1" address (MS1 address) provided to said SRF system over bus (line) 151, the CPU mapped "source 2" address (MS2 address) provided over line 152, and the CPU mapped "destination" address (MD address) provided over line 153 (each address is, for example, 32 bits long).
- these CPU mapped addresses are converted (decoded) by means of address converters (decoders) 320, 321 and 322, respectively, to addresses of corresponding peripheral/memory means, such as peripherals/memory means 301, 302, 303, ..., 310 (that can be, for example, cache memories, DDR (Double Data Rate) memories, SRAM (Static Random Access Memory) memories, hard disks, and any other memory means or any combination thereof).
- said address converters can convert the CPU mapped addresses in various ways based on different address converting functions/expressions.
- the above address can be converted according to an instruction (program word) opcode inputted from an instruction register (not shown) into a control unit 350 (operatively coupled to the peripherals/memory means), which generates corresponding control signals to address converters 320, 321 and 322 and to executing unit 130 (e.g., ALU): for example, if said instruction opcode relates to moving "source 1" data to the "destination" register/memory cell(s), then only address converters 320 and 322 can be activated.
- the converted "source 1" and “source 2" addresses are inputted into corresponding peripheral/memory means (e.g., Data Cache Memory system 301), which in turn outputs corresponding data stored in said addresses over "source 1" read bus 231 and "source 2" read bus 232.
- said data is processed (executed) by means of one or more execution units 130 (such as ALUs).
- the processed data (processing result) is provided over write back bus 233 to one or more peripheral/memory means (e.g., Data Cache Memory system 301 or 302) to be stored in corresponding converted destination addresses (CD addresses) within said one or more peripheral/memory means.
- the "source 1", “source 2”, and “destination” memory cells can be physically located within the same or within different peripheral/memory means.
- address converters 320, 321 and 322 further provide Write Enable (WE)/Chip Select (CS) signals (for example, binary “0” or “1") to each of said peripheral/memory means 301, 302, 303, ..., or 310 (for enabling reading or writing from or to said peripheral/memory means (data units) 301, 302, 303, ..., or 310 (N)).
- WE/CS commands can be provided to each of said peripheral/memory means when accessing each converted address (e.g., "source 1" converted address) within said each peripheral/memory means 301, 302, 303, ..., or 310 (N).
- CS: read command
- WE: write command
- address converters 320, 321 and 322 can be unified in a single address converter for converting CPU mapped "source 1", “source 2" and “destination” addresses into corresponding peripheral/memory means addresses.
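One possible address-converting function can be sketched as follows (an assumption for illustration only; the peripheral names and base-address ranges are invented, and actual converters may use different converting functions/expressions, as noted above):

```python
# Sketch of an assumed address converter: a CPU mapped address is matched
# against per-peripheral base-address ranges, yielding a chip-select for
# the owning peripheral/memory means and a converted (local) address
# within it. The layout below is invented for illustration.

PERIPHERALS = {                 # name -> (base address, size); assumed layout
    "data_cache_301": (0x0000_0000, 0x0001_0000),
    "ddr_302":        (0x2000_0000, 0x1000_0000),
}

def convert(cpu_addr: int) -> tuple[str, int]:
    """Return (selected peripheral, converted address) for a CPU address."""
    for name, (base, size) in PERIPHERALS.items():
        if base <= cpu_addr < base + size:
            return name, cpu_addr - base  # chip select + converted address
    raise ValueError("address not mapped to any peripheral/memory means")

print(convert(0x2000_0040))  # ('ddr_302', 64)
```

A unified converter, as described above, would apply the same lookup to the "source 1", "source 2", and "destination" addresses in turn.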
- Fig. 1B is a schematic illustration of spread register file system 206, according to another embodiment of the present invention.
- one or more peripherals/memory means such as Data Cache Memory system 301 can comprise address converters (decoders) 325 and as a result the address conversion (or the address decoding) is performed within said Data Cache Memory system 301.
- the one or more peripherals/memory means, such as Data Cache Memory system 301 can receive CPU mapped addresses and decode (or convert) them accordingly for determining corresponding addresses, in which the required data is stored (or is to be stored).
- WE/CS Enablers 320', 321' and 322' (which can be further unified in a single WE/CS Enabler) provide WE/CS commands to said Data Cache Memory system 301, and do not perform the address conversion.
- WE/CS commands can be generated by means of control unit 350.
- Data Cache Memory system 301 receives a CPU mapped address and determines, by means of the integrated address converter(s)/decoder(s) 325 (e.g., according to predefined base-addresses of said Data Cache Memory system 301), whether each received CPU mapped address is related to one or more memory cells provided within said peripheral/memory means or within another peripheral/memory means. It should be noted that the base-addresses of each peripheral/memory means can be further dynamically changed as needed.
- Fig. 2 is a schematic illustration of a Data Cache Memory system (e.g., peripheral/memory means 301), incorporating a converter(s) (decoder(s)) 325 for converting (decoding) one or more inputted CPU mapped addresses, according to an embodiment of the present invention.
- the Data Cache Memory system comprises: a Hit/Miss Resolve unit 110 configured to determine whether CPU mapped addresses (inputted into said Data Cache Memory system 301 over lines 151, 152 and 153) relate to a cacheable or non-cacheable memory and whether there is a cache hit or a cache miss, and further configured to send corresponding signals/commands to SRF Control unit 350 and to internal units of Data Cache Memory system 301; a Cache Controller unit 120 configured to control the operation of Data Cache Memory system 301, comprising one or more Cache Request Handler units 121, 122, 123, etc. configured to determine for each cache miss whether it is a cache miss on reading the data or on writing the data, and a LRU (Least Recently Used) Control unit 127 configured to determine whether there is available storage space within said Data Cache Memory system 301 and, if there is no (or insufficient) available storage space, to allocate the required storage space within said Data Cache Memory system 301 by enabling a write back process to be performed; a Memory Interface 130 configured to enable Data Cache Memory system 301 to interact with an external memory (e.g., peripherals/memory means 302, 303 (Fig. 1A)); a Data Cache Memory (device) 140 comprising a plurality of Data Cache Memory units 141, 142, 143, etc. for storing the data; and a Task Ready Queue unit 150 configured to keep a status of current cache misses being handled by Data Cache Memory system 301 and to send an indication to SRF Control unit 350 when said Data Cache Memory system 301 accomplishes handling such cache misses.
- a Cache Request Queue unit 125 can be provided within said Cache Controller 120 for maintaining a queue of cache miss requests.
- the CPU mapped "source 1", "source 2" and "destination" addresses are inputted into Hit/Miss Resolve unit 110 over lines 151, 152 and 153, respectively. Then, Hit/Miss Resolve unit 110 checks whether said "source 1", "source 2" and "destination" addresses are related to the memory that is defined as the cacheable memory, which can further comprise external memories, such as peripherals/memory means 302, 303, etc. If so, then Hit/Miss Resolve unit 110 checks whether the data that corresponds to said addresses is cached in, i.e., stored within its Data Cache Memory 140 that has a plurality of Data Cache Memory units 141, 142, etc.
- said CPU mapped "source 1", “source 2" and/or “destination” addresses are converted (decoded) by means of converters (decoders) 325, which can be provided, for example, within Hit/Miss Resolve unit 110, giving rise to "source 1", “source 2" and “destination” converted addresses.
- Hit/Miss Resolve unit 110 conducts a search within a Data Cache Memory database 111, looking for said converted "source 1", "source 2" and "destination" addresses (being related to the cacheable memory). After conducting the search, Hit/Miss Resolve unit 110 has an indication of whether each of said converted addresses is related to Data Cache Memory 140 or to external memories, such as peripherals/memory means 302, 303, etc. Further, Hit/Miss Resolve unit 110 receives an indication from said database 111 of exactly which Data Cache Memory unit or external peripheral/memory means each converted address is related to. It should be noted that database 111 is continuously updated by means of Cache Controller 120.
- Hit/Miss Resolve unit 110 conducts multiple searches (for each converted address) within database 111 in parallel.
- a cache hit occurs if all converted/decoded addresses (e.g., "source 1", “source 2" and/or "destination” converted addresses) are cacheable and cached in, i.e. all converted/decoded addresses are related to one or more Data Cache Memory units 141, 142, 143, etc. If at least one converted address is not cached in, while such an address is related to a cacheable memory, then it is considered to be a cache miss. It should be noted that a single ALU instruction may result, for example, in up to three cache misses (each cache miss for each converted address).
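The resolve step above can be sketched as follows (illustrative only, not the patent's implementation): a single instruction carries up to three addresses; it is a hit only when every cacheable address is currently cached, and each uncached cacheable address contributes one miss:

```python
# Sketch of the hit/miss resolve decision for up to three converted
# addresses ("source 1", "source 2", "destination"). The address values
# and the cacheable/cached sets are invented for illustration.

def resolve(addresses, cacheable, cached):
    """Return (hit, miss_count, non_cacheable_addresses)."""
    misses = [a for a in addresses if a in cacheable and a not in cached]
    non_cacheable = [a for a in addresses if a not in cacheable]
    hit = not misses          # hit only if every cacheable address is cached
    return hit, len(misses), non_cacheable

cacheable = {0x100, 0x200, 0x300}   # addresses related to cacheable memory
cached = {0x100}                    # addresses currently in a cache line
print(resolve([0x100, 0x200, 0x300], cacheable, cached))  # (False, 2, [])
```

Here one instruction produces two misses; with all three operand addresses uncached, it would produce three, matching the up-to-three-misses remark above.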
- Control unit 350 enables the CPU pipeline execution to continue substantially without any CPU stalls. Also, if at least one of the converted (decoded) addresses is related to a cacheable memory, and there is a cache hit for this at least one converted address, then the pipeline execution is likewise not stalled and continues accordingly. On the other hand, if there is a cache miss, then according to an embodiment of the present invention, the execution is stalled, and a cache miss indication is forwarded from said Hit/Miss Resolve unit 110 to Cache Controller 120 over line 113. In addition, corresponding cache hit/miss indications are sent to SRF Control unit 350 over line 106.
- a cache miss indication (signal/command) is received by Cache Controller 120, then one or more Cache Request Handler units 121, 122, etc. check whether it is a cache miss on reading or writing the data, i.e. whether a cache miss is related to reading the data from the "source 1" or "source 2" address, or is related to writing the data into a "destination" address.
- LRU (Least Recently Used) Control unit 127 checks whether there are available memory cells in one or more Data Cache Memory units 141, 142, etc. for storing the data (in case of a cache miss on reading the data or in case of a cache miss on writing the data).
- LRU Control unit 127 controls allocation of the available cache memory. For this, LRU control unit 127 accesses Data Cache Memory Database 111 that contains information regarding the available cache memory, and is continuously updated by means of Cache Controller 120. Also, LRU control unit 127 holds the information regarding cache lines (of the Data Cache Memory units 141, 142, 143, etc.), which have not been used for a relatively long period of time (the "Least Recently Used" cache lines).
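The recency bookkeeping that an LRU control unit maintains can be modeled with an ordered map (a software sketch only; the class and method names are assumptions, not the patent's structures):

```python
from collections import OrderedDict

class LRUTracker:
    """Tracks how recently each cache line was used; oldest entry first."""

    def __init__(self):
        self._order = OrderedDict()

    def touch(self, line):
        # Move a line to the most-recently-used position on each access.
        self._order.pop(line, None)
        self._order[line] = True

    def least_recently_used(self):
        # The cache line that has not been used for the longest time.
        return next(iter(self._order))
```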
- Cache Controller 120 issues a fetch request to Memory Interface 130, which reads the corresponding "missed" data from the External/System Memory (e.g., peripherals/memory means 302, L2/L3 cache memories, hard disk and the like) by using Memory Controller 131. Following this, the read data is written into the selected cache line(s) of Data Cache Memory units (such as units 141, 142 and 143).
- Data Cache Memory Database 111 is updated with the new cache line(s) allocations.
- it will be interpreted by means of Hit/Miss Resolve unit 110 as a cache "hit".
- LRU Control unit 127 selects a Data Cache Memory unit, which is the least recently used, for storing the required data.
- LRU Control unit 127 enables performing a cache write back process of the least recently used data; it should be noted that during the cache write back, the least recently used data is written back from one or more corresponding Data Cache Memory units (from one or more cache lines) into the External Memory by means of Memory Interface 130.
- the write back process is performed (in case when there is no (or insufficient) available memory storage space) both when there is a cache miss on reading the data (e.g., from the converted "source 1" or "source 2" addresses) and when there is a cache miss on writing the data (e.g., to the converted "destination" address).
- LRU Control unit 127 selects said least recently used data to be written back and issues a cache write back request to Memory Interface 130, which in turn accesses Data Cache Memory units 140, reads the corresponding cache lines and provides the write back data from said cache lines into the External/System Memory (e.g., into peripherals/memory means 302 or 303, L2/L3 cache memories and the like) by means of a Memory Controller 131.
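The write back sequence described above (select the LRU victim, copy its data out to the External/System Memory, free the line) can be sketched as (function and variable names are illustrative assumptions):

```python
def write_back_lru(cache, lru_order, external_memory):
    """Evict the least recently used cache line: write its data back to
    external memory, then free the line for reallocation.

    cache           -- dict mapping line address -> line data
    lru_order       -- list of line addresses, least recently used first
    external_memory -- dict modeling the external/system memory
    """
    victim = lru_order.pop(0)                    # least recently used line
    external_memory[victim] = cache.pop(victim)  # write the data back
    return victim                                # now-available cache line
```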
- data related to the converted "source 1", "source 2" addresses is outputted from Data Cache Memory 140 over "source 1" and "source 2" read buses 231, 232, respectively, to be processed by means of at least one execution unit 130 (Fig. 1B). Further, the processed data is provided from said execution unit 130 into the corresponding cache line, related to the converted "destination" address, of said Data Cache Memory 140 over write back bus 233.
- the cache "miss” is handled in parallel by means of a plurality of Cache Request Handler units 121, 122, etc.
- Memory Controller 131 and Data Cache Memory 140 handle multiple cache "miss” requests in parallel.
- the Hit/Miss Resolve unit 110 can be provided within Cache Controller 120.
- Data Cache Memory database 111 can be provided within Cache Controller 120 or within another unit of Data Cache Memory system 301. Further, according to still another embodiment of the present invention, Data Cache Memory Database 111 also stores CPU mapped addresses related to the cacheable memory.
- SRF Control unit 350 controls the CPU pipeline process by generating required control signals during the CPU pipeline stages.
- Data Cache Memory system 301 is a write-through cache system, in which data written into Data Cache Memory 140 is also written into the External/System Memory (e.g., peripherals/memory means 302, L2/L3 cache memories, hard disk and the like).
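The write-through policy named above can be stated in a few lines (a sketch under the assumption that both memories are modeled as simple address-to-data maps):

```python
def write_through(addr, data, cache, external_memory):
    """Write-through policy: every write is applied to the cache and,
    as part of the same operation, to the backing external/system
    memory, so the two never disagree on written data."""
    cache[addr] = data
    external_memory[addr] = data
```

The trade-off, relative to a write-back policy, is that every write costs an external-memory access, but no dirty-line bookkeeping is needed on eviction.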
- Fig. 3 is a sample flow chart 200 of Data Cache Memory system 301 (Fig. 1B) operation, according to an embodiment of the present invention.
- one or more (e.g., up to three) CPU mapped addresses related to SRF peripheral/memory means are received and converted/decoded by means of Hit/Miss Resolve unit 110 (Fig. 2).
- said Hit/Miss Resolve unit 110 checks whether these decoded addresses are related to the cacheable memory. If not, then it sends a corresponding signal to SRF Control unit 350 (Fig. 2), indicating that these decoded addresses are not related to the cacheable memory, and that the CPU pipeline should be continued accordingly.
- Hit/Miss Resolve unit 110 checks to what cacheable memory unit each of said addresses is related (e.g., to Data Cache Memory unit 141, 142, 143, etc. (Fig. 2) or to an External Memory unit, such as peripherals/memory means 302, 303 (Fig. 1B)), and further checks at step 210, whether there is a cache hit. For this, Hit/Miss Resolve unit 110 conducts a search within Data Cache Memory Database 111 for said converted/decoded addresses.
- LRU Control unit 127 accesses Data Cache Memory database 111, which also comprises information regarding all available memory cells (cache lines) of Data Cache Memory units 141, 142, etc. (of Data Cache Memory 140 (Fig. 2)). If there are no available memory cells (cache lines) in one or more Data Cache Memory units 141, 142, etc. (i.e., there is no required available memory storage space), then LRU Control unit 127 enables performing a cache write back process of the least recently used data by sending a cache write back request to Memory Interface 130 (Fig. 2), at step 230.
- Memory Interface 130 accesses Data Cache Memory 140, reads the corresponding cache lines and provides the write back data from said cache lines into the External/System Memory (e.g., into peripherals/memory means 302 or 303, L2/L3 cache memories and the like) by means of Memory Controller 131 (Fig. 2).
- the one or more Cache Request Handler units 121, 122, etc. (Fig. 2) check whether it is a cache miss on reading or writing the data, i.e. whether a cache miss is related to reading the data from the "source 1" or "source 2" address, or is related to writing the data into a "destination" address.
- at step 245, said data is fetched by means of Memory Interface 130 into one or more corresponding cache lines of Data Cache Memory 140, which have the required available memory space.
- Cache Controller 120 issues a fetch request to Memory Interface 130, which reads the corresponding "missed" data from the External/System Memory (e.g., peripherals/memory means 302, L2/L3 cache memories, hard disk and the like) by using Memory Controller 131. After this, the read data is written into the selected cache line(s) of Data Cache Memory (such as Data Cache Memory units 141, 142 and 143).
- Memory Interface 130 sends a corresponding indication to Cache Controller 120 and to Hit/Miss Resolve unit 110, acknowledging to them that the "missed" data is written into the selected cache line(s), and updating Data Cache Memory Database 111 (Fig. 2) with the new cache line(s) allocations.
- it will be interpreted by means of Hit/Miss Resolve unit 110 as a cache "hit".
- one or more Cache Request Handler units send corresponding signals to Hit/Miss Resolve unit 110 at step 255, and further to Task Ready Queue unit 150, acknowledging to them that handling of the current cache miss (on writing the data) is accomplished, and instructing the CPU pipeline to proceed with its execution accordingly by issuing a "task ready" indication to SRF Control unit 350, at step 260.
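The read-miss path of the flow chart (make room by writing back the LRU line if the cache is full, fetch the missed data from external memory, record the new allocation so the next lookup is a hit) can be sketched as a simplified single-level model (all names are illustrative assumptions):

```python
def handle_read_miss(addr, cache, capacity, lru_order, external_memory):
    """Service a read miss and return the fetched data.

    cache           -- dict mapping line address -> data
    capacity        -- number of lines the cache can hold
    lru_order       -- list of line addresses, least recently used first
    external_memory -- dict modeling the external/system memory
    """
    if len(cache) >= capacity:                       # no free cache line
        victim = lru_order.pop(0)                    # least recently used
        external_memory[victim] = cache.pop(victim)  # write back first
    cache[addr] = external_memory[addr]              # fetch the missed data
    lru_order.append(addr)                           # record new allocation
    return cache[addr]                               # next lookup is a hit
```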
- the number of instructions and CPU clock cycles required for accessing and manipulating/processing the data is significantly reduced, compared to the prior art, by providing substantially direct Data Cache Memory system 301 access for one or more CPU execution units 130 (Fig. 1B).
- the number of instructions and corresponding CPU clock cycles for processing the data provided within/from said Data Cache Memory system 301 can be reduced to a single instruction that takes a single CPU clock cycle, further enabling generation of multiple requests (e.g., up to three requests) for accessing said Data Cache Memory system 301.
- CPU stalls are substantially prevented.
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US12938508P | 2008-06-23 | 2008-06-23 | |
US61/129,385 | 2008-06-23 | | |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2009156920A1 (en) | 2009-12-30 |
Family
ID=41444100
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/IB2009/052610 WO2009156920A1 (en) | Method, register file system, and processing unit device enabling substantially direct cache memory access | 2008-06-23 | 2009-06-18 |
Country Status (1)
Country | Link |
---|---|
WO (1) | WO2009156920A1 (en) |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US5890216A (en) * | 1995-04-21 | 1999-03-30 | International Business Machines Corporation | Apparatus and method for decreasing the access time to non-cacheable address space in a computer system |
US6178482B1 (en) * | 1997-11-03 | 2001-01-23 | Brecis Communications | Virtual register sets |
US20020178334A1 (en) * | 2000-06-30 | 2002-11-28 | Salvador Palanca | Optimized configurable scheme for demand based resource sharing of request queues in a cache controller |
US6564301B1 (en) * | 1999-07-06 | 2003-05-13 | Arm Limited | Management of caches in a data processing apparatus |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US10860326B2 (en) | Multi-threaded instruction buffer design | |
US5375216A (en) | Apparatus and method for optimizing performance of a cache memory in a data processing system | |
US6151662A (en) | Data transaction typing for improved caching and prefetching characteristics | |
US5249286A (en) | Selectively locking memory locations within a microprocessor's on-chip cache | |
JPH10187533A (en) | Cache system, processor, and method for operating processor | |
WO2006132798A2 (en) | Microprocessor including a configurable translation lookaside buffer | |
US9092346B2 (en) | Speculative cache modification | |
US20180095892A1 (en) | Processors, methods, systems, and instructions to determine page group identifiers, and optionally page group metadata, associated with logical memory addresses | |
US20170286118A1 (en) | Processors, methods, systems, and instructions to fetch data to indicated cache level with guaranteed completion | |
US20140189192A1 (en) | Apparatus and method for a multiple page size translation lookaside buffer (tlb) | |
US20240105260A1 (en) | Extended memory communication | |
US10761979B2 (en) | Bit check processors, methods, systems, and instructions to check a bit with an indicated check bit value | |
CN113900710B (en) | Expansion memory assembly | |
US9405545B2 (en) | Method and apparatus for cutting senior store latency using store prefetching | |
EP0459233A2 (en) | Selectively locking memory locations within a microprocessor's on-chip cache | |
US20030196072A1 (en) | Digital signal processor architecture for high computation speed | |
EP4020229A1 (en) | System, apparatus and method for prefetching physical pages in a processor | |
US10261909B2 (en) | Speculative cache modification | |
CN112559037B (en) | Instruction execution method, unit, device and system | |
JPH02242429A (en) | Pipeline floating point load instruction circuit | |
WO2009156920A1 (en) | Method, register file system, and processing unit device enabling substantially direct cache memory access | |
US11481317B2 (en) | Extended memory architecture | |
WO2009136402A2 (en) | Register file system and method thereof for enabling a substantially direct memory access | |
CN111475010B (en) | Pipeline processor and power saving method | |
US20230153114A1 (en) | Data processing system having distrubuted registers |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 09769729; Country of ref document: EP; Kind code of ref document: A1 |
| NENP | Non-entry into the national phase | Ref country code: DE |
| WPC | Withdrawal of priority claims after completion of the technical preparations for international publication | Ref document number: 61/129,385; Country of ref document: US; Date of ref document: 20101220; Free format text: WITHDRAWN AFTER TECHNICAL PREPARATION FINISHED |
| 32PN | Ep: public notification in the ep bulletin as address of the adressee cannot be established | Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 24.02.12) |
| 122 | Ep: pct application non-entry in european phase | Ref document number: 09769729; Country of ref document: EP; Kind code of ref document: A1 |