WO2009156920A1 - Method, register file system, and processing unit device enabling substantially direct cache memory access

Method, register file system, and processing unit device enabling substantially direct cache memory access

Info

Publication number
WO2009156920A1
Authority
WO
WIPO (PCT)
Prior art keywords
memory
cache
cache memory
data
cacheable
Application number
PCT/IB2009/052610
Other languages
French (fr)
Inventor
Yoav Peleg
Original Assignee
Cosmologic Ltd.
Application filed by Cosmologic Ltd. filed Critical Cosmologic Ltd.
Publication of WO2009156920A1 publication Critical patent/WO2009156920A1/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 13/00 Interconnection of, or transfer of information or other signals between, memories, input/output devices or central processing units
    • G06F 13/14 Handling requests for interconnection or transfer
    • G06F 13/20 Handling requests for interconnection or transfer for access to input/output bus
    • G06F 13/28 Handling requests for interconnection or transfer for access to input/output bus using burst mode transfer, e.g. direct memory access DMA, cycle steal

Definitions

  • the cacheable memory comprises one or more external cacheable memory units.
  • the register file system further comprises a memory controller for enabling conveying data from the one or more external cacheable memory units into the one or more cache memory units.
  • the register file system further comprises a memory controller for enabling writing back data from the one or more cache memory units into the one or more external cacheable memory units.
  • the memory addresses that are received by means of the cache memory system are processing unit (PU) mapped addresses.
  • the received PU mapped addresses are decoded or converted by means of one or more of the following: (a) the register file system; and (b) the cache memory system.
  • the cache memory system further comprises a hit/miss resolve unit for determining whether the received one or more addresses relate to the cacheable memory and whether there is a cache hit request or cache miss request.
  • the cache memory system further comprises a cache controller unit for controlling the operation of the cache memory system.
  • the cache memory system further comprises one or more cache request handler units for determining for each cache miss request whether it is a cache miss on reading the data or on writing the data.
  • the cache memory system further comprises a memory interface for enabling said cache memory system to interact with the one or more external cacheable memory units.
  • the cache memory system further comprises a data cache memory device provided with the plurality of data cache memory units for storing data.
  • the cache memory system further comprises a Least Recently Used (LRU) control unit for determining whether there is available storage space within the data cache memory device.
  • the LRU control unit allocates the required storage space within the data cache memory device.
  • the LRU control unit enables to perform a write back of data from the data cache memory device into the one or more external cacheable memory units.
  • the cache memory system further comprises a task ready queue unit for keeping a status of current cache miss requests being handled by the cache memory system.
  • the cache memory system further comprises a cache request queue unit configured to maintain a queue of cache miss requests.
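Purely as an illustrative summary of the components enumerated in the bullets above, the following C sketch groups them into one structure; the reference numerals in the comments mirror the figures, while all field names, types, and sizes are invented assumptions:

```c
/* Illustrative grouping of the enumerated components; the numerals mirror
 * the figures, everything else (names, counts, sizes) is invented. */
#define NUM_HANDLERS  4    /* hypothetical number of Cache Request Handlers  */
#define NUM_UNITS     8    /* hypothetical number of Data Cache Memory units */
#define QUEUE_DEPTH  16    /* hypothetical queue depth                       */
#define LINE_BYTES   64    /* hypothetical cache line size                   */

struct cache_request { unsigned addr; int is_write; };

struct data_cache_memory_system {
    struct { int last_was_hit; }               hit_miss_resolve;          /* unit 110 */
    struct {                                                              /* unit 120 */
        struct { int busy; }                   handlers[NUM_HANDLERS];    /* 121, 122, ... */
        struct cache_request                   request_queue[QUEUE_DEPTH];/* unit 125 */
        struct { unsigned last_used[NUM_UNITS]; } lru_control;            /* unit 127 */
    } cache_controller;
    struct { int fetch_pending; }              memory_interface;          /* unit 130 */
    struct { unsigned char line[LINE_BYTES]; } units[NUM_UNITS];          /* 141, 142, ... */
    struct cache_request                       task_ready_queue[QUEUE_DEPTH]; /* unit 150 */
};
```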
  • the processing unit (PU) device comprises: a) a register file system provided with: a.1. at least one cache memory system comprising a plurality of cache memory units having respective memory addresses, said cache memory system configured to: a.1.1. receive at least one memory address and determine whether said at least one memory address is related to a cacheable memory comprising at least said plurality of cache memory units; and a.1.1.1. perform one or more of the following, if said at least one memory address is related to said cacheable memory: read data from the one or more cache memory units that correspond to said at least one memory address; and write data into the one or more cache memory units that correspond to said at least one memory address.
  • the method of operating with a cache memory system comprises: a) receiving at least one memory address; b) determining whether said at least one memory address is related to a cacheable memory comprising at least a plurality of cache memory units; and b.1. performing one or more of the following, if said at least one memory address is related to said cacheable memory: b.1.1. reading and outputting data from one or more cache memory units that correspond to said at least one memory address, giving rise to the outputted cache memory data, and then receiving said outputted cache memory data and processing it; and b.1.2. writing data into the one or more cache memory units that correspond to said at least one memory address.
  • Fig. 1A is a schematic illustration of a spread register file system, according to an embodiment of the present invention.
  • Fig. 1B is a schematic illustration of a spread register file system, according to another embodiment of the present invention.
  • Fig. 2 is a schematic illustration of a Data Cache Memory system (e.g., a peripheral/memory means), incorporating converters (decoders) for converting (decoding) one or more inputted CPU mapped addresses, according to an embodiment of the present invention.
  • Fig. 3 is a sample flow chart of a Data Cache Memory system operation, according to an embodiment of the present invention.
  • the term "spread register file” system or “SRF” system refers to the expanded (spread) register file according to the present invention, which can be related to the entire CPU memory map, thereby enabling substantially direct memory/peripheral access for one or more CPU execution units (for processing the data).
  • the term "local register file system” or “LRF system” refers to the conventional CPU local register file system.
  • the entire (complete) CPU memory map can comprise local registers (e.g., local CPU register files), cache memories, tightly coupled memories, on-chip/off-chip peripherals/memories (or registers) and any other conventional memory means.
  • the term "processing" (or a similar term) refers to any data operation, such as data manipulation, data transfer, addition or subtraction of data, and the like.
  • co-pending US provisional patent application no. 61/071,584, titled “Register File System and Method Thereof for Enabling a Substantially Direct Memory Access” discloses a processing unit (e.g., CPU, microprocessor and the like) that implements a SRF system and enables substantially direct memory access for one or more (CPU) execution units (for processing the data).
  • "Improved Processing Unit Implementing both a Local Register File System and Spread Register File System" teaches implementing both a conventional local register file system (LRF system) and a spread register file system (SRF system) to be used by one or more processing units.
  • the present invention presents a method, register file system, and processing unit (device) configured to enable substantially direct access to a cache memory (system) that is, at least partially, provided within said register file system (and within said processing unit (device)), thereby enabling substantially direct connectivity (communication) between said cache memory (system) and one or more execution units (such as ALUs).
  • Fig. 1A is a schematic illustration of spread register file system 206, according to an embodiment of the present invention.
  • Spread register file (SRF) system 206 receives at its input ports (not shown) one or more of the following addresses: CPU mapped "source 1" address (MS1 address) provided to said SRF system over bus (line) 151, CPU mapped "source 2" address (MS2 address) provided over line 152, and CPU mapped "destination" address (MD address) provided over line 153 (each address is, for example, 32 bits long).
  • these CPU mapped addresses are converted (decoded) by means of address converters (decoders) 320, 321 and 322, respectively, to addresses of corresponding peripheral/memory means, such as peripherals/memory means 301, 302, 303, ..., 310 (that can be, for example, cache memories, DDR (Double Data Rate) memories, SRAM (Static Random Access Memory) memories, hard disks, and any other memory means or any combination thereof).
  • said address converters can convert the CPU mapped addresses in various ways based on different address converting functions/expressions.
  • the above addresses can be converted according to an instruction (program word) opcode inputted from an instruction register (not shown) into a control unit 350 (operatively coupled to the peripherals/memory means), which generates corresponding control signals to address converters 320, 321 and 322 and to execution unit 130 (e.g., ALU): for example, if said instruction opcode relates to moving "source 1" data to the "destination" register/memory cell(s), then only address converters 320 and 322 can be activated.
  • the converted "source 1" and “source 2" addresses are inputted into corresponding peripheral/memory means (e.g., Data Cache Memory system 301), which in turn outputs corresponding data stored in said addresses over "source 1" read bus 231 and "source 2" read bus 232.
  • said data is processed (executed) by means of one or more execution units 130 (such as ALUs).
  • the processed data (processing result) is provided over write back bus 233 to one or more peripheral/memory means (e.g., Data Cache Memory system 301 or 302) to be stored in corresponding converted destination addresses (CD addresses) within said one or more peripheral/memory means.
  • the "source 1", “source 2”, and “destination” memory cells can be physically located within the same or within different peripheral/memory means.
  • address converters 320, 321 and 322 further provide Write Enable (WE)/Chip Select (CS) signals (for example, binary “0” or “1") to each of said peripheral/memory means 301, 302, 303, ..., or 310 (for enabling reading or writing from or to said peripheral/memory means (data units) 301, 302, 303, ..., or 310 (N)).
  • WE/CS commands can be provided to each of said peripheral/memory means when accessing each converted address (e.g., "source 1" converted address) within said each peripheral/memory means 301, 302, 303, ..., or 310 (N).
  • for example, a CS signal can serve as a read command, and a WE signal as a write command.
  • address converters 320, 321 and 322 can be unified in a single address converter for converting CPU mapped "source 1", “source 2" and “destination” addresses into corresponding peripheral/memory means addresses.
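As a rough model of this address conversion, the following C sketch matches a CPU mapped address against per-peripheral base addresses and yields a chip-select plus a converted (local) address; the table contents and function names are invented examples, not values from the patent:

```c
#include <stdint.h>
#include <stddef.h>

/* Each region describes one peripheral/memory means reachable through the
 * spread register file; the base addresses below are made up for illustration. */
struct region { uint32_t base, size; int chip_select; };

static const struct region map[] = {
    { 0x10000000u, 0x8000u, 0 },   /* e.g., Data Cache Memory system 301 */
    { 0x20000000u, 0x1000u, 1 },   /* e.g., peripheral/memory means 302  */
};

/* Returns the chip-select and writes the converted (local) address,
 * or -1 if the CPU mapped address matches no region. */
static int convert(uint32_t cpu_addr, uint32_t *local)
{
    for (size_t i = 0; i < sizeof map / sizeof map[0]; i++) {
        if (cpu_addr - map[i].base < map[i].size) {  /* unsigned range check */
            *local = cpu_addr - map[i].base;
            return map[i].chip_select;
        }
    }
    return -1;
}
```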
  • Fig. 1B is a schematic illustration of spread register file system 206, according to another embodiment of the present invention.
  • one or more peripherals/memory means such as Data Cache Memory system 301 can comprise address converters (decoders) 325 and as a result the address conversion (or the address decoding) is performed within said Data Cache Memory system 301.
  • the one or more peripherals/memory means, such as Data Cache Memory system 301 can receive CPU mapped addresses and decode (or convert) them accordingly for determining corresponding addresses, in which the required data is stored (or is to be stored).
  • WE/CS Enablers 320', 321' and 322' (which can be further unified in a single WE/CS Enabler) provide WE/CS commands to said Data Cache Memory system 301, and do not perform the address conversion.
  • WE/CS commands can be generated by means of control unit 350.
  • Data Cache Memory system 301 receives a CPU mapped address and determines by means of the integrated address converter(s)/decoder(s) 325 (e.g., according to predefined base-addresses of said Data Cache Memory system 301), whether each received CPU mapped address is related to one or more memory cells provided within said peripheral/memory means or within another peripheral/memory means. It should be noted that the base-addresses of each peripheral/memory means can be further dynamically changed as needed.
  • Fig. 2 is a schematic illustration of a Data Cache Memory system (e.g., peripheral/memory means 301), incorporating a converter(s) (decoder(s)) 325 for converting (decoding) one or more inputted CPU mapped addresses, according to an embodiment of the present invention.
  • the Data Cache Memory system comprises: a Hit/Miss Resolve unit 110 configured to determine whether CPU mapped addresses (inputted into said Data Cache Memory system 301 over lines 151, 152 and 153) relate to a cacheable or non-cacheable memory and whether there is a cache hit or cache miss, and further configured to send corresponding signals/commands to SRF Control unit 350 and to internal units of Data Cache Memory system 301; a Cache Controller unit 120 configured to control the operation of Data Cache Memory system 301, comprising: one or more Cache Request Handler units 121, 122, 123, etc. configured to determine for each cache miss whether it is a cache miss on reading the data or on writing the data, and comprising an LRU (Least Recently Used) Control unit 127 configured to determine whether there is available storage space within said Data Cache Memory system 301, and if there is no (or insufficient) available storage space, then allocating the required storage space within said Data Cache Memory system 301 by enabling a write back process to be performed; a Memory Interface 130 configured to enable Data Cache Memory system 301 to interact with an external memory (e.g., peripherals/memory means 302, 303 (Fig. 1B)); a Data Cache Memory (device) 140 comprising a plurality of Data Cache Memory units 141, 142, 143, etc. for storing the data; and a Task Ready Queue unit 150 configured to keep a status of current cache misses being handled by Data Cache Memory system 301, and to send an indication to SRF Control unit 350 when said Data Cache Memory system 301 accomplishes handling such cache misses.
  • a Cache Request Queue unit 125 can be provided within said Cache Controller 120 for maintaining a queue of cache miss requests.
  • CPU mapped "source 1", "source 2" and "destination" addresses are inputted into Hit/Miss Resolve unit 110 over lines 151, 152 and 153, respectively. Then, Hit/Miss Resolve unit 110 checks whether said "source 1", "source 2" and "destination" addresses are related to the memory that is defined as the cacheable memory, which can further comprise external memories, such as peripherals/memory means 302, 303, etc. If so, then Hit/Miss Resolve unit 110 checks whether the data that corresponds to said addresses is cached in, i.e., within its Data Cache Memory 140 that has a plurality of Data Cache Memory units 141, 142, etc.
  • said CPU mapped "source 1", “source 2" and/or “destination” addresses are converted (decoded) by means of converters (decoders) 325, which can be provided, for example, within Hit/Miss Resolve unit 110, giving rise to "source 1", “source 2" and “destination” converted addresses.
  • Hit/Miss Resolve unit 110 conducts a search within a Data Cache Memory database 111, looking for said converted "source 1", “source 2" and "destination” addresses (being related to the cacheable memory). After conducting a search, Hit/Miss Resolve unit 110 has an indication whether each of said converted addresses is related to Data Cache Memory 140 or to external memories, such as peripherals/memory means 302, 303, etc. Further, Hit/Miss Resolve unit 110 receives an indication from said database 111 to exactly what Data Cache Memory unit or external peripherals/memory means said each converted address is related. It should be noted that database 111 is continuously updated by means of Cache Controller 120.
  • Hit/Miss Resolve unit 110 conducts multiple searches (for each converted address) within database 111 in parallel.
  • a cache hit occurs if all converted/decoded addresses (e.g., "source 1", “source 2" and/or "destination” converted addresses) are cacheable and cached in, i.e. all converted/decoded addresses are related to one or more Data Cache Memory units 141, 142, 143, etc. If at least one converted address is not cached in, while such an address is related to a cacheable memory, then it is considered to be a cache miss. It should be noted that a single ALU instruction may result, for example, in up to three cache misses (each cache miss for each converted address).
  • if the converted (decoded) addresses are not related to the cacheable memory, Control unit 350 enables the CPU pipeline execution to continue substantially without any CPU stalls. Also, if at least one of the converted (decoded) addresses is related to a cacheable memory, and there is a cache hit for this at least one converted address, then the pipeline execution is also not stalled and is continued accordingly. On the other hand, if there is a cache miss, then according to an embodiment of the present invention, the execution is stalled, and a cache miss indication is forwarded from said Hit/Miss Resolve unit 110 to Cache Controller 120 over line 113. In addition, corresponding cache hit/miss indications are sent to SRF Control unit 350 over line 106.
  • a cache miss indication (signal/command) is received by Cache Controller 120, then one or more Cache Request Handler units 121, 122, etc. check whether it is a cache miss on reading or writing the data, i.e. whether a cache miss is related to reading the data from the "source 1" or "source 2" address, or is related to writing the data into a "destination" address.
  • LRU (Least Recently Used) Control unit 127 checks whether there are available memory cells in one or more Data Cache Memory units 141, 142, etc. for storing the data (in case of a cache miss on reading the data or in case of a cache miss on writing the data).
  • LRU Control unit 127 controls allocation of the available cache memory. For this, LRU control unit 127 accesses Data Cache Memory Database 111 that contains information regarding the available cache memory, and is continuously updated by means of Cache Controller 120. Also, LRU control unit 127 holds the information regarding cache lines (of the Data Cache Memory units 141, 142, 143, etc.), which have not been used for a relatively long period of time (the "Least Recently Used" cache lines).
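A minimal sketch of such an LRU choice, assuming per-line last-use timestamps as an invented stand-in for the bookkeeping described for Database 111:

```c
#include <stdint.h>
#include <stddef.h>

#define NUM_LINES 8                    /* hypothetical number of cache lines */

static uint64_t last_used[NUM_LINES];  /* updated on every access            */
static uint64_t now;                   /* monotonically increasing "clock"   */

/* Record that a cache line was just accessed. */
static void touch(size_t line) { last_used[line] = ++now; }

/* Pick the least recently used line as the write-back / eviction victim. */
static size_t select_victim(void)
{
    size_t victim = 0;
    for (size_t i = 1; i < NUM_LINES; i++)
        if (last_used[i] < last_used[victim])
            victim = i;
    return victim;
}
```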
  • Cache Controller 120 issues a fetch request to Memory Interface 130, which reads the corresponding "missed" data from the External/System Memory (e.g., peripherals/memory means 302, L2/L3 cache memories, hard disk and the like) by using Memory Controller 131. Following this, the read data is written into the selected cache line(s) of Data Cache Memory units (such as units 141, 142 and 143).
  • Data Cache Memory Database 111 is updated with the new cache line(s) allocations.
  • as a result, when this memory address is subsequently accessed, it will be interpreted by means of Hit/Miss Resolve unit 110 as a cache "hit".
  • LRU Control unit 127 selects a Data Cache Memory unit, which is the least recently used, for storing the required data.
  • LRU Control unit 127 enables performing a cache write back process of the least recently used data; it should be noted that during the cache write back, the least recently used data is written back from one or more corresponding Data Cache Memory units (from one or more cache lines) into the External Memory by means of Memory Interface 130.
  • the write back process is performed (in case when there is no (or insufficient) available memory storage space) both when there is a cache miss on reading the data (e.g., from the converted "source 1" or “source 2" addresses) and when there is a cache miss on writing the data (e.g., to the converted "destination" address).
  • LRU Control unit 127 selects said least recently used data to be written back and issues a cache write back request to Memory Interface 130, which in turn accesses Data Cache Memory units 140, reads the corresponding cache lines and provides the write back data from said cache lines into the External/System Memory (e.g., into peripherals/memory means 302 or 303, L2/L3 cache memories and the like) by means of a Memory Controller 131.
  • data related to the converted "source 1", "source 2" addresses is outputted from Data Cache Memory 140 over "source 1" and "source 2" read buses 231, 232, respectively, to be processed by means of at least one execution unit 130 (Fig. 1B). Further, the processed data is provided from said execution unit 130 into the corresponding cache line, related to the converted "destination" address, of said Data Cache Memory 140 over write back bus 233.
  • the cache "miss” is handled in parallel by means of a plurality of Cache Request Handler units 121, 122, etc.
  • Memory Controller 131 and Data Cache Memory 140 handle multiple cache "miss” requests in parallel.
  • the Hit/Miss Resolve unit 110 can be provided within Cache Controller 120.
  • Data Cache Memory database 111 can be provided within Cache Controller 120 or within another unit of Data Cache Memory system 301. Further, according to still another embodiment of the present invention, Data Cache Memory Database 111 also stores CPU mapped addresses related to the cacheable memory.
  • SRF Control unit 350 controls the CPU pipeline process by generating required control signals during the CPU pipeline stages.
  • Data Cache Memory system 301 is a write-through cache system, in which data written into Data Cache Memory 140, is also written into the External/System Memory (e.g., peripherals/memory means 302, L2/L3 cache memories, hard disk and the like).
  • Fig. 3 is a sample flow chart 200 of Data Cache Memory system 301 (Fig. 1B) operation, according to an embodiment of the present invention.
  • one or more (e.g., up to three) CPU mapped addresses related to SRF peripheral/memory means are received and converted/decoded by means of Hit/Miss Resolve unit 110 (Fig. 2).
  • said Hit/Miss Resolve unit 110 checks whether these decoded addresses are related to the cacheable memory. If not, then it sends a corresponding signal to SRF Control unit 350 (Fig. 2), indicating that these decoded addresses are not related to the cacheable memory, and that the CPU pipeline should be continued accordingly.
  • Hit/Miss Resolve unit 110 checks to what cacheable memory unit each of said addresses is related (e.g., to Data Cache Memory unit 141, 142, 143, etc. (Fig. 2) or to an External Memory unit, such as peripherals/memory means 302, 303 (Fig. 1B)), and further checks, at step 210, whether there is a cache hit. For this, Hit/Miss Resolve unit 110 conducts a search within Data Cache Memory Database 111 for said converted/decoded addresses.
  • LRU Control unit 127 accesses Data Cache Memory database 111, which also comprises information regarding all available memory cells (cache lines) of Data Cache Memory units 141, 142, etc. (of Data Cache Memory 140 (Fig. 2)). If there are no available memory cells (cache lines) in one or more Data Cache Memory units 141, 142, etc. (i.e., there is no required available memory storage space), then LRU Control unit 127 enables performing a cache write back process of the least recently used data by sending a cache write back request to Memory Interface 130 (Fig. 2), at step 230.
  • Memory Interface 130 accesses Data Cache Memory 140, reads the corresponding cache lines and provides the write back data from said cache lines into the External/System Memory (e.g., into peripherals/memory means 302 or 303, L2/L3 cache memories and the like) by means of a Memory Controller 131 (Fig. 2).
  • the one or more Cache Request Handler units 121, 122, etc. (Fig. 2) check whether it is a cache miss on reading or writing the data, i.e. whether a cache miss is related to reading the data from the "source 1" or “source 2" address, or is related to writing the data into a "destination" address.
  • at step 245, said data is fetched by means of Memory Interface 130 into one or more corresponding cache lines of Data Cache Memory 140, which have a required available memory space.
  • Cache Controller 120 issues a fetch request to Memory Interface 130, which reads the corresponding "missed" data from the External/System Memory (e.g., peripherals/memory means 302, L2/L3 cache memories, hard disk and the like) by using Memory Controller 131. After this, the read data is written into the selected cache line(s) of Data Cache Memory (such as Data Cache Memory units 141, 142 and 143).
  • Memory Interface 130 sends a corresponding indication to Cache Controller 120 and to Hit/Miss Resolve unit 110, acknowledging to them that the "missed" data is written into the selected cache line(s), and updating Data Cache Memory Database 111 (Fig. 2) with the new cache line(s) allocations.
  • as a result, when this memory address is subsequently accessed, it will be interpreted by means of Hit/Miss Resolve unit 110 as a cache "hit".
  • one or more Cache Request Handler units send corresponding signals to Hit/Miss Resolve unit 110 at step 255, and further to Task Ready Queue unit 150, acknowledging to them that handling the current cache miss (on writing the data) is accomplished, and instructing the CPU pipeline to proceed its execution accordingly by issuing a "task ready" indication to SRF Control unit 350, at step 260.
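The flow just walked through (steps 210, 230, 245, 255 and 260) can be condensed into one illustrative C routine; every helper function below is an invented stand-in for the unit named in its comment:

```c
#include <stdbool.h>
#include <stdint.h>

extern bool cacheable(uint32_t a);           /* address decode (unit 325/110)        */
extern bool cache_hit(uint32_t a);           /* search in Database 111               */
extern bool space_available(void);           /* consult LRU Control unit 127         */
extern void write_back_lru(void);            /* step 230, via Memory Interface 130   */
extern void fetch_missed_data(uint32_t a);   /* step 245, via Memory Controller 131  */
extern void signal_pipeline_continue(void);  /* "task ready" to SRF Control unit 350 */

static void handle_address(uint32_t addr)
{
    if (!cacheable(addr) || cache_hit(addr)) {  /* non-cacheable, or step 210 hit  */
        signal_pipeline_continue();
        return;
    }
    if (!space_available())                     /* no free cache line(s)?          */
        write_back_lru();                       /* step 230: free cache line(s)    */
    fetch_missed_data(addr);                    /* step 245: fill the line         */
    signal_pipeline_continue();                 /* steps 255/260: acknowledge and
                                                 * let the pipeline proceed        */
}
```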
  • the number of instructions and CPU clock cycles required for accessing and manipulating/processing is significantly reduced compared to the prior art, by providing substantially direct Data Cache Memory system 301 access for one or more CPU execution units 130 (Fig. 1B).
  • the number of instructions and corresponding CPU clock cycles for processing the data provided within/from said Data Cache Memory system 301 can be reduced to a single instruction that takes a single CPU clock cycle, further enabling generation of multiple requests (e.g., up to three requests) for accessing said Data Cache Memory system 301.
  • CPU stalls are substantially prevented.

Abstract

The present invention relates to a method, register file system, and processing unit device configured to enable substantially direct access to cache memory means that are, at least partially, provided within said register file system (and within said processing unit device), thereby enabling substantially direct connectivity (communication) between said cache memory means and one or more execution units (such as ALUs).

Description

METHOD, REGISTER FILE SYSTEM, AND PROCESSING UNIT DEVICE ENABLING SUBSTANTIALLY DIRECT CACHE MEMORY ACCESS
Field of the Invention
The present invention relates to processing units. More particularly, the present invention relates to a method, register file system, and processing unit device configured to enable substantially direct access to cache memory means that are, at least partially, provided within said register file system (and within said processing unit device).
Definitions, Acronyms and Abbreviations
Throughout this specification, the following definitions are employed:
Cache memory: is a temporary storage memory, where relatively frequently accessed data can be stored for achieving more rapid data access. Once the data (a copy of the data) is stored in the cache memory, future use can be made by accessing said data in the cache memory rather than refetching or recomputing the original data, so that the average access time to said data is relatively shorter.
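The "relatively shorter" average access time can be quantified by the standard textbook relation (stated here for context, not taken from the patent): $t_{avg} = t_{cache} + m \cdot t_{penalty}$, where $t_{cache}$ is the cache access time, $m$ is the miss ratio (the fraction of accesses not found in the cache), and $t_{penalty}$ is the additional time needed to fetch the data from the slower backing memory.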
Cache Line: a block of memory that is transferred between a memory means (e.g., program/system memory, such as SRAM (Static Random Access Memory)) and a cache memory. The cache line is usually fixed in size, ranging for example from 16 to 256 bytes. The effectiveness of the cache line size depends on the application, and cache circuits may be configurable to a different cache line size. Also, there are several conventional algorithms for dynamically adjusting a cache line size.
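Because a cache line has a fixed, typically power-of-two size, a memory address decomposes into an in-line offset, a line (set) index, and a tag. A minimal C sketch, with the 64-byte line size and 256 sets chosen purely for illustration:

```c
#include <stdint.h>

#define LINE_SIZE   64u   /* bytes per cache line (hypothetical) */
#define NUM_SETS    256u  /* number of sets (hypothetical)       */

/* Split a 32-bit address into the usual offset/index/tag fields. */
static inline uint32_t line_offset(uint32_t addr) { return addr % LINE_SIZE; }
static inline uint32_t set_index(uint32_t addr)   { return (addr / LINE_SIZE) % NUM_SETS; }
static inline uint32_t tag_bits(uint32_t addr)    { return addr / (LINE_SIZE * NUM_SETS); }
```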
Cacheable Memory: memory that can be used as the cache memory, i.e. memory that can be cached.
Fetching: means retrieving an instruction from the program memory, wherein the instruction is represented by a number or by a sequence of numbers.
Instruction Register: a register that stores a current instruction to be executed. The instruction register is provided within a processing unit, and is located in physical proximity to processing means, such as an ALU (Arithmetic Logic Unit).
Opcode: an opcode (operation code) is the portion of an instruction that specifies an operation to be performed (e.g., addition, subtraction, and the like).
Operand: an instruction operand is data/value or a pointer (address) to the data, on which (or by means of which) an operation/processing (e.g., addition, subtraction, and the like) has to be performed.
Register File: is a storage unit/system located within the processing unit, such as the CPU (Central Processing Unit). Generally, the register file is a combination of registers and combinatorial logic.
Background of the Invention
The past decade is characterized by dramatic developments in the field of computers. For executing and processing most recently developed computer applications, fast and powerful computer processing units are required. In general, according to the prior art, a conventional central processing unit (CPU) operates by four steps: a) fetching; b) decoding (that involves reading data from the CPU register file); c) instruction executing; and d) writing back the result of said executing. The first step, fetching, involves retrieving an instruction from the program memory (e.g., RAM (Random Access Memory)). Instruction location in the program memory is determined by a program counter, which keeps track of the CPU processing in the current program. After the instruction is fetched from the memory, the program counter is incremented by the length of the instruction word in terms of memory units; also, for example, when a conventional JUMP or BRANCH command is received, the program counter value is changed accordingly. Often, the instruction to be fetched must be retrieved from relatively slow memory (e.g., secondary memory) by means of a conventional Input/Output control unit, causing the CPU to stall while waiting for the instruction to be returned back to said CPU. The instruction that the CPU fetches from the memory is used to determine what the CPU has to do, thus the CPU cannot proceed processing until said instruction is fetched. After that, at the decoding step, the instruction is broken up into several portions to be processed by other CPU units (e.g., ALU). The way in which the numerical instruction value is interpreted is defined by the CPU instruction set architecture (ISA). Often, a group of numbers in the instruction, called an opcode (operation code), indicates which operation has to be performed. The remaining numbers in the instruction usually provide information required for that instruction (e.g., operands for the addition/subtraction operation). Such operands may be given as a constant value (called an immediate value). Alternatively, operands may be provided as addresses of corresponding values stored in a register file (that comprises a plurality of registers, e.g., 32 or 64 registers).
After the fetching and decoding steps, the executing step is performed. During this step, the CPU performs the desired operation. If, for example, an addition operation is requested, the numbers to be added are provided to inputs of the Arithmetic Logic Unit (ALU), and the result (the final sum) will be provided at the ALU outputs. Generally, the ALU comprises a circuitry to perform simple arithmetic and logical operations on the inputs, such as addition/subtraction operations. Finally, at the write back step, the results of the executing step are "written back" to the register file or to CPU registers. After accomplishing the instruction execution and writing back the resulting data, the entire process repeats with the next instruction cycle, normally fetching the next-in-sequence instruction due to the incremented value in the program counter.
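The four steps just described can be made concrete with a toy C model; every structure and field below is an invented illustration (e.g., the 32-entry register file and 8-bit register fields), not the patent's design:

```c
#include <stdint.h>

/* A toy model of the four-step cycle described above (fetch, decode,
 * execute, write back). All encodings are illustrative assumptions. */
enum opcode { OP_ADD, OP_SUB };

struct cpu {
    uint32_t pc;            /* program counter                  */
    uint32_t regs[32];      /* local register file (32 entries) */
};

static void step(struct cpu *c, const uint32_t *program)
{
    uint32_t insn = program[c->pc++];           /* 1. fetch; PC is incremented   */
    enum opcode op = (enum opcode)(insn >> 24); /* 2. decode: opcode field       */
    uint32_t d  = (insn >> 16) & 0xFF;          /*    destination register index */
    uint32_t s1 = (insn >> 8)  & 0xFF;          /*    "source 1" register index  */
    uint32_t s2 =  insn        & 0xFF;          /*    "source 2" register index  */
    uint32_t result =                           /* 3. execute (ALU)              */
        (op == OP_ADD) ? c->regs[s1] + c->regs[s2]
                       : c->regs[s1] - c->regs[s2];
    c->regs[d] = result;                        /* 4. write back                 */
}
```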
For achieving good performance, the above four CPU steps have to be performed relatively fast. However, when working with non-local memory means, such as conventional cache memory, on-board memory (e.g., DRAM (Dynamic Random Access Memory)), secondary memory and the like, the access time is greatly increased leading to significant delays and to a waste of the valuable CPU processing resources. In turn, it greatly decreases CPU performance and consumes most of the CPU processing time.
According to the prior art, a conventional processing unit (e.g., CPU) uses a limited set of registers (e.g., 32 or 64 registers) in its register file. The register file can be implemented in hardware by means of a plurality of electronic elements, such as latches, flip-flops, memory arrays, multi-port SRAM (Static Random Access Memory) and the like. However, in most cases, this register file is a portion of the CPU, and it is located in physical proximity to the ALU (Arithmetic Logic Unit) of said CPU. Such a register file can be named a "local register file system". One of the reasons for having a limited local register file system is the limited size of the CPU program word, which usually contains pointers to 3 registers: one register (accessed via "source 1" input of the local register file system) storing the first value to be processed by the ALU, another register (accessed via "source 2" input of the local register file system) storing the second value to be processed by said ALU, and the last register (destination register, accessed via "destination" input of the local register file system) storing the result value of the ALU processing (e.g., the sum of the values stored within said "sources 1 and 2"). Since the CPU program word is limited (in terms of data bits), the number of bits allowed for each of the above registers is low. For example, a CPU that has 64 registers in its register file requires a 6-bit address (pointer) per register (2^6 = 64), and such a register file is considered to be relatively large, according to the prior art. In addition, even by using a CPU that is capable of receiving instructions as a Very Large Instruction Word (VLIW), the CPU register file relatively rarely reaches a capacity of 256 registers.
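As a worked example of this bit budget (the 32-bit program word width below is an assumption for illustration, not stated in the text): a pointer into a 64-entry register file needs $\log_2 64 = 6$ bits, so the three register pointers consume $3 \times 6 = 18$ bits, leaving $32 - 18 = 14$ bits of a 32-bit program word for the opcode and any modifiers.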
Another reason for having a limited number of registers in the CPU register file is due to conventional hardware limitations related to fast memory access and to capability of using a relatively large number of ports. Usually, a conventional ALU requires providing at least two read ports and one write port. A conventional system that implements a CPU also usually contains a memory controller (that comprises an MMU (Memory Management Unit)), various memory means (e.g., cache, SRAM, etc.), and different peripherals, such as cache controllers, interrupt controllers, timers, hardware accelerators, a DMA (Direct Memory Access) engine, communication controllers (e.g., a USB controller), and the like. The memory controller controls the CPU access to a wide range of registers/memory means, such as internal CPU memories (program and data), on-chip memories (including, for example, cache memory), on-chip peripheral memories, and off-chip (device) memories.
It should be noted that according to the prior art, the CPU local register file system is significantly limited in its size (e.g., contains only 32 registers), and the CPU memory mapped registers (to be accessed, for example, by CPU internal units, such as the ALU) are physically located outside the CPU local register file system (e.g., cache, secondary memory, etc.). Thus, in order for the CPU to be able to perform data manipulation on any of its memory mapped registers, the CPU needs to generate LOAD commands for loading data by means of the memory controller from each of said memory mapped registers (e.g., located off-CPU-chip (outside the CPU chip)) into registers of the CPU local register file system. After the data is loaded, the CPU can manipulate said data (e.g., to perform data addition or data subtraction operations by means of its ALU unit). Then, the result is first stored in another register of the CPU local register file system, and after that said result is conveyed to the corresponding memory mapped register (for example, non-local device/peripheral (located off-CPU-chip)) for updating it with a new data value - the result of ALU processing. For that, the CPU needs to generate at least one STORE command for storing said result within said non-local device/peripheral. In addition, usually a single ALU command (e.g., addition, subtraction, etc.) is related to processing of data located within at least two registers. Therefore, the CPU needs to generate at least two separate LOAD commands (each in a single CPU clock cycle) for loading the data required for processing. In some more complex VLIW CPUs, a multi-LOAD request can be generated in a single CPU clock cycle and then, in an additional clock cycle, one or more (destination) registers within the local register file system can be updated with new data values (with the results of ALU processing). Generally, according to the prior art, for processing (manipulating) data by means of the CPU, at least three commands have to be generated: LOAD (or multi-LOAD) command for loading data from an external register/memory means (device register, cache memory, etc. that is located externally to CPU (off-CPU-chip)) into the local register file system; ALU data processing command for executing various operations (e.g., addition, subtraction); STORE command for writing back the result of ALU processing into the corresponding non-local register/memory means - for this, even if working in a pipeline and avoiding data hazards, such data processing takes at least three CPU clock cycles.
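For illustration, the conventional pattern just described can be written as a short C sketch; the helper names and memory-mapped pointers below are invented, and each commented step corresponds to at least one CPU clock cycle in the text's account:

```c
#include <stdint.h>

/* Hypothetical memory-mapped operands; the pattern (LOAD, LOAD, ALU, STORE)
 * is the one described above, each step costing at least one clock cycle. */
static uint32_t load(volatile const uint32_t *mmr)        { return *mmr; }
static void     store(volatile uint32_t *mmr, uint32_t v) { *mmr = v; }

static void add_memory_mapped(volatile uint32_t *dst,
                              volatile const uint32_t *src1,
                              volatile const uint32_t *src2)
{
    uint32_t a = load(src1);   /* LOAD #1: bring operand into the local file */
    uint32_t b = load(src2);   /* LOAD #2: bring operand into the local file */
    uint32_t r = a + b;        /* ALU data processing command                */
    store(dst, r);             /* STORE: write the result back out           */
}
```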
When, for example, the data needs to be moved from one CPU memory mapped register (or from another external memory means, such as cache memory, secondary memory, etc.) to another CPU memory mapped register/memory means without performing ALU data processing, then the DMA (Direct Memory Access) engines, which are CPU peripherals, can be used for conducting such data movements, thereby reading the data from said one CPU memory mapped register/memory means and writing the data to said another off-CPU-chip register/memory means. By using the DMA engines, the CPU is not required to generate LOAD and STORE commands. In addition, DMA operations can be conducted in parallel with CPU operations. However, for using the DMA engines, dedicated hardware is required and the DMA engines need to be configured and enabled by the CPU; further, this approach is applicable only when no data processing (or substantially negligible data processing) is required.
According to the prior art, for loading the data from registers, processing the loaded data and storing the result within the memory (i.e., performing LOAD, "ALU processing" and STORE commands) with minimal CPU stalls (delays), conventional processing devices use cache memories or Tightly Coupled Memories (TCM), such as SRAM, etc. These memories are located in physical proximity to the CPU core, and therefore accessing such memories (e.g., performing LOAD and STORE commands) is done with a relatively low latency compared to accessing other memory means, such as a hard disk, for example. Usually, when the CPU needs to operate on a large chunk of data, which in turn involves performing relatively long loops of ALU commands, the data is first copied by the MMU to the cache memory or by the DMA engine to the tightly coupled memory. Only then, the CPU executes ALU commands within said loops of commands. Thus, this leads to a significant latency between the time when the data is first conveyed to the memory mapped register/memory means and the time when the CPU can process it. Especially, this leads to a significant latency until the processed data is written back into the memory mapped register/memory means.
According to the prior art, the cache memory means are divided into cache lines. Generally, a CPU generates a LOAD/STORE command (containing a memory address) for loading data from said memory address or storing data into said address. Then, the cache memory (system) determines whether said memory address corresponds to it, and if so, whether the data related to said memory address is stored within one of its cache lines. When the memory address is related to the cache memory and the corresponding data is stored within said one of the cache lines, then it is called a "hit". On the other hand, when the memory address is not related to the cache memory, or when it is related to the cache memory, but the corresponding data is not stored within said one of the cache lines, then it is called a "miss". According to the prior art, a CPU is stalled in case of a cache miss, waiting for a cache hit. However, when using a multi-threaded CPU, another thread can be executed while waiting for the cache hit. In addition, if there is a cache miss, then the corresponding external memory data (to which said memory address is related) needs to be cached for filling a cache line. Also, when the cache memory is "almost full", then a cache write back is executed, in order to free memory space for the new data. For this, the data stored within a cache line (inside the cache memory) is copied into the external memory (e.g., SDRAM (Synchronous Dynamic Random Access Memory)).
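A compact C model of this hit/miss/write-back behavior, assuming a direct-mapped cache with invented parameters (64-byte lines, 256 sets) and an invented backing-memory interface:

```c
#include <stdint.h>
#include <stdbool.h>

#define LINE_SIZE 64u     /* bytes per cache line (hypothetical) */
#define NUM_SETS  256u    /* number of sets (hypothetical)       */

struct line {
    bool     valid, dirty;
    uint32_t tag;
    uint8_t  data[LINE_SIZE];
};

static struct line cache[NUM_SETS];

/* Backing ("external") memory, e.g., SDRAM in the text; invented interface. */
extern void mem_read_line(uint32_t addr, uint8_t *buf);
extern void mem_write_line(uint32_t addr, const uint8_t *buf);

static uint8_t *access_line(uint32_t addr)
{
    uint32_t idx = (addr / LINE_SIZE) % NUM_SETS;
    uint32_t tag = addr / (LINE_SIZE * NUM_SETS);
    struct line *l = &cache[idx];

    if (l->valid && l->tag == tag)        /* "hit": data already cached    */
        return l->data;

    if (l->valid && l->dirty) {           /* "miss" on an occupied line:   */
        uint32_t victim = (l->tag * NUM_SETS + idx) * LINE_SIZE;
        mem_write_line(victim, l->data);  /* write back to free the space  */
    }
    mem_read_line(addr - addr % LINE_SIZE, l->data);  /* fill the line     */
    l->valid = true;
    l->dirty = false;
    l->tag   = tag;
    return l->data;
}
```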
The above problems of achieving fast data access and performing fast data processing have been recognized in the prior art, and several solutions have been proposed. For example, US 6,178,482 discloses a system embedded with a processor, containing sets of cache lines for accessing cache memories, which are dynamically operated as different register sets for supplying source operands and, in turn, accepting destination operands for instruction execution. The different register sets may be of the same or of different virtual register files, and if the different register sets are of different virtual register files, the different virtual register files may be of the same or of different architectures. The cache memories may be directly accessed by using cache addresses.
The present invention has many advantages over the prior art. For example, one advantage of the present invention is that it significantly reduces the number of instructions and CPU clock cycles required for accessing and manipulating/processing (e.g., performing addition, subtraction, data moving, data shifting operations and the like) memory mapped data, by providing substantially direct cache memory access for one or more CPU execution units (for processing the data). Thus, the number of instructions and corresponding CPU clock cycles for processing the data stored within the cache memory can be reduced, for example, to a single instruction that takes a single CPU clock cycle, enabling the generation of multiple requests (e.g., up to three requests) for accessing said cache memory.
Another advantage of the present invention is that it provides a method, register file system, and processing unit (device) in which, for reducing the number of instructions and CPU clock cycles required for manipulating/processing memory mapped data, there is substantially no need to change the structure of the conventional CPU program word (conventional CPU instruction).
Still another advantage of the present invention is that it eliminates the need to use conventional DMA engines.
A further advantage of the present invention is that it provides a method and processing unit (device), in which CPU stalls (delays) are substantially prevented.
Other advantages of the present invention will become apparent as the description proceeds.

Summary of the Invention
The present invention relates to a method, register file system, and processing unit device configured to enable substantially direct access to cache memory means that are, at least partially, provided within said register file system (and within said processing unit device), thereby enabling substantially direct connectivity (communication) between said cache memory means and one or more execution units (such as ALUs).
The register file system comprises:
a) at least one cache memory system comprising a plurality of cache memory units assigned with memory addresses, said cache memory system configured to:
a.1. receive at least one memory address and determine whether said at least one memory address is related to a cacheable memory comprising at least said plurality of cache memory units; and
a.1.1. perform one or more of the following, if said at least one memory address is related to said cacheable memory:
a.1.1.1. read data from one or more cache memory units that correspond to said at least one memory address, giving rise to the read data; and
a.1.1.2. write data into the one or more cache memory units that correspond to said at least one memory address; and
a.1.2. send a corresponding signal from said cache memory system toward a control unit, which is operatively coupled to said at least one cache memory system, if said at least one memory address is not related to said cacheable memory; and
b) at least one output port for outputting said read data from said cache memory system, giving rise to the outputted cache memory data.
According to an embodiment of the present invention, the outputted cache memory data is provided to at least one execution unit that is configured to receive said outputted cache memory data and process it, and further configured to write back data into the one or more cache memory units.
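By way of a non-limiting illustration only, the behaviour attributed to the cache memory system above can be restated as the following informal C++ interface sketch; all type and method names are invented and form no part of the invention:

    #include <cstdint>

    // Informal restatement of the claimed behaviour (names invented).
    struct CacheMemorySystemSketch {
        // Returns true if 'addr' is related to the cacheable memory and the
        // read/write was performed; returns false after signalling the
        // control unit that the address falls outside the cacheable memory.
        virtual bool access(uint32_t addr, bool is_write, uint32_t* data) = 0;
        virtual ~CacheMemorySystemSketch() = default;
    };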
According to another embodiment of the present invention, the cacheable memory comprises one or more external cacheable memory units.
According to still another embodiment of the present invention, the register file system further comprises a memory controller for enabling conveying data from the one or more external cacheable memory units into the one or more cache memory units.
According to still another embodiment of the present invention, the register file system further comprises a memory controller for enabling writing back data from the one or more cache memory units into the one or more external cacheable memory units.
According to a particular embodiment of the present invention, the memory addresses that are received by means of the cache memory system are processing unit (PU) mapped addresses.
According to an embodiment of the present invention, the received PU mapped addresses are decoded or converted by means of one or more of the following: (a) the register file system; and (b) the cache memory system.
According to an embodiment of the present invention, the cache memory system further comprises a hit/miss resolve unit for determining whether the received one or more addresses relate to the cacheable memory and whether there is a cache hit request or cache miss request. According to another embodiment of the present invention, the cache memory system further comprises a cache controller unit for controlling the operation of the cache memory system.
According to still another embodiment of the present invention, the cache memory system further comprises one or more cache request handler units for determining for each cache miss request whether it is a cache miss on reading the data or on writing the data.
According to still another embodiment of the present invention, the cache memory system further comprises a memory interface for enabling said cache memory system to interact with the one or more external cacheable memory units.
According to still another embodiment of the present invention, the cache memory system further comprises a data cache memory device provided with the plurality of data cache memory units for storing data.
According to a further embodiment of the present invention, the cache memory system further comprises a Least Recently Used (LRU) control unit for determining whether there is available storage space within the data cache memory device.
According to still a further embodiment of the present invention, the LRU control unit allocates the required storage space within the data cache memory device.
According to still a further embodiment of the present invention, the LRU control unit enables performing a write back of data from the data cache memory device into the one or more external cacheable memory units.
According to an embodiment of the present invention, the cache memory system further comprises a task ready queue unit for keeping a status of current cache miss requests being handled by the cache memory system. According to another embodiment of the present invention, the cache memory system further comprises a cache request queue unit configured to maintain a queue of cache miss requests.
The processing unit (PU) device comprises:
a) a register file system provided with:
a.1. at least one cache memory system comprising a plurality of cache memory units having respective memory addresses, said cache memory system configured to:
a.1.1. receive at least one memory address and determine whether said at least one memory address is related to a cacheable memory comprising at least said plurality of cache memory units; and
a.1.1.1. perform one or more of the following, if said at least one memory address is related to said cacheable memory:
- read data from one or more cache memory units that correspond to said at least one memory address, giving rise to the read data; and
- write data into the one or more cache memory units that correspond to said at least one memory address; and
a.1.1.2. send a corresponding signal from said cache memory system toward a control unit, which is operatively coupled to said at least one cache memory system, if said at least one memory address is not related to said cacheable memory; and
a.2. at least one output port for outputting said read data from said cache memory system, giving rise to the outputted cache memory data; and
b) at least one execution unit configured to receive said outputted cache memory data and process it, and configured to write back data into said one or more corresponding cache memory units.
The method of operating with a cache memory system, which is at least partially provided within a register file system, and which is configured to substantially directly communicate with an execution unit, comprises:
a) receiving at least one memory address;
b) determining whether said at least one memory address is related to a cacheable memory comprising at least a plurality of cache memory units; and
b.1. performing one or more of the following, if said at least one memory address is related to said cacheable memory:
b.1.1. reading and outputting data from one or more cache memory units that correspond to said at least one memory address, giving rise to the outputted cache memory data, and then receiving said outputted cache memory data and processing it; and
b.1.2. writing data into the one or more cache memory units that correspond to said at least one memory address; and
b.2. sending a corresponding signal from said cache memory system toward a control unit, which is operatively coupled to said cache memory system, if said at least one memory address is not related to said cacheable memory.
Brief Description of the Drawings
In order to understand the invention and to see how it may be carried out in practice, preferred embodiments will now be described, by way of non-limiting examples only, with reference to the accompanying drawings, in which:
Fig. 1A is a schematic illustration of a spread register file system, according to an embodiment of the present invention;
Fig. 1B is a schematic illustration of a spread register file system, according to another embodiment of the present invention;
Fig. 2 is a schematic illustration of a Data Cache Memory system (e.g., a peripheral/memory means), incorporating converters (decoders) for converting (decoding) one or more inputted CPU mapped addresses, according to an embodiment of the present invention; and
Fig. 3 is a sample flow chart of a Data Cache Memory system operation, according to an embodiment of the present invention.
It will be appreciated that for simplicity and clarity of illustration, elements shown in the figures have not necessarily been drawn to scale. For example, the dimensions of some of the elements may be exaggerated relative to other elements for clarity. Further, where considered appropriate, reference numerals may be repeated among the figures to indicate corresponding or analogous elements.
Detailed Description of the Preferred Embodiments
In the following detailed description, numerous specific details are set forth in order to provide a thorough understanding of the invention. However, it will be understood by those skilled in the art that the present invention may be practiced without these specific details. In other instances, well-known methods, systems, procedures, units, components, circuits and the like have not been described in detail so as not to obscure the present invention.
Hereinafter, where the term "spread register file" system or "SRF" system is mentioned, it refers to the expanded (spread) register file according to the present invention, which can be related to the entire CPU memory map, thereby enabling substantially direct memory/peripheral access for one or more CPU execution units (for processing the data). Further, where the term "local register file system" or "LRF system" is mentioned, it refers to the conventional CPU local register file system. It should also be noted that according to an embodiment of the present invention, the entire (complete) CPU memory map can comprise local registers (e.g., local CPU register files), cache memories, tightly coupled memories, on-chip/off-chip peripherals/memories (or registers) and any other conventional memory means. Also, where the term CPU is mentioned, it refers to any processing unit (PU) device, such as a microprocessor and the like. In addition, where the term "processing" (or a similar term) is mentioned, it refers to any data operation, such as data manipulation, data transfer, addition or subtraction of data and the like.
It should be noted that co-pending US provisional patent application no. 61/071,584, titled "Register File System and Method Thereof for Enabling a Substantially Direct Memory Access", discloses a processing unit (e.g., CPU, microprocessor and the like) that implements an SRF system and enables substantially direct memory access for one or more (CPU) execution units (for processing the data). In addition, another co-pending US provisional patent application, 61/071,583, titled "Improved Processing Unit Implementing both a Local Register File System and Spread Register File System, and a Method thereof", teaches implementing both a conventional local register file system (LRF system) and a spread register file system (SRF system) to be used by one or more processing units. The present invention presents a method, register file system, and processing unit (device) configured to enable substantially direct access to a cache memory (system) that is, at least partially, provided within said register file system (and within said processing unit (device)), thereby enabling substantially direct connectivity (communication) between said cache memory (system) and one or more execution units (such as ALUs).
Fig. 1A is a schematic illustration of spread register file system 206, according to an embodiment of the present invention. Spread register file (SRF) system 206 receives at its input ports (not shown) one or more of the following addresses: a CPU mapped "source 1" address (MS1 address) provided to said SRF system over bus (line) 151, a CPU mapped "source 2" address (MS2 address) provided over line 152, and a CPU mapped "destination" address (MD address) provided over line 153 (each address is, for example, 32 bits long). According to this embodiment of the present invention, these CPU mapped addresses are converted (decoded) by means of address converters (decoders) 320, 321 and 322, respectively, to addresses of corresponding peripheral/memory means, such as peripherals/memory means 301, 302, 303, ..., 310 (which can be, for example, cache memories, DDR (Double Data Rate) memories, SRAM (Static Random Access Memory) memories, hard disks, any other memory means, or any combination thereof). It should be noted that said address converters can convert the CPU mapped addresses in various ways, based on different address converting functions/expressions. In addition, the above addresses can be converted according to an instruction (program word) opcode inputted from an instruction register (not shown) into a control unit 350 (operatively coupled to the peripherals/memory means), which generates corresponding control signals to address converters 320, 321 and 322 and to execution unit 130 (e.g., an ALU): for example, if said instruction opcode relates to moving "source 1" data to the "destination" register/memory cell(s), then only address converters 320 and 322 need be activated.
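A minimal C++ sketch of such an address conversion is given below, assuming (purely for illustration) that each peripheral/memory means owns a fixed window of the CPU memory map; the base addresses and identifiers are invented:

    #include <cstdint>

    struct Converted {
        int peripheral;   // which peripheral/memory means is addressed
        uint32_t local;   // address within that peripheral
        bool valid;
    };

    // Hypothetical CPU memory map (windows invented for the sketch).
    static const struct Window { int id; uint32_t base, size; } kMap[] = {
        { 301, 0x10000000u, 0x00100000u },  // e.g., Data Cache Memory system 301
        { 302, 0x20000000u, 0x01000000u },  // e.g., peripheral/memory means 302
    };

    Converted convert(uint32_t cpu_mapped) {
        for (const Window& w : kMap)
            if (cpu_mapped >= w.base && cpu_mapped < w.base + w.size)
                return { w.id, cpu_mapped - w.base, true };
        return { 0, 0, false };  // not mapped to any known window
    }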
According to an embodiment of the present invention, the converted "source 1" and "source 2" addresses (CS1 and CS2 addresses, respectively) are inputted into corresponding peripheral/memory means (e.g., Data Cache Memory system 301), which in turn outputs corresponding data stored in said addresses over "source 1" read bus 231 and "source 2" read bus 232. Then, said data is processed (executed) by means of one or more execution units 130 (such as ALUs). After that, the processed data (processing result) is provided over write back bus 233 to one or more peripheral/memory means (e.g., Data Cache Memory system 301 or 302) to be stored in corresponding converted destination addresses (CD addresses) within said one or more peripheral/memory means. It should be noted that according to an embodiment of the present invention, the "source 1", "source 2", and "destination" memory cells (related to said "source 1", "source 2" and "destination" addresses, respectively) can be physically located within the same or within different peripheral/memory means.
According to another embodiment of the present invention, address converters 320, 321 and 322 further provide Write Enable (WE)/Chip Select (CS) signals (for example, binary "0" or "1") to each of said peripheral/memory means 301, 302, 303, ..., 310, for enabling reading from or writing to said peripheral/memory means. The corresponding WE/CS commands can be provided to each of said peripheral/memory means when accessing each converted address (e.g., the "source 1" converted address) within said peripheral/memory means. For example, for reading data from the converted "source 1" address (e.g., the address of a register within the corresponding peripheral), the CS (read command) and WE (write command) signals provided to said corresponding peripheral are "1" and "0", respectively; in turn, for writing the data into the converted "destination" address of said corresponding peripheral, the WE signal is "1" and/or the CS signal is "0".
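Following the example signal polarity given above (and only that example; real designs vary), the WE/CS generation can be sketched in C++ as:

    // Strobes for one peripheral, following the example polarity in the text
    // (read: CS=1, WE=0; write: WE=1, CS=0). Illustration only.
    struct Strobes { unsigned cs : 1; unsigned we : 1; };

    Strobes strobes_for(bool selected, bool is_write) {
        if (!selected) return Strobes{0, 0};   // peripheral not addressed
        return is_write ? Strobes{0, 1} : Strobes{1, 0};
    }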
It should be noted that according to another embodiment of the present invention, address converters 320, 321 and 322 can be unified in a single address converter for converting CPU mapped "source 1", "source 2" and "destination" addresses into corresponding peripheral/memory means addresses.
Fig. 1B is a schematic illustration of spread register file system 206, according to another embodiment of the present invention. According to this embodiment, one or more peripherals/memory means, such as Data Cache Memory system 301, can comprise address converters (decoders) 325, so that the address conversion (or the address decoding) is performed within said Data Cache Memory system 301. Thus, the need to provide said address converters 320, 321 and 322 can be substantially eliminated. The one or more peripherals/memory means, such as Data Cache Memory system 301, can receive CPU mapped addresses and decode (or convert) them accordingly for determining the corresponding addresses in which the required data is stored (or is to be stored). According to still another embodiment of the present invention, WE/CS Enablers 320', 321' and 322' (which can be further unified in a single WE/CS Enabler) provide WE/CS commands to said Data Cache Memory system 301 and do not perform the address conversion. According to still another embodiment of the present invention, WE/CS commands can be generated by means of control unit 350. Data Cache Memory system 301 receives a CPU mapped address and determines by means of the integrated address converter(s)/decoder(s) 325 (e.g., according to predefined base-addresses of said Data Cache Memory system 301) whether each received CPU mapped address is related to one or more memory cells provided within said peripheral/memory means or within another peripheral/memory means. It should be noted that the base-addresses of each peripheral/memory means can be dynamically changed as needed.
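A minimal sketch of this integrated decoder variant, assuming a simple base/size window check (the actual decoding scheme is not specified by the text), might look as follows; making the base and size writable reflects that the base-addresses can be changed dynamically:

    #include <cstdint>

    // Integrated decoder variant (Fig. 1B): the peripheral itself decides
    // whether a CPU mapped address belongs to it. Values are assumed.
    struct IntegratedDecoder {
        uint32_t base, size;   // writable, since base-addresses may change dynamically
        bool mine(uint32_t cpu_mapped) const {
            return cpu_mapped >= base && cpu_mapped < base + size;
        }
        uint32_t local(uint32_t cpu_mapped) const { return cpu_mapped - base; }
    };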
Fig. 2 is a schematic illustration of a Data Cache Memory system (e.g., peripheral/memory means 301), incorporating converter(s) (decoder(s)) 325 for converting (decoding) one or more inputted CPU mapped addresses, according to an embodiment of the present invention. The Data Cache Memory system comprises: a Hit/Miss Resolve unit 110, configured to determine whether CPU mapped addresses (inputted into said Data Cache Memory system 301 over lines 151, 152 and 153) relate to a cacheable or non-cacheable memory and whether there is a cache hit or cache miss, and further configured to send corresponding signals/commands to SRF Control unit 350 and to internal units of Data Cache Memory system 301; a Cache Controller unit 120, configured to control the operation of Data Cache Memory system 301 and comprising one or more Cache Request Handler units 121, 122, 123, etc., configured to determine for each cache miss whether it is a cache miss on reading or on writing the data, as well as an LRU (Least Recently Used) Control unit 127, configured to determine whether there is available storage space within said Data Cache Memory system 301 and, if there is no (or insufficient) available storage space, to allocate the required storage space within said Data Cache Memory system 301 by enabling a write back process; a Memory Interface 130, configured to enable Data Cache Memory system 301 to interact with an external memory (e.g., peripherals/memory means 302, 303 (Fig. 1B), L2/L3 (Level 2/Level 3) cache memories, etc.); a Data Cache Memory (device) 140, comprising a plurality of Data Cache Memory units 141, 142, 143, etc. for storing the data; and a Task Ready Queue unit 150, configured to keep a status of the current cache misses being handled by Data Cache Memory system 301 and to send an indication to SRF Control unit 350 when said Data Cache Memory system 301 accomplishes handling such cache misses. Further, if Cache Controller 120 can handle only a single cache miss request at a time, then a Cache Request Queue unit 125 can be provided within said Cache Controller 120 for maintaining a queue of cache miss requests.
According to an embodiment of the present invention, the CPU mapped "source 1", "source 2" and "destination" addresses are inputted into Hit/Miss Resolve unit 110 over lines 151, 152 and 153, respectively. Then, Hit/Miss Resolve unit 110 checks whether said "source 1", "source 2" and "destination" addresses are related to the memory that is defined as the cacheable memory, which can further comprise external memories, such as peripherals/memory means 302, 303, etc. If so, then Hit/Miss Resolve unit 110 checks whether the data that corresponds to said addresses is cached in, i.e., stored within its Data Cache Memory 140, which has a plurality of Data Cache Memory units 141, 142, etc. For that, said CPU mapped "source 1", "source 2" and/or "destination" addresses are converted (decoded) by means of converters (decoders) 325, which can be provided, for example, within Hit/Miss Resolve unit 110, giving rise to the "source 1", "source 2" and "destination" converted addresses.
According to an embodiment of the present invention, Hit/Miss Resolve unit 110 conducts a search within a Data Cache Memory database 111, looking for said converted "source 1", "source 2" and "destination" addresses (being related to the cacheable memory). After conducting the search, Hit/Miss Resolve unit 110 has an indication of whether each of said converted addresses is related to Data Cache Memory 140 or to external memories, such as peripherals/memory means 302, 303, etc. Further, Hit/Miss Resolve unit 110 receives an indication from said database 111 of exactly which Data Cache Memory unit or external peripheral/memory means each converted address relates to. It should be noted that database 111 is continuously updated by means of Cache Controller 120. In addition, according to an embodiment of the present invention, Hit/Miss Resolve unit 110 conducts multiple searches (one for each converted address) within database 111 in parallel. There can be, for example, up to three converted addresses, converted from the corresponding CPU mapped addresses, depending on the CPU instruction (CPU program word) provided from an instruction register (not shown).
According to another embodiment of the present invention, a cache hit occurs if all converted/decoded addresses (e.g., the "source 1", "source 2" and/or "destination" converted addresses) are cacheable and cached in, i.e. all converted/decoded addresses are related to one or more Data Cache Memory units 141, 142, 143, etc. If at least one converted address is not cached in, while such an address is related to a cacheable memory, then it is considered to be a cache miss. It should be noted that a single ALU instruction may result, for example, in up to three cache misses (one cache miss per converted address).
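The resulting resolution rule, paraphrased for up to three converted addresses, can be sketched as below; the enumeration and helper are invented, and the rule simply follows the text: any miss on a cacheable address yields an overall miss, all-hits yield a hit, and addresses unrelated to the cacheable memory leave the pipeline unaffected:

    #include <vector>

    enum class Res { Hit, Miss, NotCacheable };

    // Combine per-address lookups (up to three per ALU instruction).
    Res resolve_all(const std::vector<Res>& per_address) {
        bool any_cacheable = false;
        for (Res r : per_address) {
            if (r == Res::Miss) return Res::Miss;  // each miss must be handled
            if (r == Res::Hit)  any_cacheable = true;
        }
        return any_cacheable ? Res::Hit : Res::NotCacheable;
    }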
According to still another embodiment of the present invention, if no converted address is related to a cacheable memory, then the CPU pipeline execution is not stalled and continues accordingly. In such a case, Hit/Miss Resolve unit 110 sends a corresponding signal to SRF Control unit 350, acknowledging that no inputted CPU mapped address (and in turn, no converted address) is related to a cacheable memory. As a result, SRF Control unit 350 allows the CPU pipeline execution to continue substantially without any CPU stalls. Also, if at least one of the converted (decoded) addresses is related to a cacheable memory and there is a cache hit for this at least one converted address, then the pipeline execution is likewise not stalled and continues accordingly. On the other hand, if there is a cache miss, then according to an embodiment of the present invention, the execution is stalled, and a cache miss indication is forwarded from said Hit/Miss Resolve unit 110 to Cache Controller 120 over line 113. In addition, corresponding cache hit/miss indications are sent to SRF Control unit 350 over line 106. According to an embodiment of the present invention, if Cache Controller 120 can handle only a single cache miss request at a time, and since a single ALU instruction may result in up to three cache misses, a Cache Request Queue unit 125 can be provided within said Cache Controller 120 for maintaining a queue of cache miss requests. When a cache miss indication (signal/command) is received by Cache Controller 120, one or more Cache Request Handler units 121, 122, etc. check whether it is a cache miss on reading or on writing the data, i.e. whether the cache miss relates to reading the data from the "source 1" or "source 2" address, or to writing the data into the "destination" address. Also, LRU (Least Recently Used) Control unit 127 checks whether there are available memory cells in one or more Data Cache Memory units 141, 142, etc. for storing the data (in case of a cache miss on reading or on writing the data).
According to an embodiment of the present invention, LRU Control unit 127 controls the allocation of the available cache memory. For this, LRU Control unit 127 accesses Data Cache Memory Database 111, which contains information regarding the available cache memory and is continuously updated by means of Cache Controller 120. Also, LRU Control unit 127 holds the information regarding those cache lines (of Data Cache Memory units 141, 142, 143, etc.) that have not been used for a relatively long period of time (the "Least Recently Used" cache lines).
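A textbook way to keep such recency information, given here only as an assumed illustration of what LRU Control unit 127 tracks (not the patent's actual circuit), is a list ordered by last use plus an index into it:

    #include <cstdint>
    #include <list>
    #include <unordered_map>

    struct LruBook {
        std::list<uint32_t> order;  // front = most recently used cache line
        std::unordered_map<uint32_t, std::list<uint32_t>::iterator> pos;

        // Record a use of 'line', moving it to the front of the order.
        void touch(uint32_t line) {
            auto it = pos.find(line);
            if (it != pos.end()) order.erase(it->second);
            order.push_front(line);
            pos[line] = order.begin();
        }
        // Candidate for write back; assumes at least one line was touched.
        uint32_t least_recent() const { return order.back(); }
    };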
If there is a cache miss on reading the data, for example for reading the data stored within the "source 1" (or "source 2") address, then said data is fetched by means of Memory Interface 130 into a Data Cache Memory unit that has available memory cells with the required storage space. For this, Cache Controller 120 issues a fetch request to Memory Interface 130, which reads the corresponding "missed" data from the External/System Memory (e.g., peripherals/memory means 302, L2/L3 cache memories, hard disk and the like) by using Memory Controller 131. Following this, the read data is written into the selected cache line(s) of the Data Cache Memory units (such as units 141, 142 and 143). Further, Data Cache Memory Database 111 is updated with the new cache line(s) allocations. Thus, the next time a request for accessing said selected cache line(s) is issued, it will be interpreted by Hit/Miss Resolve unit 110 as a cache "hit".
It should be noted that if more than one Data Cache Memory unit has the required available memory storage space for storing the data to be fetched (the "missed" data related to the cache-missed "source 1" (or "source 2") address), then LRU Control unit 127 selects the least recently used Data Cache Memory unit for storing the required data.
According to an embodiment of the present invention, if there are no available memory cells (cache lines) in one or more Data Cache Memory units 141, 142, etc. (i.e., there is no (or insufficient) available memory storage space), then LRU Control unit 127 enables performing a cache write back process of the least recently used data. During the cache write back, the least recently used data is written back from one or more corresponding Data Cache Memory units (from one or more cache lines) into the External Memory by means of Memory Interface 130. It should be noted that the write back process is performed (when there is no (or insufficient) available memory storage space) both when there is a cache miss on reading the data (e.g., from the converted "source 1" or "source 2" addresses) and when there is a cache miss on writing the data (e.g., to the converted "destination" address).
For this, LRU Control unit 127 selects said least recently used data to be written back and issues a cache write back request to Memory Interface 130, which in turn accesses Data Cache Memory units 140, reads the corresponding cache lines and provides the write back data from said cache lines into the External/System Memory (e.g., into peripherals/memory means 302 or 303, L2/L3 cache memories and the like) by means of a Memory Controller 131.
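Combining the two preceding paragraphs, the room-making step before a miss fill can be sketched as follows, reusing the LruBook sketch above; the Cache interface (has_free_line, take_free_line, write_back) is assumed purely for illustration:

    #include <cstdint>

    // Free a cache line before a miss fill (read miss or write miss alike):
    // when no line is available, the least recently used line is written
    // back to the external/system memory and then reused. Helpers assumed.
    template <class Cache, class Lru>
    uint32_t make_room(Cache& c, Lru& lru) {
        if (c.has_free_line())
            return c.take_free_line();      // no write back required
        uint32_t victim = lru.least_recent();
        c.write_back(victim);               // cache line -> external memory
        return victim;                      // reuse the freed line for the missed data
    }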
It should be noted that data related to the converted "source 1", "source 2" addresses is outputted from Data Cache Memory 140 over "source 1" and "source 2" read buses 231, 232, respectively, to be processed by means of at least one execution unit 130 (Fig. 1B). Further, the processed data is provided from said execution unit 130 into the corresponding cache line, related to the converted "destination" address, of said Data Cache Memory 140 over write back bus 233.
According to an embodiment of the present invention, the cache "miss" is handled in parallel by means of a plurality of Cache Request Handler units 121, 122, etc. In addition, Memory Controller 131 and Data Cache Memory 140 handle multiple cache "miss" requests in parallel.
According to another embodiment of the present invention, Hit/Miss Resolve unit 110 can be provided within Cache Controller 120. In addition, Data Cache Memory database 111 can be provided within Cache Controller 120 or within another unit of Data Cache Memory system 301. Further, according to still another embodiment of the present invention, Data Cache Memory Database 111 also stores the CPU mapped addresses related to the cacheable memory.
It should be noted that according to an embodiment of the present invention, SRF Control unit 350 controls the CPU pipeline process by generating required control signals during the CPU pipeline stages.
Further, it should be noted that according to another embodiment of the present invention, Data Cache Memory system 301 is a write-through cache system, in which data written into Data Cache Memory 140 is also written into the External/System Memory (e.g., peripherals/memory means 302, L2/L3 cache memories, hard disk and the like).
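A write-through store path can be sketched in two lines of C++; both helpers are assumed, and the point is only that the cache update and the external-memory update occur together rather than being deferred to a later write back:

    #include <cstdint>

    void cache_write(uint32_t addr, uint32_t value);     // assumed helper
    void external_write(uint32_t addr, uint32_t value);  // assumed helper

    void store_write_through(uint32_t addr, uint32_t value) {
        cache_write(addr, value);      // update Data Cache Memory 140
        external_write(addr, value);   // mirror into external/system memory
    }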
Fig. 3 is a sample flow chart 200 of the operation of Data Cache Memory system 301 (Fig. 1B), according to an embodiment of the present invention. At step 201, one or more (e.g., up to three) CPU mapped addresses related to SRF peripheral/memory means are received and converted/decoded by means of Hit/Miss Resolve unit 110 (Fig. 2). Then, at step 205, said Hit/Miss Resolve unit 110 checks whether these decoded addresses are related to the cacheable memory. If not, it sends a corresponding signal to SRF Control unit 350 (Fig. 2), indicating that these decoded addresses are not related to the cacheable memory and that the CPU pipeline should continue accordingly. Thus, in such a case, said Data Cache Memory system 301 does not process the data related to said decoded addresses, and the CPU pipeline proceeds substantially without CPU stalls.

If said converted/decoded addresses are related to the cacheable memory, then Hit/Miss Resolve unit 110 checks to which cacheable memory unit each of said addresses relates (e.g., to Data Cache Memory unit 141, 142, 143, etc. (Fig. 2) or to an External Memory unit, such as peripherals/memory means 302, 303 (Fig. 1B)), and further checks at step 210 whether there is a cache hit. For this, Hit/Miss Resolve unit 110 conducts a search within Data Cache Memory Database 111 for said converted/decoded addresses. If there is a cache hit, then said decoded addresses are related to one or more Data Cache Memory units of Data Cache Memory system 301, and Hit/Miss Resolve unit 110 sends a corresponding signal to SRF Control unit 350 (Fig. 2), indicating that there is a cache hit and that the CPU pipeline should continue its execution accordingly, as instructed in step 206. As a result, the CPU pipeline proceeds substantially without CPU stalls.

On the other hand, if there is a cache miss, then at step 215 said Hit/Miss Resolve unit 110 sends a cache miss indication to Cache Controller 120 (Fig. 2) and to SRF Control unit 350. In the following step 225, LRU Control unit 127 (Fig. 2), which is provided within Cache Controller 120 (Fig. 2), determines whether there is available storage space within said Data Cache Memory 140. For this, LRU Control unit 127 accesses Data Cache Memory database 111, which also comprises information regarding all available memory cells (cache lines) of Data Cache Memory units 141, 142, etc. (of Data Cache Memory 140 (Fig. 2)). If there are no available memory cells (cache lines) in one or more Data Cache Memory units 141, 142, etc. (i.e., there is no required available memory storage space), then at step 230 LRU Control unit 127 enables performing a cache write back process of the least recently used data by sending a cache write back request to Memory Interface 130 (Fig. 2). In turn, Memory Interface 130 accesses Data Cache Memory 140, reads the corresponding cache lines and provides the write back data from said cache lines to the External/System Memory (e.g., to peripherals/memory means 302 or 303, L2/L3 cache memories and the like) by means of Memory Controller 131 (Fig. 2). When said write back is accomplished at step 235, or when no write back is required according to step 225, the one or more Cache Request Handler units 121, 122, etc. (Fig. 2) check whether it is a cache miss on reading or on writing the data, i.e. whether the cache miss relates to reading the data from the "source 1" or "source 2" address, or to writing the data into the "destination" address.
If there is a cache miss on reading the data, for example when reading the data stored within the "source 1" address, then at step 245 said data is fetched by means of Memory Interface 130 into one or more corresponding cache lines of Data Cache Memory 140 that have the required available memory space. For this, Cache Controller 120 issues a fetch request to Memory Interface 130, which reads the corresponding "missed" data from the External/System Memory (e.g., peripherals/memory means 302, L2/L3 cache memories, hard disk and the like) by using Memory Controller 131. After this, the read data is written into the selected cache line(s) of Data Cache Memory (such as Data Cache Memory units 141, 142 and 143). Further, at step 255 (after the "missed" data fetch is accomplished), Memory Interface 130 sends a corresponding indication to Cache Controller 120 and to Hit/Miss Resolve unit 110, acknowledging that the "missed" data is written into the selected cache line(s), and Data Cache Memory Database 111 (Fig. 2) is updated with the new cache line(s) allocations. Thus, when a new request for accessing said selected cache line(s) is issued, it will be interpreted by Hit/Miss Resolve unit 110 as a cache "hit". Also, when there is a cache miss on writing the data, one or more Cache Request Handler units send corresponding signals at step 255 to Hit/Miss Resolve unit 110, and further to Task Ready Queue unit 150, acknowledging that handling of the current cache miss (on writing the data) is accomplished; the CPU pipeline is then instructed to continue its execution accordingly by issuing a "task ready" indication to SRF Control unit 350, at step 260.
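The overall operation of flow chart 200 can be paraphrased as straight-line C++ control flow; every helper below is an invented placeholder, and only the ordering of the steps follows the text:

    #include <cstdint>
    #include <vector>

    // Assumed helpers (all names invented):
    std::vector<uint32_t> decode_all(const std::vector<uint32_t>& cpu_mapped);
    bool related_to_cacheable_memory(const std::vector<uint32_t>& decoded);
    bool all_cached_in(const std::vector<uint32_t>& decoded);
    bool miss_on_read(const std::vector<uint32_t>& decoded);
    bool storage_available();
    void signal_control_unit_continue();
    void signal_cache_miss();
    void write_back_least_recently_used();
    void fetch_missed_data();
    void update_database_and_acknowledge();
    void issue_task_ready();

    void handle_request(const std::vector<uint32_t>& cpu_mapped_addrs) {
        std::vector<uint32_t> decoded = decode_all(cpu_mapped_addrs);  // step 201
        if (!related_to_cacheable_memory(decoded)) {                   // step 205
            signal_control_unit_continue();   // pipeline proceeds without stalls
            return;
        }
        if (all_cached_in(decoded)) {                                  // step 210
            signal_control_unit_continue();                            // step 206
            return;
        }
        signal_cache_miss();                                           // step 215
        if (!storage_available())                                      // step 225
            write_back_least_recently_used();                          // steps 230/235
        if (miss_on_read(decoded))
            fetch_missed_data();                                       // step 245
        update_database_and_acknowledge();                             // step 255
        issue_task_ready();                                            // step 260
    }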
It should be noted that according to an embodiment of the present invention, the number of instructions and CPU clock cycles required for accessing and manipulating/processing (e.g., performing addition, subtraction, data moving, data shifting operations and the like) memory mapped data is significantly reduced compared to the prior art, by providing substantially direct access to Data Cache Memory system 301 for one or more CPU execution units 130 (Fig. 1B). Thus, the number of instructions and corresponding CPU clock cycles for processing the data provided within/from said Data Cache Memory system 301 can be reduced to a single instruction that takes a single CPU clock cycle, further enabling the generation of multiple requests (e.g., up to three requests) for accessing said Data Cache Memory system 301.
According to another embodiment of the present invention, the number of instructions and CPU clock cycles required for accessing and manipulating/processing memory mapped data is reduced with substantially no need to change the structure of the conventional CPU program word (conventional CPU instruction).
According to still another embodiment of the present invention, there is no need to use conventional DMA engines.
According to a further embodiment of the present invention, CPU stalls (delays) are substantially prevented.
While some embodiments of the invention have been described by way of illustration, it will be apparent that the invention can be put into practice with many modifications, variations and adaptations, and with the use of numerous equivalents or alternative solutions that are within the scope of persons skilled in the art, without departing from the spirit of the invention or exceeding the scope of the claims.

Claims

1. A register file system, comprising:
a) at least one cache memory system comprising a plurality of cache memory units assigned with memory addresses, said cache memory system configured to:
a.1. receive at least one memory address and determine whether said at least one memory address is related to a cacheable memory comprising at least said plurality of cache memory units; and
a.1.1. perform one or more of the following, if said at least one memory address is related to said cacheable memory:
a.1.1.1. read data from one or more cache memory units that correspond to said at least one memory address, giving rise to the read data; and
a.1.1.2. write data into the one or more cache memory units that correspond to said at least one memory address; and
a.1.2. send a corresponding signal from said cache memory system toward a control unit, which is operatively coupled to said at least one cache memory system, if said at least one memory address is not related to said cacheable memory; and
b) at least one output port for outputting said read data from said cache memory system, giving rise to the outputted cache memory data.
2. The register file system according to claim 1, wherein the outputted cache memory data is provided to at least one execution unit that is configured to receive said outputted cache memory data and process it, and further configured to write back data into the one or more cache memory units.
3. The register file system according to claim 1, wherein the cacheable memory comprises one or more external cacheable memory units.
4. The register file system according to claim 3, further comprising a memory controller for enabling conveying data from the one or more external cacheable memory units into the one or more cache memory units.
5. The register file system according to claim 3, further comprising a memory controller for enabling writing back data from the one or more cache memory units into the one or more external cacheable memory units.
6. The register file system according to claim 1, wherein the memory addresses that are received by means of the cache memory system are processing unit (PU) mapped addresses.
7. The register file system according to claim 6, wherein the received PU mapped addresses are decoded or converted by means of one or more of the following: a) the register file system; and b) the cache memory system.
8. The register file system according to claim 1, wherein the cache memory system further comprises a hit/miss resolve unit for determining whether the received one or more addresses relate to the cacheable memory and whether there is a cache hit request or cache miss request.
9. The register file system according to claim 1, wherein the cache memory system further comprises a cache controller unit for controlling the operation of the cache memory system.
10. The register file system according to claim 8, wherein the cache memory system further comprises one or more cache request handler units for determining for each cache miss request whether it is a cache miss on reading the data or on writing the data.
11. The register file system according to claim 3, wherein the cache memory system further comprises a memory interface for enabling said cache memory system to interact with the one or more external cacheable memory units.
12. The register file system according to claim 1, wherein the cache memory system further comprises a data cache memory device provided with the plurality of data cache memory units for storing data.
13. The register file system according to claim 12, wherein the cache memory system further comprises a Least Recently Used (LRU) control unit for determining whether there is available storage space within the data cache memory device.
14. The register file system according to claim 13, wherein the LRU control unit allocates the required storage space within the data cache memory device.
15. The register file system according to claim 3, wherein the cache memory system further comprises a Least Recently Used (LRU) control unit for enabling a write back of data from the data cache memory device into the one or more external cacheable memory units.
16. The register file system according to claim 8, wherein the cache memory system further comprises a task ready queue unit for keeping a status of current cache miss requests being handled by the cache memory system.
17. The register file system according to claim 8, wherein the cache memory system further comprises a cache request queue unit configured to maintain a queue of cache miss requests.
18. A processing unit (PU) device, comprising:
a) a register file system provided with:
a.1. at least one cache memory system comprising a plurality of cache memory units having respective memory addresses, said cache memory system configured to:
a.1.1. receive at least one memory address and determine whether said at least one memory address is related to a cacheable memory comprising at least said plurality of cache memory units; and
a.1.1.1. perform one or more of the following, if said at least one memory address is related to said cacheable memory:
- read data from one or more cache memory units that correspond to said at least one memory address, giving rise to the read data; and
- write data into the one or more cache memory units that correspond to said at least one memory address; and
a.1.1.2. send a corresponding signal from said cache memory system toward a control unit, which is operatively coupled to said at least one cache memory system, if said at least one memory address is not related to said cacheable memory; and
a.2. at least one output port for outputting said read data from said cache memory system, giving rise to the outputted cache memory data; and
b) at least one execution unit configured to receive said outputted cache memory data and process it, and configured to write back data into said one or more corresponding cache memory units.
19. The processing unit device according to claim 18, wherein the cacheable memory further comprises one or more external cacheable memory units.
20. The processing unit device according to claim 18, wherein the cacheable memory further comprises a memory controller for enabling conveying data from the one or more external cacheable memory units into the one or more cache memory units.
21. The processing unit device according to claim 18, wherein the cacheable memory further comprises a memory controller for enabling writing back data from the one or more cache memory units into the one or more external cacheable memory units.
22. The processing unit device according to claim 18, wherein the cache memory system further comprises a hit/miss resolve unit for determining whether the received one or more addresses relate to the cacheable memory and whether there is a cache hit request or cache miss request.
23. The processing unit device according to claim 18, wherein the cache memory system further comprises a cache controller unit for controlling the operation of the cache memory system.
24. The processing unit device according to claim 22, wherein the cache memory system further comprises one or more cache request handler units for determining for each cache miss request whether it is a cache miss on reading the data or on writing the data.
25. The processing unit device according to claim 19, wherein the cache memory system further comprises a memory interface for enabling said cache memory system to interact with the one or more external cacheable memory units.
26. The processing unit device according to claim 18, wherein the cache memory system further comprises a data cache memory device provided with the plurality of data cache memory units for storing data.
27. The processing unit device according to claim 26, wherein the cache memory system further comprises a Least Recently Used (LRU) control unit for determining whether there is available storage space within the data cache memory device.
28. The processing unit device according to claim 27, wherein the LRU control unit allocates the required storage space within the data cache memory device.
29. The processing unit device according to claim 19, wherein the cache memory system further comprises a Least Recently Used (LRU) control unit for enabling a write back of data from the data cache memory device into the one or more external cacheable memory units.
30. A method of operating with a cache memory system, which is at least partially provided within a register file system, and which is configured to substantially directly communicate with an execution unit, said method comprising:
a) receiving at least one memory address;
b) determining whether said at least one memory address is related to a cacheable memory comprising at least a plurality of cache memory units; and
b.1. performing one or more of the following, if said at least one memory address is related to said cacheable memory:
b.1.1. reading and outputting data from one or more cache memory units that correspond to said at least one memory address, giving rise to the outputted cache memory data, and then receiving said outputted cache memory data and processing it; and
b.1.2. writing data into the one or more cache memory units that correspond to said at least one memory address; and
b.2. sending a corresponding signal from said cache memory system toward a control unit, which is operatively coupled to said cache memory system, if said at least one memory address is not related to said cacheable memory.
31. The method according to claim 30, further comprising providing the outputted cache memory data to at least one execution unit that is configured to receive said outputted cache memory data and process it, and further configured to write back data into the one or more cache memory units.
32. The method according to claim 30, further comprising providing one or more external cacheable memory units within the cacheable memory.
33. The method according to claim 32, further comprising enabling conveying data from the one or more external cacheable memory units into the one or more cache memory units.
34. The method according to claim 32, further comprising enabling writing back data from the one or more cache memory units into the one or more external cacheable memory units.
35. The method according to claim 30, further comprising determining whether there is a cache hit request or cache miss request.
36. The method according to claim 35, further comprising determining for each cache miss request whether it is a cache miss on reading the data or on writing the data.
37. The method according to claim 32, further comprising enabling the cache memory system to interact with the one or more external cacheable memory units.
38. The method according to claim 30, further comprising determining whether there is available storage space within the cache memory system.
39. The method according to claim 30, further comprising allocating the required storage space within the cache memory system.
40. The method according to claim 30, further comprising performing a write back of data from the cache memory system into the one or more external cacheable memory units.
41. The method according to claim 30, further comprising keeping a status of current cache miss requests being handled by the cache memory system.
42. The method according to claim 30, further comprising maintaining a queue of cache miss requests.
PCT/IB2009/052610 2008-06-23 2009-06-18 Method, register file system, and processing unit device enabling substantially direct cache memory access WO2009156920A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US12938508P 2008-06-23 2008-06-23
US61/129,385 2008-06-23

Publications (1)

Publication Number Publication Date
WO2009156920A1 true WO2009156920A1 (en) 2009-12-30

Family

ID=41444100

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/IB2009/052610 WO2009156920A1 (en) 2008-06-23 2009-06-18 Method, register file system, and processing unit device enabling substantially direct cache memory access

Country Status (1)

Country Link
WO (1) WO2009156920A1 (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5890216A (en) * 1995-04-21 1999-03-30 International Business Machines Corporation Apparatus and method for decreasing the access time to non-cacheable address space in a computer system
US6178482B1 (en) * 1997-11-03 2001-01-23 Brecis Communications Virtual register sets
US20020178334A1 (en) * 2000-06-30 2002-11-28 Salvador Palanca Optimized configurable scheme for demand based resource sharing of request queues in a cache controller
US6564301B1 (en) * 1999-07-06 2003-05-13 Arm Limited Management of caches in a data processing apparatus


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 09769729

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

WPC Withdrawal of priority claims after completion of the technical preparations for international publication

Ref document number: 61/129,385

Country of ref document: US

Date of ref document: 20101220

Free format text: WITHDRAWN AFTER TECHNICAL PREPARATION FINISHED

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 24.02.12)

122 Ep: pct application non-entry in european phase

Ref document number: 09769729

Country of ref document: EP

Kind code of ref document: A1