WO1998003919A1 - A cache system and method using tagged cache lines for matching cache strategy to I/O application - Google Patents


Info

Publication number
WO1998003919A1
Authority
WO
WIPO (PCT)
Prior art keywords
cache
memory
memory region
data
tag
Application number
PCT/US1997/008921
Other languages
French (fr)
Inventor
K. Vincent Bridgers
Michael Drake
Richard W. Schuckle
Original Assignee
Advanced Micro Devices, Inc.
Application filed by Advanced Micro Devices, Inc. filed Critical Advanced Micro Devices, Inc.
Publication of WO1998003919A1 publication Critical patent/WO1998003919A1/en

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F12/00 - Accessing, addressing or allocating within memory systems or architectures
    • G06F12/02 - Addressing or allocation; Relocation
    • G06F12/08 - Addressing or allocation; Relocation in hierarchically structured memory systems, e.g. virtual memory systems
    • G06F12/0802 - Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F12/00 - Accessing, addressing or allocating within memory systems or architectures
    • G06F12/02 - Addressing or allocation; Relocation
    • G06F12/08 - Addressing or allocation; Relocation in hierarchically structured memory systems, e.g. virtual memory systems
    • G06F12/0802 - Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches
    • G06F12/0888 - Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches using selective caching, e.g. bypass
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F2212/00 - Indexing scheme relating to accessing, addressing or allocation within memory systems or architectures
    • G06F2212/20 - Employing a main memory using a specific memory technology
    • G06F2212/206 - Memory mapped I/O

Definitions

  • the present invention relates to computer systems. More specifically, the invention relates to a cache memory and method of operating a cache memory in a computer system.
  • a cache memory is a high-speed memory array which acts as a buffer between a CPU and a main memory of a computer system.
  • a cache memory is effective because most software repeatedly executes the same instructions or manipulates data in the same spatial location in the memory.
  • a cache memory increases effective memory speed by accessing slower main memory only once to fetch instructions or data with subsequent executions operating from the faster cache memory.
  • a cache memory includes a data memory, a controller, and a directory.
  • the cache memory controls the CPU. If the cache memory includes a valid copy of data requested by the CPU during a CPU read cycle, then the cache allows the CPU to operate at the speed of the cache. If the cache does not contain a copy of the data, then the cache initiates a main memory read cycle, copies the data supplied by the main memory which the main memory indicates is valid using a ready signal, and allows the CPU to continue operating. Data applied to the CPU is routed by the cache controller either from the cache for the case of a cache hit or from buffers that isolate the main memory from the CPU for a cache miss. The cache hit results in no wait states. For the cache miss, the cache controller disables a memory ready signal so that the CPU inserts wait states. The cache memory performs this control function by intercepting all signals from the CPU, including input signals and output signals, thereby isolating the CPU from other devices.
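  • The read path described above can be sketched as a minimal cache model. This is an illustrative sketch, not the patent's implementation; the class and method names are invented.

```python
class Cache:
    """Toy direct-mapped cache: hits are served locally, misses fill from main memory."""

    def __init__(self, num_lines=4, line_size=16):
        self.num_lines = num_lines
        self.line_size = line_size
        self.tags = [None] * num_lines   # address tag per cache line
        self.data = [None] * num_lines   # cached line contents
        self.hits = 0                    # served with no wait states
        self.misses = 0                  # require a main memory read cycle

    def read(self, address, main_memory):
        line_addr = address - (address % self.line_size)
        index = (line_addr // self.line_size) % self.num_lines
        if self.tags[index] == line_addr:
            # cache hit: the CPU operates at the speed of the cache
            self.hits += 1
        else:
            # cache miss: fetch the line from (slower) main memory,
            # during which a real CPU would insert wait states
            self.misses += 1
            self.data[index] = main_memory.get(line_addr)
            self.tags[index] = line_addr
        return self.data[index]
```

A second read within the same 16-byte line hits without touching main memory, which is the locality effect the surrounding text describes.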
  • a cache line typically includes 16 or 32 bytes.
  • the cache line fill replaces a line in the cache.
  • Data that is changed while in the cache is written to memory prior to replacement.
  • One technique for writing changed cache data to memory is a write-through technique in which a write operation of the CPU causes a data transfer to main memory both for cache hits and cache misses.
  • the write-through technique advantageously ensures consistency of the main memory, if only for single-processor systems.
  • the write-through technique disadvantageously causes wait states.
  • An alternative technique for writing changed cache data to memory is a write-back cache technique in which all write operations are collected and the main memory is only updated upon receipt of an explicit update command or inquiry.
  • One example of such a command is a software "write-back and invalidate data cache" (BINVD) instruction.
  • Another example of such a command is a hardware FLUSH signal which occurs implicitly as the result of a cache miss.
  • the disadvantage of the write-back cache is that the exchange of cache lines has a longer duration because data must be written into memory before new data is read into the CPU.
  • a further alternative technique is a write-allocate technique in which the cache controller implements a write-allocate strategy by filling the cache space for a cache line with the data for the address to be written.
  • the data is first written through to the memory.
  • the cache controller then reads the applicable cache line with the entry to be updated into the cache.
  • the data is written through so that the CPU can restart program execution.
  • the cache controller independently performs the write-allocate in parallel with the CPU operation.
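  • The contrast between the write-through and write-back strategies above can be sketched as follows (the write-allocate case is omitted for brevity). The class and field names are assumptions for illustration only.

```python
class WriteCache:
    """Toy model contrasting write-through and write-back write strategies."""

    def __init__(self, policy):
        self.policy = policy      # "write-through" or "write-back"
        self.lines = {}           # cached line address -> value
        self.dirty = set()        # addresses changed while in the cache
        self.memory_writes = 0    # main-memory write cycles (source of wait states)

    def write(self, memory, addr, value):
        self.lines[addr] = value
        if self.policy == "write-through":
            # main memory is updated on every write, hit or miss
            memory[addr] = value
            self.memory_writes += 1
        else:
            # write-back: the write is collected; memory is updated later
            self.dirty.add(addr)

    def flush(self, memory):
        # explicit update, as with a "write-back and invalidate" (BINVD) command
        for addr in sorted(self.dirty):
            memory[addr] = self.lines[addr]
            self.memory_writes += 1
        self.dirty.clear()
        self.lines.clear()        # invalidate the cached lines
```

Repeated writes to the same address cost one memory cycle under write-back (at flush time) but one cycle per write under write-through.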
  • Multimedia computers access data in the forms of video data flows, audio data flows, video frames and the like, in addition to conventional handling of data and instructions. Caching characteristics of the different data flows are substantially different.
  • video data may flow through a cache memory while undergoing recursive processing in which the same data is processed periodically. The video data is quickly replaced due to the large volumes of data that are processed. The same data then must be brought back into the cache for recursive processing. This process constantly repeats, causing considerable thrashing of the cache memory.
  • audio data typically involves smaller data volumes but is most efficiently handled using cache strategies that are different from strategies used for handling video data, conventional data or operating instructions.
  • a computer system including a processor, a main memory and a cache memory uses tagging of various regions of memory to define and select caching properties of transfers between the processor and memory via the cache.
  • the main memory contains not only standard random access memory (RAM) and read-only memory (ROM) but also memory-mapped input/output (I/O) sources. Tagging of the memory regions configures the regions for association with a particular set of caching properties.
  • the caching properties are tagged according to the character of the data, in terms of data locality, the volume and timing of data transmission, the type of processing performed on the data, and the like. For example, a memory-mapped video I/O buffer may be tagged with an MM_IO_VBUF tag designating the caching properties of write-back cacheability with weak read/write ordering.
  • Low-level operating system software, such as the Hardware Abstraction Layer (HAL) interface of the Windows NT™ operating system or device driver software, initializes the memory regions and the cache and makes symbolic associations between the memory regions and the cache.
  • the cache, operating as directed by the memory tags, allows read and write operations that are used for performing various types of multimedia or signal processing operations including decompression, drawing operations, compression, mixing, and the like, which are performed on a virtually-cached multimedia drawing surface.
  • the data for performing the multimedia or signal processing operations are either already located on the I/O surface or read from another storage location local to the processor or from an external processor, which is also cached and tagged as a special cached region.
  • the processor executes operations acting on the cached data. When all operations are completely executed by the processor, only the cached memory regions are flushed using a flush instruction such as CFLSH[MM_IO_VBUF].
  • the flush instruction directs the cache to write back and invalidate the regions having the designated tag, here MM_IO_VBUF.
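  • A tag-selective flush of this kind might look like the following sketch. The tag name follows the text; the data structures and method names are assumptions for illustration.

```python
# Caching-property tag from the text; other region tags would be defined similarly.
MM_IO_VBUF = "MM_IO_VBUF"   # write-back cacheable, weak read/write ordering

class TaggedCache:
    def __init__(self):
        self.lines = []   # each line: {"addr", "data", "tag", "dirty"}

    def fill(self, addr, data, tag):
        self.lines.append({"addr": addr, "data": data, "tag": tag, "dirty": False})

    def write(self, addr, data):
        for line in self.lines:
            if line["addr"] == addr:
                line["data"] = data
                line["dirty"] = True      # line is now newer than main memory

    def cflsh(self, tag, memory):
        """Write back and invalidate only the lines carrying the designated tag."""
        remaining = []
        for line in self.lines:
            if line["tag"] == tag:
                if line["dirty"]:
                    memory[line["addr"]] = line["data"]   # write back
                # line is dropped from the cache -> invalidated
            else:
                remaining.append(line)    # lines with other tags are untouched
        self.lines = remaining
```

Lines tagged for other regions, such as program memory, survive the flush, which is the selectivity the CFLSH[MM_IO_VBUF] example relies on.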
  • One advantage is that various types of calculations are performed in place on memory-mapped I/O buffers very quickly, reducing the time of total operations and increasing computer system performance.
  • FIGURE 1 is a schematic block diagram illustrating a suitable computer system for incorporating a cache memory and a plurality of memory-mapped I/O surfaces in accordance with an embodiment of the present invention.
  • FIGURE 2 is a schematic block diagram showing a suitable virtual cache structure for implementing a cache system and method using tagged cache lines for matching cache policy to I/O application.
  • FIGURE 3 is a memory map which illustrates an embodiment of a main memory made up of a plurality of memory regions.
  • FIGURE 4 shows a storage for each memory region including memory region tag information.
  • FIGURE 5 is a schematic block diagram showing an organization of a low-level operating system for controlling the operations of the cache.
  • FIGURE 6 is a block diagram of a microprocessor connected to a secondary cache in accordance with the present invention.
  • FIGURE 7 is a block diagram which illustrates the functional and structural blocks of a secondary cache in accordance with the present invention.
  • Referring to FIGURE 1, a schematic block diagram illustrates a suitable audio and video performance computer system 100 for incorporating a cache storage 102 and a plurality of memory-mapped I/O devices including an audio peripheral 104, a motion video peripheral 106, a small systems computer interface (SCSI) host bus adapter 108, a local area network (LAN) adapter 110, a serial input/output (SIO) PCI/ISA bridge 112, and a graphics adapter 114.
  • the various memory-mapped I/O devices include regions of memory that are tagged to define region-specific cache operating characteristics, allowing for the cache strategy to be matched to the operations and functions performed by devices within the selected memory regions.
  • the computer system 100 employs an architecture based on a bus, such as a PCI bus interface 120, and includes a central processing unit (CPU) 116 connected to the PCI bus interface 120 through a Host/PCI/Cache interface 118.
  • the Host/PCI/Cache interface 118 includes the cache storage 102.
  • the CPU 116 is connected to a main system storage 122 through the Host/PCI/Cache interface 118.
  • Referring to FIGURE 2, a schematic block diagram shows a suitable virtual cache structure for implementing a cache system and method using tagged cache lines for matching cache policy to I/O application.
  • the illustrative system includes a processor 200, a memory management unit (MMU) 202, address buffers 204, data buffers 206, a system bus 208 and a virtual cache 210.
  • the processor 200 accesses information in a main storage (not shown) connected to the system bus 208 via the virtual cache 210 and uses tagging of various regions of the main storage to define and select caching properties of transfers between the processor 200 and storage via the virtual cache 210.
  • the virtual cache 210 includes a cache memory 212 including a plurality of cache lines and a tag memory 214 having a plurality of tags. Each tag of the tag memory 214 corresponds to a cache line of the cache memory 212.
  • the tags include a plurality of information fields defining a state of the corresponding cache line.
  • a cache line is the smallest division of a cache memory for which there is a distinct tag.
  • a cache line typically may include one or more words of cache memory.
  • One of the information fields in the tag memory 214 is an address field 216 specifying a memory address in the memory space that is associated to the cache line at a particular time. Other information fields contain state information that is permanently stored in association with a memory region containing the memory address held in the address field 216.
  • Referring to FIGURE 3, a memory map 300 illustrates an embodiment of a main memory 301 made up of a plurality of memory regions.
  • the main memory 301 contains not only standard random access memory (RAM) and read-only memory (ROM) but also memory-mapped input/output (I/O) sources.
  • the main memory 301 includes an operating system region 302, a program memory region 304, an I/O buffer region 310, a video frame buffer region 312, an audio buffer region 314, a high-speed audio/video device buffer 316, and the like.
  • the various memory regions include a storage holding tag information that is transferred to the tag memory 214 shown in FIGURE 2 when a cache line is associated to a memory region.
  • the tag information is advantageously selected to improve caching and memory access performance for the particular type of information that is transferred from the particular memory region.
  • the processor 200 performs various operations that are specific to a particular memory region. Often the algorithms, types of data, volume of data, and data handling conditions and characteristics are specific to a particular memory region. For example, the processor 200 generally performs algorithmic and computation-intensive operations such as mixing, down sampling, up sampling, cancellation, and the like on data such as audio and video data. These operations are rarely performed on information in a region such as a program memory region but are often performed on data in memory-mapped regions connected to signal sources. In addition, audio and video data is generally accessed in large buffers of data, sometimes in buffers of one megabyte or larger. When data of this volume is accessed using a conventional cache structure, locality, the principle of cache operation, is quickly destroyed.
  • the specific memory regions include tag information that is written to the tag memory 214 when a cache line is associated to the memory region, thereby directing the specified caching policies so that the cache line handles the data in a manner which is favorable to the data characteristics of the memory region.
  • Examples of caching properties, attributes and strategies that are defined and selected by the tags in the tag memory 214 include read-cacheability, write-back cacheability with strong read/write ordering, and write-back cacheability with weak read/write ordering.
  • an illustrative storage 400 for each memory region includes memory region tag information in a cache technique field 410, a cache line allocate field 412, a replacement algorithm control field 414, and a special codes field 416.
  • the storage 400 may be located at a memory-mapped address within the memory region or may be held elsewhere, for example, in an operating system region of memory.
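  • One possible in-memory layout for this per-region tag storage is sketched below. The field names follow FIGURE 4; the value encodings and defaults are invented for illustration.

```python
from dataclasses import dataclass, field

@dataclass
class MemoryRegionTagStorage:
    """Per-region tag storage modeled on FIGURE 4 (encodings are illustrative)."""
    cache_technique: str = "write-back"        # field 410: write-back / write-through / read-through
    cache_line_allocate: bool = True           # field 412: allow automatic line allocation
    replacement_algorithm: str = "LRU"         # field 414: LRU / pseudo-LRU / round-robin
    special_codes: list = field(default_factory=list)  # field 416: e.g. filter-storage codes

# Example: a region configured like the video I/O buffer discussed earlier.
video_buffer_tags = MemoryRegionTagStorage(
    cache_technique="write-back",
    replacement_algorithm="round-robin",
)
```

Holding the structure at a memory-mapped address inside the region, or in an operating system table, are both consistent with the text.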
  • the cache technique field 410 is used to specify the caching technique employed for a particular memory region. For example, write-back, write-through and read-through caching techniques may be specified.
  • a write-through cache always updates main memory during all write cycles.
  • Three alternative write-through caching techniques include: (1) Updating the cache if a hit occurs and the cache ignores the write cycle if a miss occurs. (2) If a miss occurs, the line is automatically invalidated. (3) A line in the cache is written whether the cycle is a hit or miss.
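  • The three variants can be captured in a single function. The dict-based cache model below is a sketch, not the patent's hardware.

```python
def write_through(cache, memory, addr, value, variant):
    """Model the three write-through miss-handling variants described above."""
    memory[addr] = value              # main memory is updated in every variant
    hit = addr in cache
    if variant == 1:
        if hit:                       # update on a hit; ignore the cycle on a miss
            cache[addr] = value
    elif variant == 2:
        if hit:
            cache[addr] = value
        else:                         # automatically invalidate the line on a miss
            cache.pop(addr, None)
    elif variant == 3:
        cache[addr] = value           # write the line whether the cycle hits or misses
```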
  • In a write-back cache, the cache does not always update the main memory but reduces main memory accesses by writing data to the cache alone at a much higher speed than data is written to main memory. Data written into the cache by the CPU is not written into main memory until the data line in the cache is to be replaced. A Dirty bit is set in the cache when a location has been written.
  • Updating of the main memory is typically controlled using a "Dirty" bit for each cache line which designates whether a line in the cache is more current than its corresponding main memory location.
  • the Dirty bit allows only those lines in the cache that have been modified to be written to main memory, thereby conserving a substantial amount of bus bandwidth that would otherwise be spent on unnecessary write operations.
  • Data which are written to the cache but not updated in the main memory are evicted to the main memory during replacement when the Dirty bit is set.
  • the cache line allocate field 412 associates a particular region of memory to a particular cache line in the cache memory 212. In various embodiments, the cache line allocate field 412 specifies operational characteristics such as characteristics controlling how the cache line is retained in the cache memory 212, controlling how the cache line is written back to memory, and preventing automatic allocation of cache lines to the memory region.
  • the replacement algorithm control field 414 is used to select a particular cache replacement algorithm from a plurality of implemented replacement algorithms.
  • Implemented replacement algorithms may include any suitable known replacement algorithms including, but not limited to, a least recently used (LRU) algorithm, a pseudo-least recently used algorithm, and a round robin replacement algorithm.
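  • For example, the least recently used policy could be modeled as follows; the per-set bookkeeping shown is an assumption for illustration.

```python
from collections import OrderedDict

class LRUSet:
    """One set of a set-associative cache with LRU replacement (illustrative)."""

    def __init__(self, ways):
        self.ways = ways
        self.lines = OrderedDict()    # line address -> data, least recently used first

    def access(self, addr, fill_value=None):
        if addr in self.lines:
            self.lines.move_to_end(addr)      # mark as most recently used
            return self.lines[addr]
        if len(self.lines) >= self.ways:
            self.lines.popitem(last=False)    # evict the least recently used line
        self.lines[addr] = fill_value         # line fill on a miss
        return fill_value
```

A pseudo-LRU or round-robin selection would replace only the eviction step, which is why a small control field in the region's tag storage suffices to select among them.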
  • the special codes field 416 includes codes for selecting special cache operations.
  • a special code may be included for activating a filter storage operation in which a plurality of locations in the cache line are reserved for storing intermediate values determined in a filtering operation or to cache selected variables that are used repeatedly in a lengthy computation. These storage locations are allocated and fixed in the cache memory so that, as the computation executes through a data stream, cache misses are avoided since frequently-used intermediate calculation variables are fixed in the cache memory. By avoiding cache misses and reads from main memory, performance is enhanced over the course of a lengthy computation across a data stream.
  • the cache, operating as directed by the special codes field 416, stores data for read and write operations that are used for performing various types of multimedia or signal processing operations including decompression, drawing operations, compression, mixing, and the like, which are performed on a virtually-cached multimedia drawing surface.
  • Data for performing the multimedia or signal processing operations are either located on the I/O surface or read from another storage location local to the processor or from an external processor, which is also cached and tagged as a special cached region.
  • the processor executes operations acting on the cached data. When all operations are completely executed by the processor, only the cached memory regions are flushed using a flush instruction such as CFLSH[MM_IO_VBUF].
  • the flush instruction directs the cache to write back and invalidate the regions having the designated tag, here MM_IO_VBUF.
  • a conventional cache has reduced performance during handling of large streams of data for signal processing applications such as multimedia applications. Reduced performance results because the large streams of data disrupt the locality principle upon which caches operate, causing thrashing as the same data is repeatedly inserted and removed from the cache.
  • the illustrative caching system adapts the cache to implement caching policies that are advantageous to the data handling of a particular memory region so that locality is maintained even during handling of large amounts of data, large data streams, and lengthy computations.
  • Referring to FIGURE 5, a schematic block diagram shows an organization of a low-level operating system for controlling the operations of the cache.
  • Low-level operating system software, such as the Hardware Abstraction Layer (HAL) interface of the Windows NT™ operating system or device driver software such as a video or multimedia device driver, initializes the memory regions and the cache and makes symbolic associations between the memory regions and the cache.
  • the low-level operating system 500 includes a hardware abstraction level (HAL) 510, a processor/machine 512, and, above the HAL 510, various drivers.
  • the drivers include an audio driver 514, a video driver 516, and a signal processing filter 518.
  • Each of these operating system structures operates using control information accessed from tags in the various memory regions, achieving a large degree of flexibility in the operation of the operating system. Regions of system memory are dedicated to particular applications and the cache is configured to most efficiently handle data for the particular applications.
  • Drivers such as the audio driver 514 and the video driver 516, initialize the cache as data is accessed from a specific memory region.
  • the driver accesses the tagged information from the memory region, uses the information to determine system capabilities, and adapts the structure and operations of a cache line according to the tagged information.
  • the information may include, for example, a base memory address, a list of attributes, and user control information.
  • the drivers then query the hardware abstraction level 510, using the tagged information to specify resource requirements.
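  • This driver flow might be sketched as follows. The HAL interface shown is an assumption for illustration, not the actual Windows NT HAL API, and the stub merely echoes a plausible resource grant.

```python
def initialize_region(region, hal):
    """Read a region's tag storage, then request matching cache resources from the HAL."""
    tags = region["tags"]                         # tagged information held with the region
    request = {
        "base_address": region["base_address"],
        "cache_technique": tags["cache_technique"],
        "replacement_algorithm": tags["replacement_algorithm"],
    }
    return hal.allocate(request)                  # HAL designates cache lines and lengths

class StubHAL:
    """Stand-in for the hardware abstraction layer (hypothetical interface)."""
    def allocate(self, request):
        # designate a cache line and buffer length for the request (illustrative values)
        return {"cache_line": 0, "length": 16, **request}
```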
  • Tagging of the memory regions configures the regions for association with a particular set of caching properties.
  • a memory-mapped video I/O buffer may be tagged with an MM_IO_VBUF tag designating the caching properties of write-back cacheability with weak read/write ordering.
  • the hardware abstraction level 510 allocates resources, for example, designating a cache line for operation as an audio buffer and another cache line for operation as a frame buffer, and designating the length of the buffers.
  • the tagged memory information specifies latency information for usage by the drivers in determining appropriate buffer sizes.
  • Referring to FIGURE 6, a superscalar X86 microprocessor 600 executes the X86 instruction set.
  • Microprocessor 600 is connected to physically addressed external memory 601 via a 486 XL bus or other conventional microprocessor bus.
  • Microprocessor 600 includes a primary (L1) cache 605 incorporating an instruction cache 604 and a data cache 650.
  • the instruction cache 604 is connected to byte queue 606 which is connected to instruction decoder 608.
  • Instruction decoder 608 is connected to the RISC core of the microprocessor 600.
  • RISC core includes register file 612 and reorder buffer 614 as well as a variety of functional units such as shift unit 630 (SHF), arithmetic logic units 631, 632 (ALU0 and ALU1), special register block 633 (SRB), load/store unit 634 (LSSEC), branch section 635 (BRNSEC), and floating point unit 636 (FPU).
  • the microprocessor 600 is a single VLSI chip circuit. A secondary (L2) cache 690, in a circuit connected to the microprocessor VLSI chip, is connected to the primary cache 605 of the microprocessor chip.
  • the L1 cache 605 is organized as a split data and instruction cache including an 8K byte 4-way set-associative data cache 650 and a 16K byte 4-way set-associative instruction cache 604. Both the data cache 650 and the instruction cache 604 are virtually addressed and virtually tagged to avoid translation of addresses before a cache access. A single set of physical tags is shared by both the instruction cache 604 and the data cache 650. When a change is made to the virtual-to-physical mapping of the L1 cache 605, the virtual cache tags are invalidated.
  • the instruction cache is organized as 1024 blocks of 16 predecoded X86 instruction bytes.
  • the L1 cache 605 has an address tag array that is dual-ported and contains 1024 entries, each including a 20-bit linear address tag, a single valid bit designating valid status for the entire block and 16 individual byte-valid bits, one for each of the sixteen corresponding instruction bytes within the instruction store array.
  • the data cache 650 is dual-ported and connected to dual load/store units 634 to allow two accesses to the 8K byte data cache 650 to be performed in a single clock cycle so long as no two accesses are to the same bank. Dual load/store units 634 are implemented to address a high incidence of load and store operations that typically result due to the small number of registers provided in an x86 architecture. Ports (not shown) of the data cache 650 support one or two load operations or, alternatively, one or two store operations.
  • the data cache 650 includes four banks and has two access ports, one for each load/store unit 634. Data accesses to different banks proceed in parallel although two accesses to the same cache bank occur in the same cycle if both accesses are applied to the same cache line.
  • Data cache 650 is connected to load/store functional unit 634 of the RISC core and with intraprocessor address and data (IAD) bus 602.
  • Instruction cache 604 is also connected with IAD bus 602.
  • the instruction cache 604 and data cache 650, in combination, form the primary (L1) cache 605 on the microprocessor chip.
  • the primary (L1) cache 605 is connected to the secondary (L2) cache 690 which is not fabricated on the VLSI microprocessor chip.
  • Instruction cache 604 and data cache 650 are both linearly addressable caches. Instruction cache 604 and data cache 650 are physically separate, however, both caches are organized using the same architecture.
  • Microprocessor 600 also includes a physical tag circuit 662. Physical tag circuit 662 interacts with both instruction cache 604 and data cache 650 via the IAD bus.
  • RISC core 610 includes A and B operand buses 616, type and dispatch (TAD) bus 618 and result bus 640 which are connected to the functional units as well as displacement and instruction, load store (INLS) bus 619 which is connected between instruction decoder 608 and load/store unit 634.
  • A and B operand buses 616 are also connected to register file 612 and reorder buffer 614.
  • TAD bus 618 is also connected to instruction decoder 608.
  • Result bus 640 is also connected to reorder buffer 614.
  • branch section 635 is connected to reorder buffer 614, instruction decoder 608 and instruction cache 604 via Xtarget bus 603.
  • A and B operand buses 616 include four parallel 41-bit wide A operand buses and four parallel 41-bit wide B operand buses as well as four parallel 12-bit wide A tag buses, four parallel 12-bit wide B tag buses, a 12-bit wide A tag valid bus, a 12-bit wide B tag valid bus, four 4-bit wide destination tag buses and four 8-bit wide opcode buses.
  • Type and dispatch bus 618 includes four 3-bit wide type code buses and one 4-bit wide dispatch bus.
  • Displacement and INLS bus 619 includes two 32-bit wide displacement buses and two 8-bit wide INLS buses.
  • Microprocessor 600 also includes memory management unit (MMU) 664 and bus interface unit 660 (BIU). TLB 664 is connected with the IAD bus and physical translation circuit 662. Bus interface unit 660 is connected to physical translation circuit 662, data cache 650 and IAD bus 602 as well as an external microprocessor bus such as the 486 XL bus. Microprocessor 600 executes computer programs which include sequences of instructions. Computer programs are typically stored on a hard disk, floppy disk or other nonvolatile storage media, such as a BIOS ROM, which are located in the computer system. When the program is run, the program is loaded from the storage media into main memory 601. Once the instructions of the program and associated data are in main memory 601, individual instructions are prepared for execution and ultimately executed by microprocessor 600.
  • After being stored in main memory 601, the instructions are passed via bus interface unit 660 to instruction cache 604, where the instructions are temporarily held.
  • the x86 instructions are provided by instruction cache 604 to instruction decoder 608 using a fetch program counter value which is generated by instruction cache 604.
  • the secondary (L2) cache 690 includes a static random access memory (SRAM) data and code memory 710, a cache SRAM control circuit 712, a tag control circuit 714, a cache decode circuit 716, a cache tag memory 718, a back invalidation buffer and control circuit 720 and a replacement algorithm select circuit 760.
  • the L2 cache 690 is organized as a split data and instruction cache including an 8K byte 4-way set- associative data cache 650 and a 16K byte 4-way set-associative instruction cache 604.
  • the line size of the L2 cache 690 is 16 bytes, matching the line size of the L1 cache 605.
  • Both the L1 cache 605 and L2 cache 690 maintain coherency line by line through snooping and through inclusion of the contents of the L1 cache 605 in the L2 cache 690 storage.
  • L1 cache 605 invalidations are limited through usage of inclusion.
  • the L2 cache 690 directory is used to prescreen invalidation cycles before the cycles are passed to the smaller L1 cache. Accordingly, the contents of the L1 cache are always a subset of the contents of the L2 cache.
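  • A minimal sketch of this prescreening, assuming simple set-based directories for both cache levels:

```python
def snoop_invalidate(addr, l2_directory, l1_directory):
    """Screen an external invalidation cycle through the L2 directory first."""
    if addr not in l2_directory:
        # By inclusion, an address absent from L2 cannot be held in L1,
        # so the smaller L1 cache never sees this invalidation cycle.
        return False
    l2_directory.discard(addr)
    l1_directory.discard(addr)    # forward the invalidate to L1
    return True
```

Most snoop traffic misses in L2 and is filtered out, which is why inclusion limits L1 invalidations.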
  • Characteristics of the L1 cache 605 greatly influence the operations of the L2 cache 690.
  • the 8K byte 4-way set-associative data cache 650 and the 16K byte 4-way set-associative instruction cache 604 greatly reduce the number of external read cycles seen by the L2 cache 690 and the main memory 601. Accordingly, the processor external bus cycles are more likely to be write cycles than read cycles.
  • the L2 cache 690 is a split, 4-way set associative look-aside cache.
  • the L2 cache 690 operates under the control of the processor.
  • the L2 cache 690 sets a read condition as a default condition for the next processor cycle. The cycle therefore begins by enabling data RAM output signals onto the data bus. These RAM output signals are disabled under two conditions: first, when the processor W/R# output signal is high, indicating a requested write cycle; second, during a read cycle after the cache-tag RAM detects a miss or a cycle to a noncacheable address. In the case of a cache hit, the cache-tag RAM and the tag control circuit 714 signify that the cache data is ready using a burst ready input signal so that the processor can continue the operation.
  • In the case of a cache miss, the processor is stalled and the cache data RAM is disabled from placing data onto the data bus. If the L2 cache 690 contains a valid copy of data or code requested by the processor during a processor read cycle, then the L2 cache 690 allows the processor to perform at the speed of the L2 cache 690. If the L2 cache 690 does not contain a copy of the data or code, then the L2 cache 690 initiates a read cycle of the main memory. The L2 cache 690 then copies the data or code supplied by the main memory and allows the processor to continue execution. Data or code accessed by the processor is routed by the cache SRAM control circuit 712 from the SRAM data and code memory 710 for a cache hit and otherwise from the main memory in the case of a cache miss.
  • the L2 cache 690 intercepts input and output signals of the processor and determines whether these signals are to be applied to the main memory or retained local to the L2 cache 690.
  • the address designated by the processor propagates through the address buffers to the bus at the beginning of all cycles, whether the cycle is a read hit or a read miss cycle.
  • a look-aside cache strategy allows a cache to be removed or inserted into a computer system with no effects other than performance. Main memory transactions are initiated before a cache hit or miss is detected and are aborted if a cache hit occurs.
  • the lookaside L2 cache 690 initiates a main memory access synchronously with the SRAM data and code memory 710 access, rather than waiting for a cache miss before requesting a main memory access. In this manner, the main memory access time is reduced in the event of a cache miss.
  • a read miss cycle occurs when the contents of the cache tag memory 718 do not match the address directed by the processor.
  • L2 cache 690 output signals to the data bus DATA(63:0) are terminated and data from the main memory is allowed to be transferred to the processor.
  • in a write operation, information is written to the L2 cache 690 for immediate or subsequent modification of the main memory.
  • the L2 cache 690 implements a write-once protocol in which the processor must write to a main memory location once before the processor can maintain exclusive use of that location. During this write cycle, the address of the location is invalidated in all other caches. This exclusive use is terminated when another processor attempts to read that location from main memory. Whether the write to main memory is immediate or subsequent is determined by an implemented write policy.
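As an illustrative model only (the patent does not name its line states), the write-once behavior described above can be sketched as a per-line state machine in the style of Goodman's classic write-once protocol; the state names and return conventions here are assumptions for illustration:

```python
# Hypothetical sketch of the write-once policy described above.
INVALID, VALID, RESERVED, DIRTY = "invalid", "valid", "reserved", "dirty"

class WriteOnceLine:
    def __init__(self):
        self.state = INVALID

    def local_write(self):
        """Processor writes the line. Returns True if the write must
        also go to main memory (the required first write-through)."""
        if self.state in (INVALID, VALID):
            # First write: write through so other caches can invalidate
            # their copies; the line becomes exclusively held.
            self.state = RESERVED
            return True
        # Subsequent writes stay local to the cache.
        self.state = DIRTY
        return False

    def snooped_read(self):
        """Another processor reads the location: exclusivity ends.
        Returns True if modified data must be written back first."""
        writeback = self.state == DIRTY
        self.state = VALID
        return writeback
```

A first write therefore reaches main memory, while later writes are absorbed by the cache until another bus master reads the location.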
  • a write-through policy involves writing data from the processor to both the cache and main memory simultaneously to assure coherency.
  • under a write-back policy, data written into the cache by the processor is not written into main memory until a data line in the cache is to be replaced.
  • a line is the smallest division of a cache memory that is labeled using a distinct tag.
  • a data transfer cycle is terminated by an external assertion of either the ready RDY# or burst ready BRDY# input signals to the processor. Either or both of these signals are used to signify the presence of valid data and the end of the transfer cycle.
  • the absence of either the ready RDY# or burst ready BRDY# input signal causes the processor to enter a WAIT state until the rising edge where one of these signals is sampled active.
  • the SRAM data and code memory 710 is a small and fast memory for storing replicas of instructions and data that, if accessed directly from a main memory, would generally be accessed more slowly.
  • the SRAM data and code memory 710 is connected to receive control signals from the cache SRAM control circuit 712 and is connected to the data bus DATA(63:0) to form a data path for transferring data and code among the L2 cache 690, the L1 cache 605 and the main memory 601.
  • the SRAM data and code memory 710 is readable in the manner of a block of main memory at a predetermined main memory address.
  • the readable nature of memory in the L2 cache 690 allows test software to be written which compares information in the SRAM data and code memory 710, including states of a line of cache memory, to information held in the L1 cache 605.
  • the L1 cache 605 similarly allows data and instruction memory to be accessed through software calls to special registers and a data port that reads out data in the tag and data arrays of the L1 cache 605 to accommodate this comparison.
  • SRAM data and code memory 710 has 64K sets with four 16-byte ways (also called banks) per set. Each of the four ways is substantially identical.
  • the L2 cache 690 is a 4-way set-associative cache that uses four cache data RAM ways 730, 732, 734 and 736 in the SRAM data and code memory 710.
  • the cache SRAM control circuit 712 implements a control algorithm which controls transfer of data and code among the SRAM data and code memory 710, the L1 cache 605 and the main memory 601 and further controls transfer of addresses among the cache tag memory 718, the L1 cache 605 and the main memory 601.
  • the cache SRAM control circuit 712 implements the cache strategy of the L2 cache 690 and determines timing for reading to and from the SRAM data and code memory 710 and the cache tag memory 718.
  • the tag control circuit 714 determines cache status and initiates control of the cache in response to the cache status. In particular, tag control circuit 714 determines whether a cache hit or miss cycle is occurring and, for a cache hit cycle, whether the hit occurs on a modified cache line. In response to this determination, the tag control circuit 714 sets signals on the L2HIT#, L2HITM# and MISS lines. The tag control circuit 714 applies control lines to the cache SRAM control circuit 712 that select the designated way upon which the cache access is directed. The tag control circuit 714 initiates timing cycles, generates write enable signals, tag select signals and chip enable signals, controls transmission of these signals to the SRAM data and code memory 710 and applies the tag portion of the address to the cache tag memory 718. The tag control circuit 714 implements a line replacement algorithm. Furthermore, the tag control circuit 714 also determines which line is to be replaced using a designated replacement algorithm.
  • a main memory bus transaction occurs to an address that is replicated in the cache.
  • An extended address is applied to the tag control circuit 714 to determine whether the address is presently contained in the L2 cache 690 and, if so, to determine whether the addressed data is valid. If no valid line is contained in the cache tag memory 718, the tag control circuit 714 generates a signal on the MISS# line. If a modified line is contained in the cache tag memory 718, the tag control circuit 714 generates a signal on the HITM# line. If a valid line, which is not modified, is contained in the cache tag memory 718, the tag control circuit 714 generates a signal on the HIT# line. These signals are analyzed to determine whether an error occurs on a cache cycle.
  • the tag control circuit 714 controls a replacement algorithm to select which of the four cache data RAM ways 730, 732, 734 and 736 is replaced when a line is updated.
  • One replacement algorithm is a least recently used (LRU) algorithm, in which the cache SRAM control circuit 712 monitors accesses of each cache line of L2 cache 690, determines the order of accesses and classifies this order. The way that is classified as being accessed least recently is selected for replacement.
  • a related replacement algorithm is a pseudo-least recently used (pseudo-LRU) method, which utilizes three bits per cache line to monitor cache accesses.
  • a first bit is used to monitor accesses to the upper half ways 734 and 736 with respect to the lower half ways 730 and 732. The first bit is asserted, for example, when the most recent access strikes the upper half ways 734 and 736. The first bit is deasserted when the most recent access strikes the lower half ways 730 and 732.
  • a second bit is included to monitor accesses to the upper quarter way 736 with respect to the third quarter way 734. The second bit is asserted when an access strikes the upper quarter way 736 and is deasserted when an access strikes the third quarter way 734.
  • the second bit is updated only when the first bit is asserted.
  • a third bit is included to monitor accesses to the lower quarter way 730 with respect to the second quarter way 732. The third bit is asserted when an access strikes the second quarter way 732 and is deasserted when an access strikes the lower quarter way 730. The third bit is updated only when the first bit is deasserted.
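The three-bit scheme described in the preceding bullets can be sketched as a small model; way numbers 0 through 3 stand in for ways 730, 732, 734 and 736, and the victim-selection rule is a conventional pseudo-LRU reading of the text rather than something the text states explicitly:

```python
class PseudoLRU4:
    """Illustrative model of the 3-bit pseudo-LRU scheme for one
    4-way set. Bit and method names are assumptions."""

    def __init__(self):
        self.b_half = 0   # first bit: 1 = upper half (ways 2,3) hit last
        self.b_upper = 0  # second bit: 1 = way 3 hit more recently than way 2
        self.b_lower = 0  # third bit: 1 = way 1 hit more recently than way 0

    def touch(self, way):
        """Record an access to the given way (0..3)."""
        if way in (2, 3):
            self.b_half = 1
            self.b_upper = 1 if way == 3 else 0
        else:
            self.b_half = 0
            self.b_lower = 1 if way == 1 else 0

    def victim(self):
        """Pick the way to replace: the half not most recently used,
        then the less recently used way within that half."""
        if self.b_half:                   # upper half recent -> evict lower
            return 0 if self.b_lower else 1
        return 2 if self.b_upper else 3   # lower half recent -> evict upper
```

Three bits per set thus approximate LRU ordering over four ways without storing a full access history.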
  • An alternative replacement algorithm is a random replacement technique in which the way to be replaced is randomly selected.
  • a further alternative replacement algorithm is a not last used (NLU) method which utilizes a pointer that points to the most recently used way.
  • the pointer stores a copy of the number of the way that was last accessed for any particular set address.
  • the concept of the NLU method is that random selection is generally suitable but is improved by avoiding random selection of the last-used way for a set address since that way is the most likely way to be accessed.
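A minimal sketch of the NLU selection just described (illustrative only; the function name and signature are assumptions):

```python
import random

def nlu_select(num_ways, last_used_way, rng=random):
    """Not-last-used replacement: pick a way at random, but never
    the most recently used way for this set, since that way is the
    most likely to be accessed again."""
    candidates = [w for w in range(num_ways) if w != last_used_way]
    return rng.choice(candidates)
```

For a 4-way set this leaves three equally likely victims, retaining most of the simplicity of pure random replacement.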
  • the replacement algorithm select circuit 760 is connected to the tag control circuit 714 and supplies control signals that direct the tag control circuit 714 to replace a particular line.
  • the replacement algorithm select circuit 760 is connected to a replacement algorithm control register 762 via a control line.
  • the replacement algorithm control register 762 is positioned at an I/O location that is accessible to software and is programmed to select one of a plurality of replacement algorithms.
  • the replacement algorithm select circuit 760 is constructed from field programmable gate arrays (FPGA) as a portion of the secondary cache circuit.
  • the pseudo least recently used algorithm is implemented using software that controls the replacement algorithm select circuit 760.
  • the replacement algorithm select circuit 760 is connected to several LRU bit lines, receiving three lines that correspond to three input data bits and driving three output data lines. These lines are controlled so that, each time a cache line is replaced or read, the replacement algorithm select circuit 760 reads the LRU lines to determine which line to replace. The replacement algorithm select circuit 760 then replaces the specified line to implement the algorithm.
  • the cache decode circuit 716 is connected to the microprocessor 600 via extended address (EADS#) lines, byte enable lines and control lines.
  • the cache decode circuit 716 accesses memory address signals, data, byte enable signals and miscellaneous control signals for the L2 cache 690 at a rising edge of the clock which drives microprocessor timing.
  • the cache decode circuit 716 determines the type of cache cycle that is driving the caches.
  • the basic cache cycles include cache read hit, cache read miss, tag read and memory read cycles. In a cache read miss cycle, the microprocessor requests caching of a particular line in main memory 601 that is not currently in the cache.
  • the cache decode circuit 716 sends a signal to the SRAM data and code memory 710 via the cache SRAM control circuit 712 and sends a signal to the tag control circuit 714 to control storing of data from the main memory 601 into the SRAM data and code memory 710.
  • the tag control circuit 714 responds to the cache read miss signal by initiating timing cycles, generating write enable signals, tag select signals and chip enable signals, controlling transmission of these signals to the SRAM data and code memory 710 and applying the tag portion of the address to the cache tag memory 718.
  • the cache SRAM control circuit 712 awaits burst ready (BRDY#) signals from the microprocessor 600.
  • With each BRDY# signal, the cache SRAM control circuit 712 writes a timing signal to the particular way that is being accessed. The way that is accessed is selected in accordance with a particular replacement algorithm that is implemented in the tag control circuit 714.
  • the tag control circuit 714 applies control lines to the cache SRAM control circuit 712 that select the designated way upon which the cache access is directed. Thus, the same way in the SRAM data and code memory and in the cache tag memory 718 are accessed simultaneously.
  • the cache read hit cycle occurs as the microprocessor 600 requests a line that is cacheable and exists in the L2 cache 690.
  • the tag control circuit 714 responds to the cache read hit signal by simply acknowledging the read hit on the L2HIT# line.
  • the cache SRAM control circuit 712 responds to the cache read hit signal by activating the SRAM data and code memory 710 to transfer data back to the microprocessor 600 at high speed.
  • the cache tag memory 718 serves as a cache directory and contains a list of the main memory addresses of the data and instruction code that is stored in corresponding locations of the SRAM data and code memory 710. Thus for each location in SRAM data and code memory 710, not only is data or code stored in the SRAM data and code memory 710 but also an address is stored in the cache tag memory 718.
  • the cache tag memory 718 is most basically described as a static RAM (SRAM) memory where the actual main memory address is stored.
  • the cache tag memory 718 is addressed in a manner that is similar to page addressing in which a memory index is received that corresponds to upper address bits. Other data stored in the cache tag memory 718 includes lower address bits for a line.
  • all 32 bits of a microprocessor address are received by the tag control circuit 714 for storage in the cache tag memory 718. Least significant bits <3:0> are ignored, middle bits <13:4> are used as a compare address and most significant bits <31:14> designate an address.
  • the tag control circuit 714 compares the compare address that is applied to the L2 cache 690 to the compare address stored in the cache tag memory 718 at a particular address. If the compare addresses are the same, a cache hit results.
  • the cache tag memory 718 is connected to receive control signals from the tag control circuit 714. Like the SRAM data and code memory 710, the cache tag memory 718 is readable.
  • the L1 cache 605 similarly allows data and instruction tags to be accessed through software calls to special registers to provide for this comparison. Accordingly, both the data and code in the SRAM data and code memory 710 and the addresses in the cache tag memory 718 are readable in the manner of a block of main memory at a predetermined main memory address.
  • the cache tag memory 718 is a 4-way set associative cache so that each address has three portions including a tag, an index and a block offset.
  • the index represents a set address which selects one of the 64K sets in the SRAM data and code memory 710.
  • the tag is the remainder of the address generated by the processor after the set bits are removed.
  • Tag bits of the address are compared with tag bits of the cache tag memory 718 which are stored at the same address. If the tags match, then data or code stored within the SRAM data and code memory 710 at the same address represent the contents of the main memory location that is accessed. Because the L2 cache 690 is a 4-way set-associative cache, four cache tag RAM ways 740, 742, 744 and 746 are included in the cache tag memory 718.
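The tag-match described in the bullets above can be illustrated with a hedged sketch; the field widths below are assumptions drawn from the 16-byte lines and 64K sets mentioned earlier, not a definitive reading of the bit assignments:

```python
OFFSET_BITS = 4   # 16-byte lines, per the description above
INDEX_BITS = 16   # 64K sets, per the description above (assumed width)

def split_address(addr):
    """Split a 32-bit address into (tag, index, block offset)."""
    offset = addr & ((1 << OFFSET_BITS) - 1)
    index = (addr >> OFFSET_BITS) & ((1 << INDEX_BITS) - 1)
    tag = addr >> (OFFSET_BITS + INDEX_BITS)
    return tag, index, offset

def lookup(ways, addr):
    """4-way tag compare: `ways` maps way number -> {index: tag}.
    Returns the hit way number, or None on a miss."""
    tag, index, _ = split_address(addr)
    for way, tags in ways.items():
        if tags.get(index) == tag:
            return way
    return None
```

The index selects one set; the stored tag for each of the four ways in that set is compared against the tag bits of the address, and any match is a hit.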
  • the back invalidation buffer and control circuit 720 is used for replacement following a cache miss. Following a cache miss, the tag control circuit 714 determines which line is to be replaced using a designated replacement algorithm. The tag control circuit 714 selects the way to be replaced and directs that the data in the way to be replaced, since the data is valid data, be transferred to the back invalidation buffer and control circuit 720. At the same time, the tag control circuit 714 directs that the upper address bits of the replaced way are transferred from the cache tag memory 718 to the back invalidation buffer and control circuit 720. When the buffer in the back invalidation buffer and control circuit 720 is full, the circuit 720 asserts a hold signal to the microprocessor 600.
  • the microprocessor 600 subsequently returns a hold acknowledge signal, to which the back invalidation buffer and control circuit 720 responds by returning a start address and EADS# to the microprocessor 600. If the line stored in the back invalidation buffer and control circuit 720 is a modified line, the microprocessor 600 performs a write operation to memory to store the line, as modified. If the line is not modified, it is simply invalidated and not written to memory. In this manner, coherency between the L1 cache 605, the L2 cache 690 and the main memory 601 is maintained.
  • the buffer of the back invalidation buffer and control circuit 720 is only one line deep. Typically, the buffer would be several lines deep.
  • Signals on the HIT# and HITM# pins reflect results of L1 cache 605 operations that take place during a particular clock cycle, the inquire cycle, of the microprocessor.
  • An inquire cycle is initiated by the microprocessor to determine whether a line is present in the instruction or data cache and the state of the line.
  • An inquire cycle is driven to the microprocessor when a bus master other than the microprocessor initiates a read or write bus cycle.
  • An inquire cycle is driven to the microprocessor when the bus master initiates a read to determine whether the data cache 650 contains the latest information. If a snooped line is in the data cache 650 in a modified state, the microprocessor has the most recent information and is to schedule a writeback of data to the main memory 601.
  • An inquire cycle is driven to the microprocessor when the other bus master initiates a write to determine whether the microprocessor instruction cache 604 or data cache 650 contains the snooped line and to invalidate the line if the line is present.
  • the microprocessor sets a hit HIT# indication signal to designate the outcome of an inquire cycle. If an inquire cycle hits a valid line in either the instruction cache 604 or the data cache 650 of the L1 cache 605, the HIT# pin is asserted two clocks after an EADS# signal is sampled asserted. The EADS# signal indicates that a valid external address has been driven onto the processor address pins which are used for an inquire cycle. If the inquire cycle misses the L1 cache 605, the HIT# pin is deasserted two clocks after an EADS# signal is sampled asserted. The HIT# pin is modified only as a result of an inquire cycle and holds a value between inquire cycles.
  • the microprocessor sets a hit modified line HITM# indication signal to designate the outcome of an inquire cycle.
  • the HITM# indication signal is asserted after inquire cycles that resulted in a hit to a modified line in the data cache 650 of the L1 cache 605.
  • the HITM# bit is used to inhibit another bus master from accessing the data until the line is completely written back to the main memory 601.

Abstract

A computer system including a processor, a main memory and a cache memory uses tagging of various regions of memory to define and select caching properties of transfers between the processor and memory via the cache. The main memory contains not only standard random access memory (RAM) and read-only memory (ROM) but also memory-mapped input/output (I/O) sources. Tagging of the memory regions configures the regions for association with a particular set of caching properties. For example, a memory-mapped video I/O buffer may be tagged with a MM_IO_VBUF tag designating the caching properties of write-back cacheability with weak read/write ordering. Low-level operating system software, such as the Hardware Abstraction Layer (HAL) interface of the Windows NT™ operating system or device driver software, initializes the memory regions and the cache and makes symbolic associations between the memory regions and the cache. The cache, operating as directed by the memory tags, allows read and write operations that are used for performing various types of multimedia or signal processing operations including decompression, drawing operations, compression, mixing, and the like, which are performed on a virtually-cached multimedia drawing surface. The data for performing the multimedia or signal processing operations are either already located on the I/O surface or read from another storage location local to the processor or from an external processor, which is also cached and tagged as a special cached region. The processor executes operations acting on the cached data. When all operations are completely executed by the processor, only the cached memory regions are flushed using a flush instruction such as CFLSH[MM_IO_VBUF]. The flush instruction directs the cache to write back and invalidate the regions having the designated tag, here MM_IO_VBUF.

Description

A CACHE SYSTEM AND METHOD USING TAGGED CACHE LINES FOR MATCHING CACHE STRATEGY TO I/O APPLICATION
TECHNICAL FIELD
The present invention relates to computer systems. More specifically, the invention relates to a cache memory and method of operating a cache memory in a computer system.
BACKGROUND ART
A cache memory is a high-speed memory array which acts as a buffer between a CPU and a main memory of a computer system. A cache memory is effective because most software repeatedly executes the same instructions or manipulates data in the same spatial location in the memory. A cache memory increases effective memory speed by accessing slower main memory only once to fetch instructions or data with subsequent executions operating from the faster cache memory. A cache memory includes a data memory, a controller, and a directory.
The cache memory controls the CPU. If the cache memory includes a valid copy of data requested by the CPU during a CPU read cycle, then the cache allows the CPU to operate at the speed of the cache. If the cache does not contain a copy of the data, then the cache initiates a main memory read cycle, copies the data supplied by the main memory which the main memory indicates is valid using a ready signal, and allows the CPU to continue operating. Data applied to the CPU is routed by the cache controller either from the cache for the case of a cache hit or from buffers that isolate the main memory from the CPU for a cache miss. The cache hit results in no wait states. For the cache miss, the cache controller disables a memory ready signal so that the CPU inserts wait states. The cache memory performs this control function by intercepting all signals from the CPU, including input signals and output signals, thereby isolating the CPU from other devices.
Generally, when a cache miss occurs, not only the requested data bytes but also a complete cache line is read from the main memory into the cache memory in an operation called a cache line fill. A cache line typically includes 16 or 32 bytes. The cache line fill replaces a line in the cache. Data that is changed while in the cache is written to memory prior to replacement. One technique for writing changed cache data to memory is a write-through technique in which a write operation of the CPU causes a data transfer to main memory both for cache hits and cache misses. The write-through technique advantageously ensures consistency of the main memory, if only for single-processor systems. The write-through technique disadvantageously causes wait states.
An alternative technique for writing changed cache data to memory is a write-back cache technique in which all write operations are collected and the main memory is only updated upon receipt of an explicit update command or inquiry. One example of such a command is a software "write-back and invalidate data cache" (WBINVD) instruction. Another example of such a command is a hardware FLUSH signal which occurs implicitly as the result of a cache miss. The disadvantage of the write-back cache is that the exchange of cache lines has a longer duration because data must be written into memory before new data is read into the CPU.
A further alternative technique is a write-allocate technique in which the cache controller implements a write-allocate strategy by filling the cache space for a cache line with the data for the address to be written. Usually, the data is first written through to the memory. The cache controller then reads the applicable cache line with the entry to be updated into the cache. The data is written through so that the CPU can restart program execution. The cache controller independently performs the write-allocate in parallel with the CPU operation.
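The write-through and write-back techniques described in the preceding paragraphs can be contrasted in a toy model; this is an illustrative sketch of the two policies in general, not of any particular hardware described here:

```python
class ToyCache:
    """Toy model contrasting write-through and write-back handling
    of processor writes (illustrative only)."""

    def __init__(self, policy):
        assert policy in ("write-through", "write-back")
        self.policy = policy
        self.lines = {}    # cached data: address -> value
        self.dirty = set() # addresses modified but not yet in memory
        self.memory = {}   # stand-in for main memory

    def write(self, addr, data):
        self.lines[addr] = data
        if self.policy == "write-through":
            self.memory[addr] = data   # memory updated immediately
        else:
            self.dirty.add(addr)       # deferred until replacement/flush

    def evict(self, addr):
        if addr in self.dirty:         # write-back on line replacement
            self.memory[addr] = self.lines[addr]
            self.dirty.discard(addr)
        self.lines.pop(addr, None)
```

Write-through keeps memory consistent at the cost of a memory cycle per write; write-back defers that cost until a modified line is replaced.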
With the advent of multimedia computer systems, a wide variety of data flows, data volumes and data characteristics are handled by caches. Multimedia computers access data in the forms of video data flows, audio data flows, video frames and the like, in addition to conventional handling of data and instructions. Caching characteristics of the different data flows are substantially different. For example, video data may flow through a cache memory while undergoing recursive processing in which the same data is processed periodically. The video data is quickly replaced due to the large volume of data that is processed. The same data then must be brought back into the cache for recursive processing. This process constantly repeats, causing considerable thrashing of the cache memory.
Similarly, audio data typically involves smaller data volumes but is most efficiently handled using cache strategies that are different from strategies used for handling video data, conventional data or operating instructions.
What is needed is an apparatus and method for controlling the caching of different types of data which effectively considers the characteristics of the various data types.
DISCLOSURE OF INVENTION
In accordance with the present invention, a computer system including a processor, a main memory and a cache memory uses tagging of various regions of memory to define and select caching properties of transfers between the processor and memory via the cache. The main memory contains not only standard random access memory (RAM) and read-only memory (ROM) but also memory-mapped input/output (I/O) sources. Tagging of the memory regions configures the regions for association with a particular set of caching properties. The caching properties are tagged according to the character of the data, in terms of data locality, the volume and timing of data transmission, the type of processing performed on the data, and the like. For example, a memory-mapped video I/O buffer may be tagged with a
MM_IO_VBUF tag designating the caching properties of write-back cacheability with weak read/write ordering. Low-level operating system software, such as the Hardware Abstraction Layer (HAL) interface of the Windows NT™ operating system or device driver software, initializes the memory regions and the cache and makes symbolic associations between the memory regions and the cache. The cache, operating as directed by the memory tags, allows read and write operations that are used for performing various types of multimedia or signal processing operations including decompression, drawing operations, compression, mixing, and the like, which are performed on a virtually-cached multimedia drawing surface. The data for performing the multimedia or signal processing operations are either already located on the I/O surface or read from another storage location local to the processor or from an external processor, which is also cached and tagged as a special cached region. The processor executes operations acting on the cached data. When all operations are completely executed by the processor, only the cached memory regions are flushed using a flush instruction such as CFLSH[MM_IO_VBUF]. The flush instruction directs the cache to write back and invalidate the regions having the designated tag, here MM_IO_VBUF.
Many advantages are attained by the described cache system and operating method. One advantage is that various types of calculations are performed in place on memory-mapped I/O buffers very quickly, reducing the time of total operations and increasing computer system performance.
BRIEF DESCRIPTION OF DRAWINGS
The features of the described embodiments believed to be novel are specifically set forth in the appended claims. However, embodiments of the invention relating to both structure and method of operation, may best be understood by referring to the following description and accompanying drawings.
FIGURE 1 is a schematic block diagram illustrating a suitable computer system for incorporating a cache memory and a plurality of memory-mapped I/O surfaces in accordance with an embodiment of the present invention.
FIGURE 2 is a schematic block diagram showing a suitable virtual cache structure for implementing a cache system and method using tagged cache lines for matching cache policy to I/O application.
FIGURE 3 is a memory map which illustrates an embodiment of a main memory made up of a plurality of memory regions.
FIGURE 4 shows a storage for each memory region including memory region tag information.
FIGURE 5 is a schematic block diagram showing an organization of a low-level operating system for controlling the operations of the cache.
FIGURE 6 is a block diagram of a microprocessor connected to a secondary cache in accordance with the present invention.
FIGURE 7 is a block diagram which illustrates the functional and structural blocks of a secondary cache in accordance with the present invention.
MODES FOR CARRYING OUT THE INVENTION
The following sets forth a detailed description of the best contemplated mode for carrying out the invention. The description is intended to be illustrative of the invention and should not be taken to be limiting.
Referring to FIGURE 1, a schematic block diagram illustrates a suitable audio and video performance computer system 100 for incorporating a cache storage 102 and a plurality of memory-mapped I/O devices including an audio peripheral 104, a motion video peripheral 106, a small systems computer interface (SCSI) host bus adapter 108, a local area network (LAN) adapter 110, a serial input/output (SIO) PCI/ISA bridge 112, and a graphics adapter 114. The various memory-mapped I/O devices include regions of memory that are tagged to define region-specific cache operating characteristics, allowing for the cache strategy to be matched to the operations and functions performed by devices within the selected memory regions. Various other memory-mapped interface devices, which are known in the computing arts, may also be included in the computer system 100 and suitably tagged to define cache operating characteristics. The computer system 100 employs an architecture based on a bus, such as a PCI bus interface 120, and includes a central processing unit (CPU) 116 connected to the PCI bus interface 120 through a Host/PCI/Cache interface 118. The Host/PCI/Cache interface 118 includes the cache storage 102. The CPU 116 is connected to a main system storage 122 through the Host/PCI/Cache interface 118.
Referring to FIGURE 2, a schematic block diagram shows a suitable virtual cache structure for implementing a cache system and method using tagged cache lines for matching cache policy to I/O application. The illustrative system includes a processor 200, a memory management unit (MMU) 202, address buffers 204, data buffers 206, a system bus 208 and a virtual cache 210. The processor 200 accesses information in a main storage (not shown) connected to the system bus 208 via the virtual cache 210 and uses tagging of various regions of the main storage to define and select caching properties of transfers between the processor 200 and storage via the virtual cache 210.
The virtual cache 210 includes a cache memory 212 including a plurality of cache lines and a tag memory 214 having a plurality of tags. Each tag of the tag memory 214 corresponds to a cache line of the cache memory 212. The tags include a plurality of information fields defining a state of the corresponding cache line. A cache line is the smallest division of a cache memory for which there is a distinct tag. A cache line typically may include one or more words of cache memory. One of the information fields in the tag memory 214 is an address field 216 specifying a memory address in the memory space that is associated to the cache line at a particular time. Other information fields contain state information that is permanently stored in association with a memory region containing the memory address held in the address field 216. In this manner, the memory regions in the memory space include a tag storage holding state information pertinent to the memory region. This information is written to the tag memory 214 when the cache line is associated to the memory region.
Referring to FIGURE 3, a memory map 300 is shown which illustrates an embodiment of a main memory 301 made up of a plurality of memory regions. The main memory 301 contains not only standard random access memory (RAM) and read-only memory (ROM) but also memory-mapped input/output (I/O) sources. In the illustrative example, the main memory 301 includes an operating system region 302, a program memory region 304, an I/O buffer region 310, a video frame buffer region 312, an audio buffer region 314, a high-speed audio/video device buffer 316, and the like. The various memory regions include a storage holding tag information that is transferred to the tag memory 214 shown in FIGURE 2 when a cache line is associated to a memory region.
The tag information is advantageously selected to improve caching and memory access performance for the particular type of information that is transferred from the particular memory region.
Typically the processor 200 performs various operations that are specific to a particular memory region. Often the algorithms, types of data, volume of data, and data handling conditions and characteristics are specific to a particular memory region. For example, the processor 200 generally performs algorithmic and computation-intensive operations such as mixing, down sampling, up sampling, cancellation, and the like on data such as audio and video data. These operations are rarely performed on information in a region such as a program memory region but are often performed on data in memory-mapped regions connected to signal sources. In addition, audio and video data is generally accessed in large buffers of data, sometimes in buffers of one megabyte or larger. When data of this volume is accessed using a conventional cache structure, locality, the principle of cache operation, is quickly destroyed. Since data handling of the different memory regions has widely variable characteristics, the specific memory regions include tag information that is written to the tag memory 214 when a cache line is associated to the memory region, thereby directing the specified caching policies so that the cache line handles the data in a manner which is favorable to the data characteristics of the memory region. Examples of caching properties, attributes and strategies that are defined and selected by the tags in the tag memory 214 include read-cacheability, writeback cacheability with strong read/write ordering, and writeback cacheability with weak read/write ordering.
Referring to FIGURE 4, an illustrative storage 400 for each memory region includes memory region tag information in a cache technique field 410, a cache line allocate field 412, a replacement algorithm control field 414, and a special codes field 416. The storage 400 may be located at a memory-mapped address within the memory region or may be held elsewhere, for example, in an operating system region of memory.
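The per-region storage 400 of FIGURE 4 might be modeled in software as a small record of the four fields. This is an illustrative sketch only; the patent does not fix a bit-level layout, so the field encodings and names below are assumptions.

```python
from dataclasses import dataclass

# Hypothetical encodings; the document does not specify numeric values.
WRITE_BACK, WRITE_THROUGH, READ_THROUGH = 0, 1, 2   # cache technique field 410
LRU, PSEUDO_LRU, ROUND_ROBIN = 0, 1, 2              # replacement field 414

@dataclass
class RegionTagStorage:
    cache_technique: int        # field 410: caching technique for the region
    cache_line_allocate: int    # field 412: line retention/allocation control
    replacement_algorithm: int  # field 414: selects a replacement algorithm
    special_codes: int          # field 416: e.g. filter-storage activation

# Example: a memory-mapped video I/O buffer region tagged for
# write-back caching with pseudo-LRU replacement.
video_region = RegionTagStorage(
    cache_technique=WRITE_BACK,
    cache_line_allocate=0,
    replacement_algorithm=PSEUDO_LRU,
    special_codes=0,
)
```

When a cache line is associated to the region, a record of this shape would be copied into the tag memory 214 to direct the cache policy.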
The cache technique field 410 is used to specify the caching technique employed for a particular memory region. For example, write-back, write-through and read-through caching techniques may be specified.
A write-through cache always updates main memory during all write cycles. Three alternative write-through caching techniques include: (1) the cache is updated if a hit occurs and the write cycle is ignored by the cache if a miss occurs; (2) if a miss occurs, the line is automatically invalidated; (3) a line in the cache is written whether the cycle is a hit or a miss.

Using a write-back caching technique, the cache does not always update the main memory but reduces main memory accesses by writing data to the cache alone at a much higher speed than data is written to main memory. Data written into a cache by the CPU is not written into main memory until the data line in cache is to be replaced. A Dirty bit is set in the cache when a location has been written. If the Dirty bit is not set, the contents of that cache location can be destroyed. Advantages of write-back caching are: (1) write cycles are much faster, since a main memory cycle is otherwise the penalty for each CPU write operation; (2) some write cycles, such as loop counters and stack entries, are written to main memory only a fraction of the times that the CPU writes them; (3) the processor is on the main memory bus a lower percentage of the time. The third factor is highly important in tightly coupled multiprocessor systems where the speed of the entire system depends on the main memory bus traffic. A disadvantage of write-back caching is that the processor must handle conflicts between data written to cache but not to main memory, since an updated cache line must be transferred to main memory upon removal from the cache. Updating of the main memory is typically controlled using a "Dirty" bit for each cache line which designates whether a line in the cache is more current than its corresponding main memory location.
The Dirty bit allows only those lines in the cache that have been modified to be written to main memory, conserving a substantial amount of bus bandwidth by eliminating unnecessary write operations. Data which are written to the cache but not updated in the main memory are evicted to the main memory during replacement when the Dirty bit is set.

The cache line allocate field 412 associates a particular region of memory to a particular cache line in the cache memory 212. In various embodiments, the cache line allocate field 412 specifies operational characteristics such as characteristics controlling how the cache line is retained in the cache memory 212, controlling how the cache line is written back to memory, and preventing automatic allocation of cache lines to the memory region.
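The write-back behavior governed by the Dirty bit can be sketched as a minimal software model of a single cache line; names such as `WriteBackLine` are illustrative, not terms from the document.

```python
class WriteBackLine:
    """Minimal model of one write-back cache line guarded by a Dirty bit."""
    def __init__(self, addr, data, memory):
        self.addr, self.data, self.memory = addr, data, memory
        self.dirty = False

    def write(self, data):
        # A CPU write updates the cache only; no main memory cycle occurs.
        self.data = data
        self.dirty = True

    def evict(self):
        # On replacement the line is written back only if it was modified,
        # conserving bus bandwidth when the line is clean.
        if self.dirty:
            self.memory[self.addr] = self.data
        self.dirty = False

memory = {0x100: 1}
line = WriteBackLine(0x100, memory[0x100], memory)
line.write(2)            # fast: no main memory cycle
stale = memory[0x100]    # main memory is still stale after the write
line.evict()             # the set Dirty bit forces a write-back
```

The model shows why a clean line can simply be destroyed on replacement: only a set Dirty bit triggers a main memory cycle.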
The replacement algorithm control field 414 is used to select a particular cache replacement algorithm from a plurality of implemented replacement algorithms. Implemented replacement algorithms may include any suitable known replacement algorithms including, but not limited to, a least recently used (LRU) algorithm, a pseudo-least recently used algorithm, and a round robin replacement algorithm.
The special codes field 416 includes codes for selecting special cache operations. For example, a special code may be included for activating a filter storage operation in which a plurality of locations in the cache line are reserved for storing intermediate values determined in a filtering operation or to cache selected variables that are used repeatedly in a lengthy computation. These storage locations are allocated and fixed in the cache memory so that, as the computation executes through a data stream, cache misses are avoided since frequently-used intermediate calculation variables are fixed in the cache memory. By avoiding cache misses and reads from main memory, performance is enhanced over the course of a lengthy computation across a data stream.
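The effect of fixing filter intermediates in the cache can be illustrated by excluding "locked" ways from victim selection. This is a behavioral sketch under assumed names (`Way`, `choose_victim`); the document does not prescribe this mechanism in software.

```python
class Way:
    def __init__(self, tag, locked=False):
        self.tag = tag
        self.locked = locked   # True: reserved for fixed intermediate values

def choose_victim(ways):
    # Locked ways hold frequently-used intermediate filter variables and are
    # never candidates for replacement, so a long streaming computation
    # cannot push them out of the cache.
    for i, way in enumerate(ways):
        if not way.locked:
            return i
    raise RuntimeError("all ways are locked")

ways = [Way("filter_state", locked=True), Way("stream_data")]
victim = choose_victim(ways)   # only the unlocked streaming way is replaced
```

Because the streaming data always claims the unlocked way, the filter state survives the entire pass over the data stream.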
In one embodiment, the cache, operating as directed by the special codes field 416, stores data for read and write operations that are used for performing various types of multimedia or signal processing operations including decompression, drawing operations, compression, mixing, and the like, which are performed on a virtually-cached multimedia drawing surface. Data for performing the multimedia or signal processing operations are either located on the I/O surface or read from another storage location local to the processor or from an external processor, which is also cached and tagged as a special cached region. The processor executes operations acting on the cached data. When all operations are completely executed by the processor, only the cached memory regions are flushed using a flush instruction such as CFLSH[MM_IO_VBUF]. The flush instruction directs the cache to write back and invalidate the regions having the designated tag, here MM_IO_VBUF.
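The tag-selective flush can be sketched as a loop over the cached lines: lines whose region tag matches are written back (if dirty) and invalidated, while all other lines are left intact. The dictionary layout is an assumption for illustration, not the hardware structure.

```python
def cflsh(cache, memory, region_tag):
    # cache maps address -> (data, region_tag, dirty). Only lines tagged
    # with the designated region tag are written back and invalidated.
    for addr in list(cache):
        data, tag, dirty = cache[addr]
        if tag == region_tag:
            if dirty:
                memory[addr] = data   # write back the modified line
            del cache[addr]           # invalidate the line

memory = {}
cache = {
    0x00: (10, "MM_IO_VBUF", True),   # multimedia surface line, modified
    0x10: (20, "PROG", True),         # ordinary program line
}
cflsh(cache, memory, "MM_IO_VBUF")    # flush only the tagged region
```

After the call, the multimedia line has been written back and removed while the program line remains cached, mirroring the selective behavior of CFLSH[MM_IO_VBUF].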
A conventional cache has reduced performance during handling of large streams of data for signal processing applications such as multimedia applications. Reduced performance results because the large streams of data disrupt the locality principle upon which caches operate, causing thrashing as the same data is repeatedly inserted and removed from the cache.
The illustrative caching system adapts the cache to implement caching policies that are advantageous to the data handling of a particular memory region so that locality is maintained even during handling of large amounts of data, large data streams, and lengthy computations.
Referring to FIGURE 5, a schematic block diagram shows an organization of a low-level operating system for controlling the operations of the cache. Low-level operating system software, such as the Hardware Abstraction Layer (HAL) interface of the Windows NT™ operating system or device driver software such as a video or multimedia device driver, initializes the memory regions and the cache and makes symbolic associations between the memory regions and the cache. The low-level operating system 500 includes a hardware abstraction level (HAL) 510, a processor/machine 512, and, above the HAL 510, various drivers. The drivers include an audio driver 514, a video driver 516, and a signal processing filter 518. Each of these operating system structures operates using control information accessed from tags in the various memory regions, achieving a large degree of flexibility in the operation of the operating system. Regions of system memory are dedicated to particular applications and the cache is configured to most efficiently handle data for the particular applications. Drivers, such as the audio driver 514 and the video driver 516, initialize the cache as data is accessed from a specific memory region. The driver accesses the tagged information from the memory region, uses the information to determine system capabilities, and adapts the structure and operations of a cache line according to the tagged information. The information may include, for example, a base memory address, a list of attributes, and user control information. The drivers then query the hardware abstraction level 510, using the tagged information to specify resource requirements. Tagging of the memory regions configures the regions for association with a particular set of caching properties. For example, a memory-mapped video I/O buffer may be tagged with a MM_IO_VBUF tag designating the caching properties of write-back cacheability with weak read/write ordering.
The hardware abstraction level 510 allocates resources, for example, designating a cache line for operation as an audio buffer and another cache line for operation as a frame buffer, and designating the length of the buffers. In some embodiments, the tagged memory information specifies latency information for usage by the drivers in determining appropriate buffer sizes.

Referring to FIGURE 6, a superscalar X86 microprocessor 600 executes the X86 instruction set. Microprocessor 600 is connected to physically addressed external memory 601 via a 486 XL bus or other conventional microprocessor bus. Microprocessor 600 includes a primary (L1) cache 605 incorporating an instruction cache 604 and a data cache 650. The instruction cache 604 is connected to byte queue 606 which is connected to instruction decoder 608. Instruction decoder 608 is connected to the RISC core of the microprocessor 600. The RISC core includes register file 612 and reorder buffer 614 as well as a variety of functional units such as shift unit 630 (SHF), arithmetic logic units 631, 632 (ALU0 and ALU1), special register block 633 (SRB), load/store unit 634 (LSSEC), branch section 635 (BRNSEC), and floating point unit 636 (FPU). The microprocessor 600 is a single VLSI chip circuit. A secondary (L2) cache 690 is supplied in a circuit that is connected to the microprocessor VLSI chip and is connected to the primary cache 605 of the microprocessor chip.
The L1 cache 605 is organized as a split data and instruction cache including an 8K byte 4-way set-associative data cache 650 and a 16K byte 4-way set-associative instruction cache 604. Both the data cache 650 and the instruction cache 604 are virtually addressed and virtually tagged to avoid translation of addresses before a cache access. A single set of physical tags is shared by both the instruction cache 604 and the data cache 650. When a change is made to the virtual-to-physical mapping of the L1 cache 605, the virtual cache tags are invalidated. The instruction cache is organized as 1024 blocks of 16 predecoded X86 instruction bytes. The L1 cache 605 has an address tag array that is dual-ported and contains 1024 entries, each including a 20-bit linear address tag, a single valid bit designating valid status for the entire block and 16 individual byte-valid bits, one for each of the sixteen corresponding instruction bytes within the instruction store array. The data cache 650 is dual-ported and connected to dual load/store units 634 to allow two accesses to the 8K byte data cache 650 to be performed in a single clock cycle so long as no two accesses are to the same bank. Dual load/store units 634 are implemented to address a high incidence of load and store operations that typically result due to the small number of registers provided in an x86 architecture. Ports (not shown) of the data cache 650 support one or two load operations or, alternatively, one or two store operations.
The data cache 650 includes four banks and has two access ports, one for each load/store unit 634. Data accesses to different banks proceed in parallel although two accesses to the same cache bank occur in the same cycle if both accesses are applied to the same cache line.
Data cache 650 is connected to load/store functional unit 634 of the RISC core and with intraprocessor address and data (IAD) bus 602. Instruction cache 604 is also connected with IAD bus 602. The instruction cache 604 and data cache 650, in combination, form the primary (L1) cache 605 on the microprocessor chip. The primary (L1) cache 605 is connected to the secondary (L2) cache 690 which is not fabricated on the VLSI microprocessor chip. Instruction cache 604 and data cache 650 are both linearly addressable caches. Instruction cache 604 and data cache 650 are physically separate, however, both caches are organized using the same architecture. Microprocessor 600 also includes a physical tag circuit 662. Physical tag circuit 662 interacts with both instruction cache 604 and data cache 650 via the IAD bus.
RISC core 610 includes A and B operand buses 616, type and dispatch (TAD) bus 618 and result bus 640 which are connected to the functional units, as well as displacement and instruction load/store (INLS) bus 619 which is connected between instruction decoder 608 and load/store unit 634. A and B operand buses 616 are also connected to register file 612 and reorder buffer 614. TAD bus 618 is also connected to instruction decoder 608. Result bus 640 is also connected to reorder buffer 614. Additionally, branch section 635 is connected to reorder buffer 614, instruction decoder 608 and instruction cache 604 via Xtarget bus 603. A and B operand buses 616 include four parallel 41-bit wide A operand buses and four parallel 41-bit wide B operand buses as well as four parallel 12-bit wide A tag buses, four parallel 12-bit wide B tag buses, a 12-bit wide A tag valid bus, a 12-bit wide B tag valid bus, four 4-bit wide destination tag buses and four 8-bit wide opcode buses. Type and dispatch bus 618 includes four 3-bit wide type code buses and one 4-bit wide dispatch bus. Displacement and INLS bus 619 includes two 32-bit wide displacement buses and two 8-bit wide INLS buses.
Microprocessor 600 also includes memory management unit (MMU) 664 and bus interface unit 660 (BIU). The MMU 664 is connected with the IAD bus and physical translation circuit 662. Bus interface unit 660 is connected to physical translation circuit 662, data cache 650 and IAD bus 602 as well as an external microprocessor bus such as the 486 XL bus. Microprocessor 600 executes computer programs which include sequences of instructions. Computer programs are typically stored on a hard disk, floppy disk or other nonvolatile storage media, such as a BIOS ROM, which are located in the computer system. When the program is run, the program is loaded from the storage media into main memory 601. Once the instructions of the program and associated data are in main memory 601, individual instructions are prepared for execution and ultimately executed by microprocessor 600.
After being stored in main memory 601, the instructions are passed via bus interface unit 660 to instruction cache 604, where the instructions are temporarily held. The x86 instructions are provided by instruction cache 604 to instruction decoder 608 using a fetch program counter value which is generated by instruction cache 604.
Referring to FIGURE 7, the secondary (L2) cache 690 includes a static random access memory (SRAM) data and code memory 710, a cache SRAM control circuit 712, a tag control circuit 714, a cache decode circuit 716, a cache tag memory 718, a back invalidation buffer and control circuit 720 and a replacement algorithm select circuit 760. Like the L1 cache 605, the L2 cache 690 is organized as a split data and instruction cache. The line size of the L2 cache 690 is 16 bytes, matching the line size of the L1 cache 605. Both the L1 cache 605 and the L2 cache 690 maintain coherency line by line through snooping and through inclusion of the contents of the L1 cache 605 in the L2 cache 690 storage. L1 cache 605 invalidations are limited through usage of inclusion. According to the inclusion method, the L2 cache 690 directory is used to prescreen invalidation cycles before the cycles are passed to the smaller L1 cache. Accordingly, the contents of the L1 cache are always a subset of the contents of the L2 cache.
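The inclusion-based prescreening of invalidation cycles can be sketched with two directory sets; the function name and set representation are assumptions for illustration.

```python
def back_invalidate(addr, l2_dir, l1_dir):
    # With inclusion, L1 contents are always a subset of L2 contents, so the
    # L2 directory can prescreen invalidation cycles: an address absent from
    # the L2 directory cannot be in L1, and no L1 cycle is consumed.
    if addr not in l2_dir:
        return False               # filtered out by the L2 directory
    l2_dir.discard(addr)
    l1_dir.discard(addr)           # only now is the cycle passed to L1
    return True

l2_dir = {0x100, 0x200}
l1_dir = {0x100}
filtered = back_invalidate(0x300, l2_dir, l1_dir)   # prescreened away
passed = back_invalidate(0x100, l2_dir, l1_dir)     # reaches the L1 cache
```

Because most bus addresses miss the L2 directory, the smaller L1 cache sees only the few invalidation cycles that could actually concern it.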
Characteristics of the L1 cache 605 greatly influence the operations of the L2 cache 690. For example, the 8K byte 4-way set-associative data cache 650 and the 16K byte 4-way set-associative instruction cache 604 greatly reduce the number of external read cycles seen by the L2 cache 690 and the main memory 601. Accordingly, the processor external bus cycles are more likely to be write cycles than read cycles.
The L2 cache 690 is a split, 4-way set-associative look-aside cache. The L2 cache 690 operates under the control of the processor. The L2 cache 690 sets a read condition as a default condition for the next processor cycle. The cycle therefore begins by enabling data RAM output signals onto the data bus. These RAM output signals are disabled under two conditions. First, when the processor W/R# output signal is high, indicating a requested write cycle. Second, during a read cycle after the cache-tag RAM detects a miss or a cycle to a noncacheable address. In the case of a cache hit, the cache-tag RAM and the tag control circuit 714 signify that the cache data is ready using a burst ready input signal so that the processor can continue the write operation. In the case of a cache miss, the processor is stalled and the cache data RAM is disabled from placing data onto the data bus. If the L2 cache 690 contains a valid copy of data or code requested by the processor during a processor read cycle, then the L2 cache 690 allows the processor to perform at the speed of the L2 cache 690. If the L2 cache 690 does not contain a copy of the data or code, then the L2 cache 690 initiates a read cycle of the main memory. The L2 cache 690 then copies the data or code supplied by the main memory and allows the processor to continue execution. Data or code accessed by the processor is routed by the cache SRAM control circuit 712 from the SRAM data and code memory 710 for a cache hit and otherwise from the main memory in the case of a cache miss. Accordingly, the L2 cache 690 intercepts input and output signals of the processor and determines whether these signals are to be applied to the main memory or retained local to the L2 cache 690. For a look-aside cache, the address designated by the processor propagates through the address buffers to the bus at the beginning of all cycles, whether the cycle is a read hit or a read miss cycle.
A look-aside cache strategy allows a cache to be removed or inserted into a computer system with no effects other than performance. Main memory transactions are initiated before a cache hit or miss is detected and are aborted if a cache hit occurs. The lookaside L2 cache 690 initiates a main memory access synchronously with the SRAM data and code memory 710 access, rather than waiting for a cache miss before requesting a main memory access. In this manner, the main memory access time is reduced in the event of a cache miss.
A read miss cycle occurs when the contents of the cache tag memory 718 do not match the address directed by the processor. In this case, L2 cache 690 output signals to the data bus DATA(63:0) are terminated and data from the main memory is allowed to be transferred to the processor.
In a write operation, information is written to the L2 cache 690 for immediate or subsequent modification of the main memory. The L2 cache 690 implements a write-once protocol in which the processor must write to a main memory location once before the processor can maintain exclusive use of that location. During this write cycle, the address of the location is invalidated in all other caches. This exclusive use is terminated when another processor attempts to read that location from main memory. Whether the write to main memory is immediate or subsequent is determined by an implemented write policy. A write- through write policy involves writing data from the processor to both the cache and main memory simultaneously to assure coherency. In a copy-back write policy, data written into cache by the processor is not written into main memory until a data line in cache is to be replaced. A line is the smallest division of a cache memory that is labeled using a distinct tag.
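The write-once behavior can be sketched as a small state model: the first write to a valid line goes through to main memory (invalidating the location in other caches) and makes the line exclusive; later writes stay in the cache. The state names VALID, RESERVED and DIRTY are assumptions borrowed from common write-once descriptions, not terms defined in this document.

```python
VALID, RESERVED, DIRTY = "valid", "reserved", "dirty"

def write_once(state, write_to_memory):
    # First write to a shared/valid line is written through to main memory,
    # after which the line is held exclusively (RESERVED). Every subsequent
    # write is cache-only and marks the line DIRTY.
    if state == VALID:
        write_to_memory()   # the single required write-through
        return RESERVED
    return DIRTY

memory_writes = []
state = write_once(VALID, lambda: memory_writes.append("first write"))
state = write_once(state, lambda: memory_writes.append("second write"))
```

Only the first write reaches main memory; the second is absorbed by the cache, which is the point of the write-once protocol.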
A data transfer cycle is terminated by an external assertion of either the ready RDY# or burst ready BRDY# input signals to the processor. Either or both of these signals are used to signify the presence of valid data and the end of the transfer cycle. The absence of either the ready RDY# or burst ready BRDY# input signal causes the processor to enter a WAIT state until the rising edge where one of these signals is sampled active.
The SRAM data and code memory 710 is a small and fast memory for storing replicas of instructions and data that, if accessed directly from a main memory, would generally be accessed more slowly. The SRAM data and code memory 710 is connected to receive control signals from the cache SRAM control circuit 712 and is connected to the data bus DATA(63:0) to form a data path for transferring data and code among the L2 cache 690, the L1 cache 605 and the main memory 601. The SRAM data and code memory 710 is readable in the manner of a block of main memory at a predetermined main memory address. The readable nature of memory in the L2 cache 690 allows test software to be written which compares information in the SRAM data and code memory 710, including states of a line of cache memory, to information held in the L1 cache 605. The L1 cache 605 similarly allows data and instruction memory to be accessed through software calls to special registers and a data port to read out data in the tag and data arrays of the L1 cache 605 to accommodate this comparison. SRAM data and code memory 710 has 64K sets with four 16-byte ways (also called banks) per set. Each of the four ways is substantially identical. The L2 cache 690 is a 4-way set-associative cache that uses four cache data RAM ways 730, 732, 734 and 736 in the SRAM data and code memory 710. The cache SRAM control circuit 712 implements a control algorithm which controls transfer of data and code among the SRAM data and code memory 710, the L1 cache 605 and the main memory 601 and further controls transfer of addresses among the cache tag memory 718, the L1 cache 605 and the main memory 601. The cache SRAM control circuit 712 implements the cache strategy of the L2 cache 690 and determines timing for reading to and from the SRAM data and code memory 710 and the cache tag memory 718.
The tag control circuit 714 determines cache status and initiates control of the cache in response to the cache status. In particular, tag control circuit 714 determines whether a cache hit or miss cycle is occurring and, for a cache hit cycle, whether the hit occurs on a modified cache line. In response to this determination, the tag control circuit 714 sets signals on the L2HIT#, L2HITM# and MISS lines. The tag control circuit 714 applies control lines to the cache SRAM control circuit 712 that select the designated way upon which the cache access is directed. The tag control circuit 714 initiates timing cycles, generates write enable signals, tag select signals and chip enable signals, controls transmission of these signals to the SRAM data and code memory 710 and applies the tag portion of the address to the cache tag memory 718. The tag control circuit 714 implements a line replacement algorithm. Furthermore, the tag control circuit 714 also determines which line is to be replaced using a designated replacement algorithm.
During a snoop hit cycle, a main memory bus transaction occurs to an address that is replicated in the cache. An extended address is applied to the tag control circuit 714 to determine whether the address is presently contained in the L2 cache 690 and, if so, to determine whether the addressed data is valid. If no valid line is contained in the cache tag memory 718, the tag control circuit 714 generates a signal on the MISS# line. If a modified line is contained in the cache tag memory 718, the tag control circuit 714 generates a signal on the HITM# line. If a valid line, which is not modified, is contained in the cache tag memory 718, the tag control circuit 714 generates a signal on the HIT# line. These signals are analyzed to determine whether an error occurs on a cache cycle.
The tag control circuit 714 controls a replacement algorithm to select which of the four cache data RAM ways 730, 732, 734 and 736 is replaced when a line is updated. One replacement algorithm is a least recently used (LRU) algorithm, in which the cache SRAM control circuit 712 monitors accesses of each cache line of L2 cache 690, determines the order of accesses and classifies this order. The way that is classified as being accessed least recently is selected for replacement.
A related replacement algorithm is a pseudo-least recently used (pseudo-LRU) method, which utilizes three bits per cache line to monitor cache accesses. A first bit is used to monitor accesses to the upper half ways 734 and 736 with respect to the lower half ways 730 and 732. The first bit is asserted, for example, when the most recent access strikes the upper half ways 734 and 736. The first bit is deasserted when the most recent access strikes the lower half ways 730 and 732. A second bit is included to monitor accesses to the upper quarter way 736 with respect to the third quarter way 734. The second bit is asserted when an access strikes the upper quarter way 736 and is deasserted when an access strikes the third quarter way 734. The second bit is updated only when the first bit is asserted. A third bit is included to monitor accesses to the lower quarter way 730 with respect to the second quarter way 732. The third bit is asserted when an access strikes the second quarter way 732 and is deasserted when an access strikes the lower quarter way 730. The third bit is updated only when the first bit is deasserted.
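The three-bit scheme above can be modeled as a small decision tree over four ways. This is one common realization of tree pseudo-LRU, offered as an illustration; exact update conventions in a given hardware implementation may differ.

```python
class TreePLRU:
    """Three-bit tree pseudo-LRU over four ways (an illustrative model)."""
    def __init__(self):
        self.b0 = 0   # 1: most recent access struck the upper half (ways 2, 3)
        self.b1 = 0   # 1: most recent upper-half access struck way 3
        self.b2 = 0   # 1: most recent lower-half access struck way 1

    def access(self, way):
        if way >= 2:
            self.b0 = 1
            self.b1 = 1 if way == 3 else 0   # updated only on upper-half hits
        else:
            self.b0 = 0
            self.b2 = 1 if way == 1 else 0   # updated only on lower-half hits

    def victim(self):
        # Walk the tree away from the recently used half and quarter.
        if self.b0:                      # upper half recent -> evict lower
            return 0 if self.b2 else 1
        return 2 if self.b1 else 3       # lower half recent -> evict upper

plru = TreePLRU()
for way in (0, 1, 2, 3):
    plru.access(way)
# After touching ways 0, 1, 2, 3 in order, the tree points at way 0.
```

With three bits per set instead of a full access ordering, the scheme approximates LRU while remaining cheap to store and update.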
An alternative replacement algorithm is a random replacement technique in which the way to be replaced is randomly selected.
A further alternative replacement algorithm is a not last used (NLU) method which utilizes a pointer that points to the most recently used way. The pointer stores a copy of the number of the way that was last accessed for any particular set address. The concept of the NLU method is that random selection is generally suitable but is improved by avoiding random selection of the last-used way for a set address since that way is the most likely way to be accessed.
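The NLU selection can be sketched in a few lines: choose randomly among all ways except the most recently used one. The function name is an assumption for illustration.

```python
import random

def nlu_victim(num_ways, last_used):
    # Not-last-used: pick randomly among all ways except the most recently
    # accessed one, which is the way most likely to be needed again.
    candidates = [way for way in range(num_ways) if way != last_used]
    return random.choice(candidates)

# Sample the selector repeatedly; the last-used way is never chosen.
victims = {nlu_victim(4, last_used=2) for _ in range(200)}
```

This keeps the hardware cost of random selection while removing its worst case, eviction of the line about to be reused.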
The efficiency and performance of a cache utilizing different cache policies typically is highly dependent on the software that is run. It is advantageous for various different policies to be implemented in a cache so that a maximally efficient cache policy is selected for running a particular software routine. It is additionally advantageous for the cache to automatically select a maximally advantageous cache policy during the execution of various software routines. The replacement algorithm select circuit 760 is connected to the tag control circuit 714 and supplies control signals that direct the tag control circuit 714 to replace a particular line. The replacement algorithm select circuit 760 is connected to a replacement algorithm control register 762 via a control line. The replacement algorithm control register 762 is positioned at an I/O location that is accessible to software and is programmed to select one of a plurality of replacement algorithms. In one embodiment, the replacement algorithm select circuit 760 is constructed from field programmable gate arrays (FPGA) as a portion of the secondary cache circuit.
In one embodiment, the pseudo-least recently used algorithm is implemented using software that controls the replacement algorithm select circuit 760. The replacement algorithm select circuit 760 is connected to several LRU bit lines, three corresponding to input data bits and three corresponding to output data bits. These lines are controlled so that, each time a cache line is replaced or read, the replacement algorithm select circuit 760 reads the LRU lines to determine which line to replace. The replacement algorithm select circuit 760 replaces the specified line to implement the algorithm.
The cache decode circuit 716 is connected to the microprocessor 600 via extended address (EADS#) lines, byte enable lines and control lines. The cache decode circuit 716 accesses memory address signals, data, byte enable signals and miscellaneous control signals for the L2 cache 690 at a rising edge of the clock which drives microprocessor timing. The cache decode circuit 716 determines the type of cache cycle that is driving the caches. The basic cache cycles include cache read hit, cache read miss, tag read and memory read cycles. In a cache read miss cycle, the microprocessor requests caching of a particular line in main memory 601 that is not currently in the cache. For example in the cache read miss cycle, the cache decode circuit 716 sends a signal to the SRAM data and code memory 710 via the cache SRAM control circuit 712 and sends a signal to the tag control circuit 714 to control storing of data from the main memory 601 into the SRAM data and code memory 710. The tag control circuit 714 responds to the cache read miss signal by initiating timing cycles, generating write enable signals, tag select signals and chip enable signals, controlling transmission of these signals to the SRAM data and code memory 710 and applying the tag portion of the address to the cache tag memory 718. The cache SRAM control circuit 712 awaits burst ready (BRDY#) signals from the microprocessor 600. With each BRDY# signal, the cache SRAM control circuit 712 writes a timing signal to the particular way that is being accessed. The way that is accessed is selected in accordance with a particular replacement algorithm that is implemented in the tag control circuit 714. The tag control circuit 714 applies control lines to the cache SRAM control circuit 712 that select the designated way upon which the cache access is directed. Thus, the same way in the SRAM data and code memory 710 and in the cache tag memory 718 is accessed simultaneously.
In another example, the cache read hit cycle occurs as the microprocessor 600 requests a line that is cacheable and is present in the L2 cache 690. The tag control circuit 714 responds to the cache read hit signal by simply acknowledging the read hit on the L2HIT# line. The cache SRAM control circuit 712 responds to the cache read hit signal by activating the SRAM data and code memory 710 to transfer data back to the microprocessor 600 at high speed.
The cache tag memory 718 serves as a cache directory and contains a list of the main memory addresses of the data and instruction code that is stored in corresponding locations of the SRAM data and code memory 710. Thus for each location in SRAM data and code memory 710, not only is data or code stored in the SRAM data and code memory 710 but also an address is stored in the cache tag memory 718. The cache tag memory 718 is most basically described as a static RAM (SRAM) memory where the actual main memory address is stored. The cache tag memory 718 is addressed in a manner that is similar to page addressing in which a memory index is received that corresponds to upper address bits. Other data stored in the cache tag memory 718 includes lower address bits for a line. In one embodiment, all 32 bits of a microprocessor address are received by the tag control circuit 714 for storage in the cache tag memory 718. Least significant bits <3:0> are ignored, middle bits <13:4> are used as compare address and most significant bits <31:14> designate an address. The tag control circuit 714 compares the compare address that is applied to the L2 cache 690 to the compare address that is stored in the cache tag memory 718 at a particular address. If the compare addresses are the same, a cache hit results. The cache tag memory 718 is connected to receive control signals from the tag control circuit 714. Like the SRAM data and code memory 710, the cache tag memory 718 is readable. The L1 cache 605 similarly allows data and instruction tags to be accessed through software calls to special registers to provide for this comparison. Accordingly, both the data and code in the SRAM data and code memory 710 and the addresses in the cache tag memory 718 are readable in the manner of a block of main memory at a predetermined main memory address.
The cache tag memory 718 supports a 4-way set-associative cache, so that each address has three portions: a tag, an index and a block offset. The index represents a set address which selects one of the 64K sets in the SRAM data and code memory 710. The tag is the remainder of the address generated by the processor after the set bits are removed. Tag bits of the address are compared with tag bits of the cache tag memory 718 which are stored at the same address. If the tags match, then the data or code stored within the SRAM data and code memory 710 at the same address represents the contents of the main memory location that is accessed. Because the L2 cache 690 is a 4-way set-associative cache, four cache tag RAM ways 740, 742, 744 and 746 are included in the cache tag memory 718.
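The 4-way tag comparison can be modeled with a small sketch of one set; the class and method names are illustrative, with the four ways mirroring the four cache tag RAM ways 740, 742, 744 and 746 described above:

```python
class FourWaySet:
    """Sketch of one set of a 4-way set-associative tag lookup.
    Each way holds the stored tag for its line, or None if invalid."""

    def __init__(self):
        self.ways = [None] * 4   # one stored tag per way

    def fill(self, way, tag):
        """Record a line's tag in the chosen way (e.g. after a fill)."""
        self.ways[way] = tag

    def lookup(self, tag):
        """Compare the incoming tag against all four ways in parallel
        (sequentially here); return the hit way number, or None on miss."""
        for way, stored in enumerate(self.ways):
            if stored is not None and stored == tag:
                return way
        return None
```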
The back invalidation buffer and control circuit 720 is used for replacement following a cache miss. Following a cache miss, the tag control circuit 714 determines which line is to be replaced using a designated replacement algorithm. The tag control circuit 714 selects the way to be replaced and, because the data in that way is valid data, directs that the data be transferred to the back invalidation buffer and control circuit 720. At the same time, the tag control circuit 714 directs that the upper address bits of the replaced way be transferred from the cache tag memory 718 to the back invalidation buffer and control circuit 720. When the buffer in the back invalidation buffer and control circuit 720 is full, the circuit 720 asserts a hold signal to the microprocessor 600. The microprocessor 600 subsequently returns a hold acknowledge signal, to which the back invalidation buffer and control circuit 720 responds by returning a start address and EADS# to the microprocessor 600. If the line stored in the back invalidation buffer and control circuit 720 is a modified line, the microprocessor 600 performs a write operation to memory to store the line, as modified. If the line is not modified, it is simply invalidated and not written to memory. In this manner, coherency among the L1 cache 105, the L2 cache 690 and the main memory 601 is maintained.
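The modified/unmodified decision at the end of the replacement flow can be sketched as follows; the dict-based line representation and state strings are hypothetical, invented for illustration:

```python
def retire_replaced_line(line):
    """Illustrative handling of a line held in the back invalidation
    buffer, per the flow above: a modified line is written back to main
    memory, while an unmodified line is simply invalidated with no
    memory traffic. `line` is a hypothetical dict with a 'modified' flag."""
    if line["modified"]:
        line["state"] = "written_back"   # write operation to memory
    else:
        line["state"] = "invalidated"    # no write needed
    line["valid"] = False                # line leaves the cache either way
    return line["state"]
```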
In one embodiment, the buffer of the back invalidation buffer and control circuit 720 is only one line deep. Typically, the buffer would be several lines deep.
Signals on the HIT# and HITM# pins reflect results of L1 cache 605 operations that take place during a particular clock cycle, the inquire cycle, of the microprocessor. An inquire cycle is initiated by the microprocessor to determine whether a line is present in the instruction or data cache, and the state of that line. An inquire cycle is driven to the microprocessor when a bus master other than the microprocessor initiates a read or write bus cycle. When the other bus master initiates a read, the inquire cycle determines whether the data cache 650 contains the latest information. If a snooped line is in the data cache 650 in a modified state, the microprocessor has the most recent information and is to schedule a writeback of the data to the main memory 601. When the other bus master initiates a write, the inquire cycle determines whether the microprocessor instruction cache 604 or data cache 650 contains the snooped line, and invalidates the line if the line is present.
The microprocessor sets the HIT# indication signal to designate the outcome of an inquire cycle. If an inquire cycle hits a valid line in either the instruction cache 604 or the data cache 650 of the L1 cache 605, the HIT# pin is asserted two clocks after the EADS# signal is sampled asserted. The EADS# signal indicates that a valid external address has been driven onto the processor address pins for use in an inquire cycle. If the inquire cycle misses the L1 cache 605, the HIT# pin is deasserted two clocks after the EADS# signal is sampled asserted. The HIT# pin is modified only as a result of an inquire cycle and holds its value between inquire cycles.
The microprocessor sets the hit modified line HITM# indication signal to designate the outcome of an inquire cycle. The HITM# indication signal is asserted after inquire cycles that result in a hit to a modified line in the data cache 650 of the L1 cache 605. The HITM# signal is used to inhibit another bus master from accessing the data until the line is completely written back to the main memory 601.
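The inquire-cycle outcomes above can be sketched as a simple snoop check; the function name, the dict mapping line addresses to MESI-style state strings, and the booleans standing in for the active-low HIT# and HITM# pins are all illustrative assumptions:

```python
def inquire(cache_lines, addr):
    """Sketch of an inquire-cycle outcome: HIT# is asserted (modeled
    as True) when the snooped line is valid in the L1 cache, and HITM#
    is asserted only when that line is in the modified state.
    `cache_lines` is a hypothetical dict of line address -> state."""
    state = cache_lines.get(addr)                     # None = not present
    hit = state in ("modified", "exclusive", "shared")
    hitm = state == "modified"                        # writeback required
    return hit, hitm
```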
While the invention has been described with reference to various embodiments, it will be understood that these embodiments are illustrative and that the scope of the invention is not limited to them. Many variations, modifications, additions and improvements of the embodiments described are possible.

Claims

WHAT IS CLAIMED IS:
1. A computing apparatus comprising: a processor; a cache coupled to the processor; a plurality of memory regions coupled to the cache; a plurality of memory region tags associated to ones of the plurality of memory regions for tagging cache characteristics of the plurality of memory regions.
2. A computing apparatus according to Claim 1 further comprising: an operating system executable on the processor and including a routine for configuring the cache to the cache characteristics of a selected memory region including configuring symbolic associations.
3. A computing apparatus according to Claim 2 wherein: the operating system includes a hardware abstraction level of a Windows NT™ operating system which includes the routine for configuring the cache.
4. A computing apparatus according to Claim 2 wherein: the operating system includes a device driver which includes the routine for configuring the cache.
5. A computing apparatus according to Claim 1 further comprising: a memory-mapped video I/O buffer coupled to one of the plurality of memory regions.
6. A computing apparatus according to Claim 5 wherein: the memory-mapped video I/O buffer has an associated memory region tag which tags the memory-mapped video I/O buffer with a cache characteristic of writeback cacheable with weak read-write ordering.
7. A computing apparatus according to Claim 1 further comprising: a memory region tag of the plurality of memory region tags for allocating a memory region to a cache line.
8. A computing apparatus according to Claim 1 further comprising: a memory region tag of the plurality of memory region tags for configuring a cache replacement strategy to a selected memory region.
9. A computing apparatus according to Claim 1 further comprising: a memory region tag of the plurality of memory region tags for configuring a cache line size to a cache line allocated to a selected memory region.
10. A computing apparatus according to Claim 1 further comprising: a memory region tag of the plurality of memory region tags for configuring a cache line to include a plurality of storage locations for holding filter coefficients continuously while other storage in the cache line is periodically replaced.
11. A method of operating a computing apparatus including a processor, a cache coupled to the processor, and a plurality of memory regions coupled to the cache, the method comprising the steps of: tagging ones of the plurality of memory regions to define cache characteristics of the plurality of memory regions; selecting a memory region of the plurality of memory regions for supplying information for execution by the processor; and configuring the cache as directed by the tagging of the selected memory region.
12. A method according to Claim 1 1 wherein the computing apparatus further includes a memory-mapped video I/O buffer coupled to one of the plurality of memory regions, the method further comprising the step of: tagging the memory-mapped video I/O buffer with a cache characteristic of writeback cacheable with weak read-write ordering.
13. A method according to Claim 11 further comprising the step of: tagging a selected memory region of the plurality of memory regions for allocating a memory region to a cache line; and allocating a memory region to a cache line.
14. A method according to Claim 11 further comprising the step of: tagging a selected memory region of the plurality of memory regions for configuring a cache replacement strategy to a selected memory region; and configuring a cache replacement strategy to a selected memory region.
15. A method according to Claim 11 further comprising the step of: tagging a selected memory region of the plurality of memory regions for configuring a cache line size to a cache line allocated to a selected memory region; and configuring a cache line size to a cache line allocated to a selected memory region.
16. A method according to Claim 11 further comprising the step of: tagging a selected memory region of the plurality of memory regions for configuring a cache line to include a plurality of storage locations for holding filter coefficients continuously while other storage in the cache line is periodically replaced; and configuring a cache line to include a plurality of storage locations for holding filter coefficients continuously while other storage in the cache line is periodically replaced.
17. A method of providing a computing apparatus comprising the steps of: providing a processor; providing a cache coupled to the processor; providing a plurality of memory regions coupled to the cache, providing a plurality of memory region tags associated to ones of the plurality of memory regions for tagging cache characteristics of the plurality of memory regions.
18. A method according to Claim 17 further comprising the steps of: providing an operating system executable on the processor and including a routine for configuring the cache to the cache characteristics of a selected memory region including configuring symbolic associations.
19. A method according to Claim 18 wherein: the operating system includes a hardware abstraction level of a Windows NT™ operating system which includes the routine for configuring the cache.
20. A method according to Claim 18 wherein: the operating system includes a device driver which includes the routine for configuring the cache.
21. A method according to Claim 17 further comprising the step of: providing a memory-mapped video I/O buffer coupled to one of the plurality of memory regions.
22. A method according to Claim 21 wherein: the memory-mapped video I/O buffer has an associated memory region tag which tags the memory-mapped video I/O buffer with a cache characteristic of writeback cacheable with weak read-write ordering.
23. A method according to Claim 17 further comprising the step of: providing a memory region tag of the plurality of memory region tags for allocating a memory region to a cache line.
24. A method according to Claim 17 further comprising the step of: providing a memory region tag of the plurality of memory region tags for configuring a cache replacement strategy to a selected memory region.
25. A method according to Claim 17 further comprising the step of: providing a memory region tag of the plurality of memory region tags for configuring a cache line size to a cache line allocated to a selected memory region.
26. A method according to Claim 17 further comprising the step of: providing a memory region tag of the plurality of memory region tags for configuring a cache line to include a plurality of storage locations for holding filter coefficients continuously while other storage in the cache line is periodically replaced.
PCT/US1997/008921 1996-07-22 1997-05-27 A cache system and method using tagged cache lines for matching cache strategy to i/o appication WO1998003919A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US08/681,105 US5915262A (en) 1996-07-22 1996-07-22 Cache system and method using tagged cache lines for matching cache strategy to I/O application
US08/681,105 1996-07-22

Publications (1)

Publication Number Publication Date
WO1998003919A1 true WO1998003919A1 (en) 1998-01-29

Family

ID=24733848

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US1997/008921 WO1998003919A1 (en) 1996-07-22 1997-05-27 A cache system and method using tagged cache lines for matching cache strategy to i/o appication

Country Status (2)

Country Link
US (1) US5915262A (en)
WO (1) WO1998003919A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6856577B1 (en) * 1999-09-29 2005-02-15 Furuno Electric Company, Ltd. Method and apparatus for forming a reception beam


Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP0525308A1 (en) * 1991-07-31 1993-02-03 International Business Machines Corporation Memory map for processor cache macro
US5487162A (en) * 1992-02-25 1996-01-23 Matsushita Electric Industrial Co., Ltd. Cache lock information feeding system using an address translator
WO1996010789A1 (en) * 1994-09-30 1996-04-11 Intel Corporation Method and apparatus for processing memory-type information within a microprocessor





Also Published As

Publication number Publication date
US5915262A (en) 1999-06-22


Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A1

Designated state(s): JP KR

AL Designated countries for regional patents

Kind code of ref document: A1

Designated state(s): AT BE CH DE DK ES FI FR GB GR IE IT LU MC NL PT SE

121 Ep: the epo has been informed by wipo that ep was designated in this application
NENP Non-entry into the national phase

Ref country code: JP

Ref document number: 98506912

Format of ref document f/p: F

122 Ep: pct application non-entry in european phase