US20050033922A1 - Embedded DRAM cache - Google Patents
- Publication number
- US20050033922A1 (application US10/934,846)
- Authority
- US
- United States
- Prior art keywords
- memory
- cache
- processor
- cache memory
- port
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Abandoned
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F12/00—Accessing, addressing or allocating within memory systems or architectures
- G06F12/02—Addressing or allocation; Relocation
- G06F12/08—Addressing or allocation; Relocation in hierarchically structured memory systems, e.g. virtual memory systems
- G06F12/0802—Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches
- G06F12/0893—Caches characterised by their organisation or structure
- G06F12/0897—Caches characterised by their organisation or structure with two or more cache hierarchy levels
Definitions
- the present invention relates generally to cache memory structures for a processor based system and, more particularly, to an apparatus that utilizes embedded dynamic random access memory (eDRAM) as a level three (L3) cache in the system chipset of a processor based system.
- the ability of processors to execute instructions has typically outpaced the ability of memory systems to supply the instructions and data to the processors. Due to the discrepancy in the operating speeds of the processors and system memory, the processor system's memory hierarchy plays a major role in determining the actual performance of the system. Most of today's memory hierarchies utilize cache memory in an attempt to minimize memory access latencies.
- Cache memory is used to provide faster access to frequently used instructions and data, which helps improve the overall performance of the system.
- Cache technology is based on the premise that programs frequently reuse the same instructions and data.
- when data is read from main memory, a copy is usually saved in the cache memory (a cache tag is usually updated as well).
- the cache then monitors subsequent requests for data (and instructions) to see if the requested information has already been stored in the cache. If the data has been stored in the cache, it is delivered with low latency to the processor. If, on the other hand, the information is not in the cache, it must be fetched at a much higher latency from the system main memory.
- the first cache level, or level one (L1) cache is typically the fastest memory in the system and is usually integrated on the same chip as the processor.
- the L1 cache is faster because it is integrated with the processor, which avoids delays associated with transmitting information to, and receiving information from, an external chip.
- the lone caveat is that the L1 cache must be small (e.g., 32 Kb in the Intel® Pentium® III processor, 128 Kb in the AMD AthlonTM processor) since it resides on the same die as the processor.
- a second cache level, or level 2 (L2) cache is typically located on a different chip than the processor and has a larger capacity than the L1 cache (e.g., 512 Kb in the Intel® Pentium® III and AMD AthlonTM processors).
- the L2 cache is slower than the L1 cache, but because it is relatively close to the processor, it is still many times faster than the main system memory.
- small L2 cache memories have been placed on the same chip as the processor to speed up the performance of L2 cache memory accesses.
- processor systems consist of a processor with an on-chip L1 static random access memory (SRAM) cache and a separate off-chip L2 SRAM cache.
- a small L2 SRAM cache has been moved onto the same chip as the processor and L1 cache, in which case the reduced latency is traded for a smaller L2 cache size.
- the size of the L1 cache has been increased by moving it onto a separate chip, thus trading off a larger L1 cache for increased latency and reduced bandwidth that result from off chip accesses.
- FIG. 1 illustrates a typical processor based system 10 having a memory hierarchy with two levels of cache memory.
- the system 10 includes a processor 20 having an on-board L1 cache 22 .
- the processor 20 is coupled to an off-chip or external L2 cache 24 .
- the system 10 includes a system chipset comprised of a north bridge 60 and a south bridge 80 . As known in the art, the chipset is the functional core of the system 10 .
- the bridges 60 , 80 are used to connect two or more busses and are responsible for routing information to and from the processor 20 and the other devices in the system 10 over the busses they are connected to.
- the north bridge 60 contains a PCI (peripheral component interconnect) to AGP (accelerated graphics port) interface 62 , a PCI to PCI interface 64 and a host to PCI interface 66 .
- the processor 20 is referred to as the host and is connected to the north bridge 60 via a host bus 30 .
- the system 10 includes a system memory 50 connected to the north bridge 60 via a memory bus 34 .
- the typical system 10 may also include an AGP device 52 , such as e.g., a graphics card, connected to the north bridge 60 via an AGP bus 32 .
- the typical system 10 may include a PCI device 56 connected to the north bridge 60 via a PCI bus 36 a .
- the north bridge 60 is typically connected to the south bridge 80 via a PCI bus 36 b .
- the PCI busses 36 a , 36 b may be individual busses or may be part of the same bus if so desired.
- the south bridge 80 usually contains a real-time clock (RTC) 82 , power management component 84 and the legacy components 86 (e.g., floppy disk controller and certain DMA (direct memory access) and CMOS (complementary metal-oxide semiconductor) memory registers) of the system 10 .
- the south bridge 80 may also contain interrupt controllers, such as the input/output (I/O) APIC (advanced programmable interrupt controller).
- the south bridge 80 may be connected to a USB (universal serial bus) device 92 via a USB bus 38 , an IDE (integrated drive electronics) device 90 via an IDE bus 40 , and/or an LPC (low pin count) device 94 via an LPC/ISA (industry standard architecture) bus 42 .
- the system's BIOS (basic input/output system) ROM 96 (read only memory) is also connected to the south bridge 80 via the LPC/ISA bus 42 .
- the BIOS ROM 96 contains, among other things, the set of instructions that initialize the processor 20 and other components in the system 10 .
- Examples of a USB device 92 include a scanner or a printer.
- Examples of an IDE device 90 include a floppy disk or hard drives, and examples of LPC devices 94 include various controllers and recording devices. It should be appreciated that the type of device connected to the south bridge 80 is system dependent.
- memory access times are further compounded when other devices (e.g., AGP device 52 or PCI device 56 ) are competing with the processor 20 by simultaneously requesting information from the cache and system memories. Accordingly, there is a desire and need for an L3 cache that allows several requesting devices to access its contents simultaneously.
- the present invention provides a third level of high speed cache memory (L3 cache) for a processor based system that is closer to the system processor with respect to the system memory, which reduces average memory latency and thus, increases system bandwidth and overall performance.
- the present invention also provides an L3 cache for a processor based system that is much larger than the L1 and L2 caches, yet does not substantially increase the size of the system.
- the present invention further provides an L3 cache for a processor based system that allows several requesting devices of the system to simultaneously access the contents of the L3 cache.
- the above and other features and advantages are achieved by a large L3 cache that is integrated within the system chipset.
- the L3 cache is comprised of multiple embedded memory cache arrays. Each array is accessible independently of each other, providing parallel access to the L3 cache. By placing the L3 cache within the chipset, it is closer to the system processor with respect to the system memory. By using independent arrays, the L3 cache can handle numerous simultaneous requests. This reduces average memory latency and thus, increases system bandwidth and overall performance.
- the L3 cache can be implemented on the chipset and be much larger than the L1 and L2 caches without substantially increasing the size of the chipset and system.
- FIG. 1 illustrates a typical processor based system having a memory hierarchy with two levels of cache memory
- FIG. 2 is a block diagram illustrating a portion of a processor based system having an eDRAM L3 cache integrated on the system chipset constructed in accordance with an exemplary embodiment of the present invention
- FIG. 3 is a block diagram illustrating an exemplary eDRAM cache utilized in the system illustrated in FIG. 2 .
- FIG. 2 illustrates a portion of a processor based system 110 having an eDRAM L3 cache 200 integrated on the system chipset constructed in accordance with an exemplary embodiment of the present invention.
- the system 110 includes a south bridge 80 and a north bridge 160 .
- the south bridge 80 is connected to a north bridge 160 via a bus such as a PCI bus 36 .
- the north and south bridges comprise the system chipset for the system 110 .
- the system 110 also includes the typical components connected to the south bridge 80 as illustrated in FIG. 1 .
- the south bridge components are not illustrated solely for clarity purposes of FIG. 2 .
- the L3 cache 200 is integrated on the north bridge 160 of the system chipset. As such, the L3 cache is positioned closer to the processor 120 in comparison to the system memory 50 . For example, the processor 120 can access the L3 cache 200 without having to send or receive information over the memory bus 34 (it only has to send/receive information over the host bus 30 ). As will become apparent from the following description, the L3 cache 200 is comprised of multiple independent arrays, which allows multiple devices (e.g., device 52 , processor 120 ) to access the cache 200 at the same time. Furthermore, in a preferred embodiment, the L3 cache 200 comprises eDRAM arrays, which allows it to be larger than the L1 and L2 caches 122 , 124 without substantially increasing the size of the system chipset.
- the north bridge 160 is also connected to a graphics device/unit 52 via an AGP bus 32 , the system memory 50 via the memory bus 34 , and the processor 120 via the host bus 30 .
- the processor 120 contains on-board or integrated L1 and L2 caches 122 , 124 .
- the L1 cache 122 may be e.g., 128 Kb and the L2 cache 124 may be e.g., 512 Kb.
- the size of the L1 and L2 caches 122 , 124 is purely exemplary and is not important to practice the present invention. Thus, the invention is not to be limited to particular sizes of the L1 and L2 caches 122 , 124 . All that is required to practice the invention is that the pertinent system components (e.g., processor 120 ) realize that the memory hierarchy comprises three levels of cache and that the L3 cache 200 is integrated on the system chipset and is constructed as described below.
- the north bridge 160 may include an AGP interface 162 , a memory controller 168 , PCI interface 166 and processor interface 170 .
- the L3 cache 200 , AGP interface 162 , memory controller 168 , PCI interface 166 and processor interface 170 are each coupled to a switch 172 , which allows information to be passed between these components and the outside devices and buses.
- the L3 cache 200 is tied directly to the processor interface 170 , which reduces the latency of accesses to the cache 200 .
- the L3 cache 200 is comprised of multiple independent eDRAM arrays, which allows multiple devices to access the L3 cache 200 at the same time. Moreover, by using eDRAM arrays, the L3 cache 200 occupies much less area than a comparable SRAM implemented cache.
- the L3 cache 200 is described in detail with respect to FIG. 3 .
- FIG. 3 is a block diagram illustrating an exemplary L3 cache 200 utilized in the system 110 of FIG. 2 .
- the L3 cache 200 is shown as being directly connected to the system memory 50 . It should be appreciated, however, that the L3 cache 200 is connected to the system memory 50 through the switch 172 , memory controller 168 and memory bus 34 as shown in FIG. 2 or by any other arrangement deemed suitable for this connection.
- the L3 cache 200 comprises a plurality of eDRAM arrays 210 a , 210 b , 210 c , 210 d (collectively referred to herein as “eDRAM arrays 210 ”).
- FIG. 3 illustrates four eDRAM arrays 210 , it should be appreciated that any number of arrays 210 can be used to practice the invention and the number of arrays 210 is application specific.
- the L3 cache 200 includes eight independent one Mb eDRAM arrays 210 , with each array 210 being 128 bits wide.
- the L3 cache 200 size is eight Mb, which is substantially larger than the L1 and L2 cache sizes of 128 Kb and 512 Kb, respectively.
- each array 210 a , 210 b , 210 c , 210 d has its own local memory controller 212 a , 212 b , 212 c , 212 d (collectively referred to herein as “controllers 212 ”).
- the controllers 212 include logic to access the arrays 210 and to perform DRAM operations such as, e.g., refresh.
- the L3 cache 200 is a direct mapped cache, with each array 210 a , 210 b , 210 c , 210 d being associated with a respective tag array 214 a , 214 b , 214 c , 214 d (collectively referred to herein as “tag arrays 214 ”).
- the tag arrays 214 may be implemented with eDRAM also, but other types of memory may be used if desired.
- Each entry in the cache 200 is accessed by an address tag stored in the tag arrays 214 .
- each main memory address maps to a unique location within the cache.
- the addresses from the system memory 50 are given unique addresses in the L3 cache 200 . Because each array 210 a , 210 b , 210 c , 210 d has its own controller 212 a , 212 b , 212 c , 212 d and tag array 214 a , 214 b , 214 c , 214 d , they are independently accessible.
- the L3 cache 200 comprises a plurality of independent direct mapped caches. It should be appreciated that the L3 cache 200 could be configured to be a fully associative (i.e., main memory addresses can correspond to any cache location) or set associative (i.e., each address tag corresponds to a set of cache locations) cache memory if so desired and if space is available on the chipset.
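The direct mapped organization described above can be sketched as a simple address split. The field widths below (32-byte lines, 8192 lines per array) are illustrative assumptions, not values from the patent:

```python
# Illustrative direct-mapped address split: each system-memory address
# maps to exactly one cache line, selected by an index field; the tag
# field stored in the tag array records which memory address currently
# occupies that line. Field widths here are assumptions.

LINE_SIZE = 32        # bytes per cache line (assumption)
NUM_LINES = 8192      # lines per array (assumption)

def split_address(addr):
    """Return (tag, index, offset) for a direct-mapped lookup."""
    offset = addr % LINE_SIZE
    index = (addr // LINE_SIZE) % NUM_LINES
    tag = addr // (LINE_SIZE * NUM_LINES)
    return tag, index, offset

# Two addresses that share an index but differ in tag conflict for the
# same cache line -- the defining property of a direct mapped cache.
t1, i1, _ = split_address(0x0000_1000)
t2, i2, _ = split_address(0x0000_1000 + LINE_SIZE * NUM_LINES)
assert i1 == i2 and t1 != t2
```

A fully associative or set associative variant, as the passage above notes, would let a tag occupy any location (or any location within a set) instead of exactly one.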
- a master scheduler 202 is connected to the eDRAM arrays 210 and serves as the controller of the cache 200 . Multiple requests REQ are allowed to enter the master scheduler 202 , which is responsible for resolving resource conflicts within the cache 200 .
- the scheduler 202 serves as a cross-bar controller for the multiple requestors trying to gain access into the cache 200 and for the eDRAM arrays 210 trying to output information to the requestors.
- the use of independent arrays 210 and the scheduler 202 reduces bank conflict and read/write turnarounds.
- the arrays 210 also allow for multiple pages of memory to be kept open, which also reduces latency. Moreover, traffic from several I/O streams, AGP devices, the processor, etc. can be handled concurrently.
- a tag lookup determines if there is a cache hit or miss. If there is a cache hit, the local controller 212 accesses the associated eDRAM array 210 and outputs the data to the scheduler 202 . The master scheduler 202 then routes the data to the appropriate requestor. Thus, the architecture of the cache 200 maximizes system throughput. If, on the other hand, a cache miss is detected, the request REQ is forwarded to the system memory 50 . The data is returned from the system memory 50 and a cache tag update is scheduled.
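The request flow just described — tag lookup, hit served by the local controller, miss forwarded to system memory followed by a tag update — can be sketched as a toy model. The class names, the address split, and the array interleaving below are all illustrative assumptions, not details from the patent:

```python
class ArrayController:
    """Stand-in for a local controller 212: tag lookup plus data access."""
    def __init__(self):
        self.tags = {}    # index -> tag   (models a tag array 214)
        self.data = {}    # index -> data  (models an eDRAM array 210)

    def lookup(self, tag, index):
        return self.tags.get(index) == tag   # hit or miss

class MasterScheduler:
    """Stand-in for the master scheduler 202: routes requests to arrays."""
    def __init__(self, num_arrays, system_memory):
        self.arrays = [ArrayController() for _ in range(num_arrays)]
        self.system_memory = system_memory

    def request(self, addr):
        index, tag = (addr >> 5) & 0x1FFF, addr >> 18   # assumed split
        ctrl = self.arrays[index % len(self.arrays)]    # assumed interleave
        if ctrl.lookup(tag, index):
            return ctrl.data[index], "hit"
        # Miss: forward to system memory, then schedule a tag update.
        value = self.system_memory[addr]
        ctrl.tags[index], ctrl.data[index] = tag, value
        return value, "miss"

mem = {0x1000: 0xAB}
sched = MasterScheduler(num_arrays=4, system_memory=mem)
assert sched.request(0x1000) == (0xAB, "miss")   # first access misses
assert sched.request(0x1000) == (0xAB, "hit")    # repeat access hits
```

Because each `ArrayController` owns its own tags and data, requests to different arrays could proceed independently, which is the property the independent arrays 210 provide.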
- the L3 cache 200 will implement cache replacement (triggered by a cache replacement request CACHE REPLACEMENT) and eviction methods when needed. Any method of performing cache replacement and eviction can be utilized by the present invention and thus, the invention should not be limited in any way to any particular method for doing so.
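The patent deliberately leaves the replacement and eviction method open. For a direct mapped cache, the minimal policy is to evict whatever currently occupies the conflicting line; the dirty-bit writeback below is an assumed detail, shown only as one possible sketch:

```python
# Minimal replacement for one line of a direct mapped cache: the
# incoming data always evicts the current occupant of its index.
# The dirty bit and writeback callback are assumptions -- the patent
# does not prescribe any particular replacement or eviction method.

class DirectMappedLine:
    def __init__(self):
        self.tag, self.data, self.dirty = None, None, False

def replace(line, new_tag, new_data, writeback):
    """Install new_tag/new_data in `line`, evicting the old occupant."""
    if line.tag is not None and line.dirty:
        writeback(line.tag, line.data)   # flush a modified victim
    line.tag, line.data, line.dirty = new_tag, new_data, False

evicted = []
def writeback(tag, data):
    evicted.append((tag, data))

line = DirectMappedLine()
replace(line, 1, "A", writeback)     # empty line: nothing to evict
line.dirty = True                    # simulate a write to the line
replace(line, 2, "B", writeback)     # dirty victim is written back
assert evicted == [(1, "A")]
assert (line.tag, line.data, line.dirty) == (2, "B", False)
```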
- the present invention provides a large L3 cache 200 that is integrated within the system chipset.
- the cache 200 is integrated on the same chip as the north bridge 160 .
- the L3 cache 200 is comprised of multiple embedded memory cache arrays 210 . Each array 210 is accessible independently of each other, providing parallel access to the L3 cache 200 . By placing the L3 cache 200 within the chipset, it is closer to the system processor 120 with respect to the system memory 50 . By using independent arrays 210 , the L3 cache 200 can handle numerous simultaneous requests REQ. These features reduce average memory latency and thus, increase system bandwidth and overall performance.
- the L3 cache 200 can be implemented on the chipset and be much larger than the L1 and L2 caches 122 , 124 without substantially increasing the size of the chipset and system 110 .
- the L3 cache 200 is eight Mbytes of eDRAM and is constructed of eight independent one Mbyte eDRAM arrays 210 .
- Each array 210 , for example, can be 128 bits wide and operate at 200 MHz, which means that each array 210 can provide 3.2 gigabytes of information per second.
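The 3.2 gigabyte-per-second figure follows directly from the stated width and clock, assuming one 128-bit transfer per cycle (the transfers-per-cycle figure is an assumption, not stated in the patent):

```python
# Per-array bandwidth: 128 bits per transfer at 200 MHz,
# assuming one transfer per clock cycle.
bits_per_transfer = 128
clock_hz = 200_000_000

bytes_per_second = (bits_per_transfer // 8) * clock_hz
assert bytes_per_second == 3_200_000_000   # 3.2 GB/s per array

# With eight independent arrays accessed in parallel, aggregate peak
# bandwidth would be 8 x 3.2 = 25.6 GB/s under the same assumption.
assert 8 * bytes_per_second == 25_600_000_000
```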
- Each array 210 has its own local memory controller 212 .
- One central controller manages conflicts between arrays.
- the L3 cache 200 is direct mapped with a tag array 214 associated with each eDRAM array 210 . This allows for independent tag lookups for each array 210 , which allows for multiple requestors to access the cache 200 concurrently.
- the independent arrays 210 also reduce bank conflict and read/write turnarounds. Thus, traffic to/from multiple requestors can be handled concurrently.
- the present invention has been described using eDRAM arrays 210 because substantially larger eDRAM arrays can be implemented in the system chipset in comparison to other types of memory (e.g., SRAM). It should be noted that other types of embedded memory could be used in the L3 cache 200 if desired. Although the invention has been described using an eight Mb L3 cache 200 with 1 Mb eDRAM arrays 210 , the L3 cache 200 of the present invention may be sixteen Mb or any other size that would increase the performance of the system without adversely impacting the size of the chipset. The sizes of the arrays 210 may also be modified as required to increase the performance of the system. Furthermore, although described with reference to a single processor 120 , the above invention may be implemented in a multiple processor system.
Abstract
A large level three (L3) cache is integrated within the system chipset. The L3 cache is comprised of multiple embedded memory cache arrays. Each array is accessible independently of each other, providing parallel access to the L3 cache. By placing the L3 cache within the chipset, it is closer to the system processor with respect to the system memory. By using independent arrays, the L3 cache can handle numerous simultaneous requests. This reduces average memory latency and thus, increases system bandwidth and overall performance. By using embedded memory, the L3 cache can be implemented on the chipset and be much larger than the L1 and L2 caches without substantially increasing the size of the chipset and system.
Description
- The present invention relates generally to cache memory structures for a processor based system and, more particularly, to an apparatus that utilizes embedded dynamic random access memory (eDRAM) as a level three (L3) cache in the system chipset of a processor based system.
- The ability of processors to execute instructions has typically outpaced the ability of memory systems to supply the instructions and data to the processors. Due to the discrepancy in the operating speeds of the processors and system memory, the processor system's memory hierarchy plays a major role in determining the actual performance of the system. Most of today's memory hierarchies utilize cache memory in an attempt to minimize memory access latencies.
- Cache memory is used to provide faster access to frequently used instructions and data, which helps improve the overall performance of the system. Cache technology is based on the premise that programs frequently reuse the same instructions and data. When data is read from main memory, a copy is usually saved in the cache memory (a cache tag is usually updated as well). The cache then monitors subsequent requests for data (and instructions) to see if the requested information has already been stored in the cache. If the data has been stored in the cache, it is delivered with low latency to the processor. If, on the other hand, the information is not in the cache, it must be fetched at a much higher latency from the system main memory.
- In more advanced processor based systems, there are multiple levels (usually two levels) of cache memory. The levels are organized such that a small amount of very high speed memory is placed close to the processor while denser, slower memory is placed further away. In the memory hierarchy, the closer to the processor that the data resides, the higher the performance of the memory and the overall system. When data is not found in the highest level of the hierarchy and a miss occurs, the data must be accessed from a lower level of the memory hierarchy. Since each level contains increased amounts of storage, the probability increases that the data will be found. However, each level typically increases the latency or number of cycles it takes to transfer the data to the processor.
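The level-by-level search just described — capacity growing and latency rising at each step away from the processor — can be sketched as a toy model. The latency figures below are illustrative assumptions, not values from the patent:

```python
# Toy memory hierarchy: each level is (name, latency_cycles, contents).
# Capacities grow and latencies rise moving away from the processor;
# the cycle counts here are assumptions for illustration only.

hierarchy = [
    ("L1", 2,  {}),   # small, fastest
    ("L2", 10, {}),   # larger, slower
    ("L3", 30, {}),   # larger still (e.g., the chipset eDRAM cache)
]
MAIN_MEMORY_LATENCY = 100

def access(addr, main_memory):
    """Search each level in turn; on a full miss, fetch from memory."""
    latency = 0
    for name, cost, contents in hierarchy:
        latency += cost
        if addr in contents:
            return contents[addr], latency     # hit at this level
    data = main_memory[addr]
    latency += MAIN_MEMORY_LATENCY
    for _, _, contents in hierarchy:
        contents[addr] = data                  # fill the levels
    return data, latency

mem = {0x40: 123}
assert access(0x40, mem) == (123, 2 + 10 + 30 + 100)  # cold miss
assert access(0x40, mem) == (123, 2)                  # now hits in L1
```

The model shows why each added level raises the probability of finding the data while also adding cycles to the worst-case path.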
- The first cache level, or level one (L1) cache, is typically the fastest memory in the system and is usually integrated on the same chip as the processor. The L1 cache is faster because it is integrated with the processor, which avoids delays associated with transmitting information to, and receiving information from, an external chip. The lone caveat is that the L1 cache must be small (e.g., 32 Kb in the Intel® Pentium® III processor, 128 Kb in the AMD Athlon™ processor) since it resides on the same die as the processor.
- A second cache level, or level 2 (L2) cache, is typically located on a different chip than the processor and has a larger capacity than the L1 cache (e.g., 512 Kb in the Intel® Pentium® III and AMD Athlon™ processors). The L2 cache is slower than the L1 cache, but because it is relatively close to the processor, it is still many times faster than the main system memory. Recently, small L2 cache memories have been placed on the same chip as the processor to speed up the performance of L2 cache memory accesses.
- Many current processor systems consist of a processor with an on-chip L1 static random access memory (SRAM) cache and a separate off-chip L2 SRAM cache. In some systems, a small L2 SRAM cache has been moved onto the same chip as the processor and L1 cache, in which case the reduced latency is traded for a smaller L2 cache size. In other systems, the size of the L1 cache has been increased by moving it onto a separate chip, thus trading off a larger L1 cache for increased latency and reduced bandwidth that result from off chip accesses. These options are attempts to achieve the highest system performance by optimizing the memory hierarchy. In each case, various tradeoffs between size, latency, and bandwidth are made in an attempt to deal with the conflicting requirements of obtaining more, faster, and closer memory.
-
FIG. 1 illustrates a typical processor based system 10 having a memory hierarchy with two levels of cache memory. The system 10 includes a processor 20 having an on-board L1 cache 22. The processor 20 is coupled to an off-chip or external L2 cache 24. The system 10 includes a system chipset comprised of a north bridge 60 and a south bridge 80. As known in the art, the chipset is the functional core of the system 10. As will be described below, the bridges 60, 80 are used to connect two or more busses and are responsible for routing information to and from the processor 20 and the other devices in the system 10 over the busses they are connected to. - The
north bridge 60 contains a PCI (peripheral component interconnect) to AGP (accelerated graphics port) interface 62, a PCI to PCI interface 64 and a host to PCI interface 66. Typically, the processor 20 is referred to as the host and is connected to the north bridge 60 via a host bus 30. The system 10 includes a system memory 50 connected to the north bridge 60 via a memory bus 34. The typical system 10 may also include an AGP device 52, such as e.g., a graphics card, connected to the north bridge 60 via an AGP bus 32. Furthermore, the typical system 10 may include a PCI device 56 connected to the north bridge 60 via a PCI bus 36 a. - The
north bridge 60 is typically connected to the south bridge 80 via a PCI bus 36 b. The PCI busses 36 a, 36 b may be individual busses or may be part of the same bus if so desired. The south bridge 80 usually contains a real-time clock (RTC) 82, power management component 84 and the legacy components 86 (e.g., floppy disk controller and certain DMA (direct memory access) and CMOS (complementary metal-oxide semiconductor) memory registers) of the system 10. Although not illustrated, the south bridge 80 may also contain interrupt controllers, such as the input/output (I/O) APIC (advanced programmable interrupt controller). - The south
bridge 80 may be connected to a USB (universal serial bus) device 92 via a USB bus 38, an IDE (integrated drive electronics) device 90 via an IDE bus 40, and/or an LPC (low pin count) device 94 via an LPC/ISA (industry standard architecture) bus 42. The system's BIOS (basic input/output system) ROM 96 (read only memory) is also connected to the south bridge 80 via the LPC/ISA bus 42. The BIOS ROM 96 contains, among other things, the set of instructions that initialize the processor 20 and other components in the system 10. Examples of a USB device 92 include a scanner or a printer. Examples of an IDE device 90 include a floppy disk or hard drives, and examples of LPC devices 94 include various controllers and recording devices. It should be appreciated that the type of device connected to the south bridge 80 is system dependent. - As can be seen from
FIG. 1, when the processor 20 can not access information from one of the two caches 22, 24, it must retrieve the information from the system memory 50. This means that at least two buses 30, 34 and the north bridge 60 must be involved to access the information from the system memory 50, which increases the latency of the access. Increased latency reduces the system bandwidth and overall performance. Accordingly, there is a desire and need for a third level of high speed cache memory (“L3 cache”) that is closer to the processor 20 with respect to the system memory 50. Moreover, it is desirable that the L3 cache be much larger than the L1 and L2 caches 22, 24, yet not substantially increase the size of the system 10. - Additionally, it should be noted that memory access times are further compounded when other devices e.g.,
AGP device 52 or PCI device 56 are competing with the processor 20 by simultaneously requesting information from the cache and system memories. Accordingly, there is a desire and need for an L3 cache that allows several requesting devices to access its contents simultaneously. - The present invention provides a third level of high speed cache memory (L3 cache) for a processor based system that is closer to the system processor with respect to the system memory, which reduces average memory latency and thus, increases system bandwidth and overall performance.
- The present invention also provides an L3 cache for a processor based system that is much larger than the L1 and L2 caches, yet does not substantially increase the size of the system.
- The present invention further provides an L3 cache for a processor based system that allows several requesting devices of the system to simultaneously access the contents of the L3 cache.
- The above and other features and advantages are achieved by a large L3 cache that is integrated within the system chipset. The L3 cache is comprised of multiple embedded memory cache arrays. Each array is accessible independently of each other, providing parallel access to the L3 cache. By placing the L3 cache within the chipset, it is closer to the system processor with respect to the system memory. By using independent arrays, the L3 cache can handle numerous simultaneous requests. This reduces average memory latency and thus, increases system bandwidth and overall performance. By using embedded memory, the L3 cache can be implemented on the chipset and be much larger than the L1 and L2 caches without substantially increasing the size of the chipset and system.
- The foregoing and other advantages and features of the invention will become more apparent from the detailed description of exemplary embodiments provided below with reference to the accompanying drawings in which:
-
FIG. 1 illustrates a typical processor based system having a memory hierarchy with two levels of cache memory; -
FIG. 2 is a block diagram illustrating a portion of a processor based system having an eDRAM L3 cache integrated on the system chipset constructed in accordance with an exemplary embodiment of the present invention; and -
FIG. 3 is a block diagram illustrating an exemplary eDRAM cache utilized in the system illustrated in FIG. 2. -
FIG. 2 illustrates a portion of a processor based system 110 having an eDRAM L3 cache 200 integrated on the system chipset constructed in accordance with an exemplary embodiment of the present invention. The system 110 includes a south bridge 80 and a north bridge 160. The south bridge 80 is connected to the north bridge 160 via a bus such as a PCI bus 36. The north and south bridges comprise the system chipset for the system 110. Although not illustrated, the system 110 also includes the typical components connected to the south bridge 80 as illustrated in FIG. 1. The south bridge components are not illustrated solely for clarity purposes of FIG. 2. - In the illustrated embodiment, the
L3 cache 200 is integrated on the north bridge 160 of the system chipset. As such, the L3 cache is positioned closer to the processor 120 in comparison to the system memory 50. For example, the processor 120 can access the L3 cache 200 without having to send or receive information over the memory bus 34 (it only has to send/receive information over the host bus 30). As will become apparent from the following description, the L3 cache 200 is comprised of multiple independent arrays, which allows multiple devices (e.g., device 52, processor 120) to access the cache 200 at the same time. Furthermore, in a preferred embodiment, the L3 cache 200 comprises eDRAM arrays, which allows it to be larger than the L1 and L2 caches 122, 124 without substantially increasing the size of the system chipset.
unit 52 via an AGP bus 32, the system memory 50 via the memory bus 34, and the processor 120 via the host bus 30. In the illustrated embodiment, the processor 120 contains on-board or integrated L1 and L2 caches 122, 124. The L1 cache 122 may be e.g., 128 Kb and the L2 cache 124 may be e.g., 512 Kb. It should be appreciated that the size of the L1 and L2 caches 122, 124 is purely exemplary and is not important to practice the present invention. Thus, the invention is not to be limited to particular sizes of the L1 and L2 caches 122, 124. All that is required to practice the invention is that the pertinent system components (e.g., processor 120) realize that the memory hierarchy comprises three levels of cache and that the L3 cache 200 is integrated on the system chipset and is constructed as described below. - In addition to the
L3 cache 200, the north bridge 160 may include an AGP interface 162, a memory controller 168, a PCI interface 166 and a processor interface 170. The L3 cache 200, AGP interface 162, memory controller 168, PCI interface 166 and processor interface 170 are each coupled to a switch 172, which allows information to be passed between these components and the outside devices and buses. The L3 cache 200 is tied directly to the processor interface 170, which reduces the latency of accesses to the cache 200. As noted above, the L3 cache 200 is comprised of multiple independent eDRAM arrays, which allows multiple devices to access the L3 cache 200 at the same time. Moreover, because eDRAM is denser than SRAM, the L3 cache 200 occupies much less area than a typical SRAM implemented cache of the same capacity. The L3 cache 200 is described in detail with respect to FIG. 3.
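The independently accessible arrays described above can be illustrated with a minimal sketch. This is a hypothetical Python model, not taken from the patent: the bank count, line size, and bank-selection scheme are illustrative assumptions.

```python
# Hypothetical model (not from the patent text): consecutive cache lines
# are spread across independent arrays, so requests that map to different
# arrays have no structural conflict and can be serviced concurrently.
NUM_BANKS = 4    # e.g., four independent arrays (illustrative)
LINE_SIZE = 64   # bytes per cache line (an assumption; not specified)

def bank_of(address: int) -> int:
    """Select the independent array servicing this address from low line-address bits."""
    return (address // LINE_SIZE) % NUM_BANKS

# Two requests to adjacent lines land in different banks and can proceed in parallel:
req_a, req_b = 0x1000, 0x1040
assert bank_of(req_a) != bank_of(req_b)
```

With interleaving of this kind, streams from different requestors (the processor, the graphics device 52, I/O) tend to touch different arrays at any instant, which is one plausible reason independent arrays reduce bank conflicts.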
FIG. 3 is a block diagram illustrating an exemplary L3 cache 200 utilized in the system 110 of FIG. 2. For illustration purposes only, the L3 cache 200 is shown as being directly connected to the system memory 50. It should be appreciated, however, that the L3 cache 200 is connected to the system memory 50 through the switch 172, memory controller 168 and memory bus 34 as shown in FIG. 2, or by any other arrangement deemed suitable for this connection. The
L3 cache 200 comprises a plurality of eDRAM arrays 210a, 210b, 210c, 210d (collectively referred to herein as "eDRAM arrays 210"). Although FIG. 3 illustrates four eDRAM arrays 210, it should be appreciated that any number of arrays 210 can be used to practice the invention and the number of arrays 210 is application specific. In one desired embodiment, the L3 cache 200 includes eight independent one Mb eDRAM arrays 210, with each array 210 being 128 bits wide. Thus, in one embodiment, the L3 cache 200 size is eight Mb, which is substantially larger than the L1 and L2 cache sizes of 128 Kb and 512 Kb, respectively. It is desired that each array 210a, 210b, 210c, 210d have its own local memory controller 212a, 212b, 212c, 212d (collectively referred to herein as "controllers 212"). The controllers 212 include logic to access the arrays 210 and to perform DRAM operations such as, e.g., refresh. In one embodiment, the
L3 cache 200 is a direct mapped cache, with each array 210a, 210b, 210c, 210d being associated with a respective tag array 214a, 214b, 214c, 214d (collectively referred to herein as "tag arrays 214"). The tag arrays 214 may also be implemented with eDRAM, but other types of memory may be used if desired. Each entry in the
cache 200 is accessed by an address tag stored in the tag arrays 214. As is known in the art, in a direct mapped cache, each main memory address maps to a unique location within the cache. Thus, if the L3 cache 200 is implemented as a direct mapped cache, each address from the system memory 50 maps to a unique location in the L3 cache 200. Because each array 210a, 210b, 210c, 210d has its own controller 212a, 212b, 212c, 212d and tag array 214a, 214b, 214c, 214d, the arrays are independently accessible. Essentially, the L3 cache 200 comprises a plurality of independent direct mapped caches. It should be appreciated that the L3 cache 200 could be configured as a fully associative (i.e., a main memory address can correspond to any cache location) or set associative (i.e., each address tag corresponds to a set of cache locations) cache memory if so desired and if space is available on the chipset. A
master scheduler 202 is connected to the eDRAM arrays 210 and serves as the controller of the cache 200. Multiple requests REQ are allowed to enter the master scheduler 202, which is responsible for resolving resource conflicts within the cache 200. In essence, the scheduler 202 serves as a cross-bar controller for the multiple requestors trying to gain access to the cache 200 and for the eDRAM arrays 210 trying to output information to the requestors. The use of independent arrays 210 and the scheduler 202 reduces bank conflicts and read/write turnarounds. The arrays 210 also allow for multiple pages of memory to be kept open, which also reduces latency. Moreover, traffic from several I/O streams, AGP devices, the processor, etc. can be handled concurrently. In operation, when a request REQ is received and a given eDRAM array 210 is free, a tag lookup determines if there is a cache hit or miss. If there is a cache hit, the local controller 212 accesses the associated eDRAM array 210 and outputs the data to the
scheduler 202. The master scheduler 202 then routes the data to the appropriate requestor. Thus, the architecture of the cache 200 maximizes system throughput. If, on the other hand, a cache miss is detected, the request REQ is forwarded to the system memory 50. The data is returned from the system memory 50 and a cache tag update is scheduled. It should be noted that the
L3 cache 200 will implement cache replacement (triggered by a cache replacement request CACHE REPLACEMENT) and eviction methods when needed. Any method of performing cache replacement and eviction can be utilized by the present invention and, thus, the invention should not be limited in any way to any particular method for doing so. Thus, referring to
FIGS. 2 and 3, the present invention provides a large L3 cache 200 that is integrated within the system chipset. In a preferred embodiment, the cache 200 is integrated on the same chip as the north bridge 160. The L3 cache 200 is comprised of multiple embedded memory cache arrays 210. Each array 210 is accessible independently of the others, providing parallel access to the L3 cache 200. By placing the L3 cache 200 within the chipset, it is closer to the system processor 120 than the system memory 50 is. By using independent arrays 210, the L3 cache 200 can handle numerous simultaneous requests REQ. These features reduce average memory latency and, thus, increase system bandwidth and overall performance. By using embedded memory, the L3 cache 200 can be implemented on the chipset and be much larger than the L1 and L2 caches 122, 124 of the system 110. In one exemplary embodiment, the
L3 cache 200 is eight Mbytes of eDRAM and is constructed of eight independent one Mbyte eDRAM arrays 210. Each array 210, for example, can be 128 bits wide and operate at 200 MHz, which means that each array 210 can provide 3.2 gigabytes of information per second. Each array 210 has its own local memory controller 212. One central controller manages conflicts between arrays. The L3 cache 200 is direct mapped, with a tag array 214 associated with each eDRAM array 210. This allows for independent tag lookups for each array 210, which allows multiple requestors to access the cache 200 concurrently. The independent arrays 210 also reduce bank conflicts and read/write turnarounds. Thus, traffic to/from multiple requestors can be handled concurrently. The present invention has been described using eDRAM arrays 210 because substantially larger eDRAM arrays can be implemented in the system chipset in comparison to other types of memory (e.g., SRAM). It should be noted that other types of embedded memory could be used in the
L3 cache 200 if desired. Although the invention has been described using an eight Mb L3 cache 200 with 1 Mb eDRAM arrays 210, the L3 cache 200 of the present invention may be sixteen Mb or any other size that would increase the performance of the system without adversely impacting the size of the chipset. The sizes of the arrays 210 may also be modified as required to increase the performance of the system. Furthermore, although described with reference to a single processor 120, the above invention may be implemented in a multiple processor system. While the invention has been described and illustrated with reference to exemplary embodiments, many variations can be made and equivalents substituted without departing from the spirit or scope of the invention. Accordingly, the invention is not to be understood as being limited by the foregoing description, but is only limited by the scope of the appended claims.
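The direct mapped operation described in the foregoing description — each main memory address mapping to one cache location, a tag compare deciding hit or miss, and a tag update scheduled when miss data returns from system memory — can be sketched as follows. This is a simplified single-array Python model; the line size and line count are illustrative assumptions, not parameters from any embodiment.

```python
# Simplified direct-mapped cache model: the line index selects exactly one
# cache location; the stored tag decides hit or miss. Sizes are assumptions.
LINE_SIZE = 64       # bytes per line (assumed)
NUM_LINES = 1024     # lines in the array (assumed)

tags = [None] * NUM_LINES   # tag array (one entry per cache line)
data = [None] * NUM_LINES   # cache storage array

def split(address: int):
    """Decompose an address into a (line index, tag) pair."""
    line = address // LINE_SIZE
    return line % NUM_LINES, line // NUM_LINES

def lookup(address: int):
    """Return cached data on a hit, or None on a miss."""
    index, tag = split(address)
    if tags[index] == tag:
        return data[index]   # hit: local controller returns the data
    return None              # miss: request would be forwarded to system memory

def fill(address: int, value):
    """Install data returned from system memory and update the tag."""
    index, tag = split(address)
    tags[index], data[index] = tag, value

assert lookup(0x12340) is None         # cold cache: miss
fill(0x12340, b"payload")
assert lookup(0x12340) == b"payload"   # subsequent access hits
```

Because each main memory address maps to exactly one line, an address sharing the same index but carrying a different tag evicts the previous occupant, which is why the description provides for replacement and eviction methods. As an arithmetic check of the figure quoted above, a 128-bit (16-byte) interface at 200 MHz transfers 16 × 200,000,000 = 3.2 gigabytes per second per array.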
Claims (13)
1-28. (cancelled)
29. An integrated circuit for use with a processor system, said integrated circuit comprising:
a memory controller for controlling a system memory of the processor system;
a processor interface coupled to said memory controller, said processor interface for transmitting information between a processor and the system memory;
a cache memory coupled to said memory controller and said processor interface, said cache memory comprising a plurality of independently accessible memory arrays; and
a master scheduler coupled to each of said memory arrays, said master scheduler for processing requests for information from the processor and other components by independently and concurrently forwarding information, and resolving conflicts for information, between said memory arrays, said processor, and said other components.
30. The integrated circuit of claim 29 , wherein each one of said memory arrays comprises embedded memory.
31. The integrated circuit of claim 29, wherein said integrated circuit includes eight memory arrays.
32. The integrated circuit of claim 29, wherein said memory arrays are operated as a direct mapped cache memory.
33. The integrated circuit of claim 29 , wherein said cache memory forms a lower level cache memory system in combination with at least one off-chip higher level cache memory.
34. The integrated circuit of claim 33 , wherein said cache memory forms a third level cache memory of a cache memory system comprising a first level cache memory and a second level cache memory of a processor coupled to said integrated circuit.
35. A multi-port bus bridge, formed on an integrated circuit, the bus bridge comprising:
a communication network;
a cache memory, coupled to said communication network, said cache memory comprising a plurality of independently and concurrently accessible units, each of said units comprising an embedded unit memory and a unit memory controller;
a plurality of ports, for communicating with respective off-chip devices;
a plurality of port controllers, each associated with a respective one of said plurality of ports, each of said port controllers coupled to said communication network and for transmitting information among said plurality of ports and said cache memory; and
a bridge controller, coupled to said plurality of port controllers and said cache memory, said bridge controller processing requests received over said plurality of ports to resolve conflicts between said units to permit information to be concurrently and independently forwarded between said port controllers and said units.
36. The multi-port bridge of claim 35, wherein said cache memory is a direct mapped cache.
37. The multi-port bridge of claim 35 , wherein each said embedded unit memory comprises a tag array and a cache storage array.
38. The multi-port bridge of claim 35 , wherein said plurality of ports comprise:
a processor port, for coupling to a processor of a processor based system;
a memory port, for coupling to a system memory of said processor based system; and
an expansion bus port, for coupling to an expansion bus of said processor based system.
39. The multi-port bridge of claim 35, wherein said multi-port bridge is a north bridge of a computer system.
40. The multi-port bridge of claim 38 , wherein said cache memory is one of a plurality of cache memories in a cache memory system of said processor based system.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US10/934,846 US20050033922A1 (en) | 2001-07-13 | 2004-09-07 | Embedded DRAM cache |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US09/903,624 US6789168B2 (en) | 2001-07-13 | 2001-07-13 | Embedded DRAM cache |
US10/934,846 US20050033922A1 (en) | 2001-07-13 | 2004-09-07 | Embedded DRAM cache |
Related Parent Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US09/903,624 Continuation US6789168B2 (en) | 2001-07-13 | 2001-07-13 | Embedded DRAM cache |
Publications (1)
Publication Number | Publication Date |
---|---|
US20050033922A1 true US20050033922A1 (en) | 2005-02-10 |
Family
ID=25417807
Family Applications (2)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US09/903,624 Expired - Lifetime US6789168B2 (en) | 2001-07-13 | 2001-07-13 | Embedded DRAM cache |
US10/934,846 Abandoned US20050033922A1 (en) | 2001-07-13 | 2004-09-07 | Embedded DRAM cache |
Family Applications Before (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US09/903,624 Expired - Lifetime US6789168B2 (en) | 2001-07-13 | 2001-07-13 | Embedded DRAM cache |
Country Status (1)
Country | Link |
---|---|
US (2) | US6789168B2 (en) |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US8195858B1 (en) * | 2009-07-28 | 2012-06-05 | Nvidia Corporation | Managing conflicts on shared L2 bus |
US8321618B1 (en) | 2009-07-28 | 2012-11-27 | Nvidia Corporation | Managing conflicts on shared L2 bus |
US9305616B2 (en) | 2012-07-17 | 2016-04-05 | Samsung Electronics Co., Ltd. | Semiconductor memory cell array having fast array area and semiconductor memory including the same |
US9384092B2 (en) | 2013-06-26 | 2016-07-05 | Samsung Electronics Co., Ltd. | Semiconductor memory device with multiple sub-memory cell arrays and memory system including same |
Families Citing this family (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2002532717A (en) | 1998-12-11 | 2002-10-02 | サイミックス テクノロジーズ、インク | Sensor array based system and method for rapid material characterization |
US6789169B2 (en) * | 2001-10-04 | 2004-09-07 | Micron Technology, Inc. | Embedded DRAM cache memory and method having reduced latency |
US20040225830A1 (en) * | 2003-05-06 | 2004-11-11 | Eric Delano | Apparatus and methods for linking a processor and cache |
US7844801B2 (en) * | 2003-07-31 | 2010-11-30 | Intel Corporation | Method and apparatus for affinity-guided speculative helper threads in chip multiprocessors |
US7167934B1 (en) * | 2003-09-09 | 2007-01-23 | Microsoft Corporation | Peripheral device data transfer protocol |
EP1714333A2 (en) * | 2004-01-06 | 2006-10-25 | Cymbet Corporation | Layered barrier structure having one or more definable layers and method |
US7123521B1 (en) * | 2005-04-27 | 2006-10-17 | Micron Technology, Inc. | Random cache read |
DE102006059744A1 (en) * | 2006-12-18 | 2008-06-19 | Qimonda Ag | Semiconductor memory device with redundant memory cells, and method for operating a semiconductor memory device |
US8161243B1 (en) * | 2007-09-28 | 2012-04-17 | Intel Corporation | Address translation caching and I/O cache performance improvement in virtualized environments |
CN103827776B (en) | 2011-09-30 | 2017-11-07 | 英特尔公司 | The active-state power management of power consumption is reduced by PCI high-speed assemblies(ASPM) |
Citations (15)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US4169284A (en) * | 1978-03-07 | 1979-09-25 | International Business Machines Corporation | Cache control for concurrent access |
US4442487A (en) * | 1981-12-31 | 1984-04-10 | International Business Machines Corporation | Three level memory hierarchy using write and share flags |
US5675765A (en) * | 1992-04-29 | 1997-10-07 | Sun Microsystems, Inc. | Cache memory system with independently accessible subdivided cache tag arrays |
US5737569A (en) * | 1993-06-30 | 1998-04-07 | Intel Corporation | Multiport high speed memory having contention arbitration capability without standby delay |
US5829026A (en) * | 1994-11-22 | 1998-10-27 | Monolithic System Technology, Inc. | Method and structure for implementing a cache memory using a DRAM array |
US5895487A (en) * | 1996-11-13 | 1999-04-20 | International Business Machines Corporation | Integrated processing and L2 DRAM cache |
US6006310A (en) * | 1995-09-20 | 1999-12-21 | Micron Electronics, Inc. | Single memory device that functions as a multi-way set associative cache memory |
US6018792A (en) * | 1997-07-02 | 2000-01-25 | Micron Electronics, Inc. | Apparatus for performing a low latency memory read with concurrent snoop |
US6073212A (en) * | 1997-09-30 | 2000-06-06 | Sun Microsystems, Inc. | Reducing bandwidth and areas needed for non-inclusive memory hierarchy by using dual tags |
US6122709A (en) * | 1997-12-19 | 2000-09-19 | Sun Microsystems, Inc. | Cache with reduced tag information storage |
US6128700A (en) * | 1995-05-17 | 2000-10-03 | Monolithic System Technology, Inc. | System utilizing a DRAM array as a next level cache memory and method for operating same |
US6195729B1 (en) * | 1998-02-17 | 2001-02-27 | International Business Machines Corporation | Deallocation with cache update protocol (L2 evictions) |
US6208273B1 (en) * | 1999-01-29 | 2001-03-27 | Interactive Silicon, Inc. | System and method for performing scalable embedded parallel data compression |
US20020004823A1 (en) * | 2000-07-06 | 2002-01-10 | Anderson Marquette John | Multi-processor system verification circuitry |
US6353569B1 (en) * | 1995-08-31 | 2002-03-05 | Hitachi, Ltd. | Semiconductor memory device |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP3348367B2 (en) * | 1995-12-06 | 2002-11-20 | 富士通株式会社 | Multiple access method and multiple access cache memory device |
Also Published As
Publication number | Publication date |
---|---|
US20030014590A1 (en) | 2003-01-16 |
US6789168B2 (en) | 2004-09-07 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
STCB | Information on status: application discontinuation |
Free format text: ABANDONED -- FAILURE TO RESPOND TO AN OFFICE ACTION |