US20070294481A1 - Snoop filter directory mechanism in coherency shared memory system - Google Patents


Publication number
US20070294481A1
Authority
US
United States
Prior art keywords
cache
processor
remote
memory
castout
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Abandoned
Application number
US11/848,960
Inventor
Russell Hoover
Eric Mejdrich
Jon Kriegel
Sandra Woodward
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Core Brands LLC
Original Assignee
Hoover Russell D
Mejdrich Eric O
Kriegel Jon K
Woodward Sandra S
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hoover Russell D, Mejdrich Eric O, Kriegel Jon K, Woodward Sandra S
Priority to US11/848,960
Publication of US20070294481A1
Assigned to BANK OF AMERICA, N.A. reassignment BANK OF AMERICA, N.A. SECURITY AGREEMENT Assignors: AIGIS MECHTRONICS, INC., BROAN-MEXICO HOLDINGS, INC., BROAN-NUTONE LLC, BROAN-NUTONE STORAGE SOLUTIONS LP, CES GROUP, INC., CES INTERNATIONAL LTD., CLEANPAK INTERNATIONAL, INC., ELAN HOME SYSTEMS, L.L.C., GATES THAT OPEN, LLC, GEFEN, LLC, GOVERNAIR CORPORATION, HC INSTALLATIONS, INC., HUNTAIR, INC., INTERNATIONAL ELECTRONICS, LLC, LINEAR LLC, LITE TOUCH, INC., MAGENTA RESEARCH LTD., MAMMOTH-WEBCO, INC., NILES AUDIO CORPORATION, NORDYNE INTERNATIONAL, INC., NORDYNE LLC, NORTEK INTERNATIONAL, INC., NORTEK, INC., NUTONE LLC, OMNIMOUNT SYSTEMS, INC., OPERATOR SPECIALTY COMPANY, INC., PACIFIC ZEPHYR RANGE HOOD INC., PANAMAX LLC, RANGAIRE GP, INC., RANGAIRE LP, INC., SECURE WIRELESS, INC., SPEAKERCRAFT, LLC, TEMTROL, INC., XANTECH LLC, ZEPHYR VENTILATION, LLC
Assigned to AVC GROUP, LLC, THE reassignment AVC GROUP, LLC, THE ASSIGNMENT OF ASSIGNORS INTEREST (SEE DOCUMENT FOR DETAILS). Assignors: XANTECH LLC
Assigned to CORE BRANDS, LLC reassignment CORE BRANDS, LLC MERGER AND CHANGE OF NAME Assignors: AVC GROUP, LLC, THE
Assigned to BROAN-NUTONE LLC, HUNTAIR, INC., LITE TOUCH, INC., ZEPHYR VENTILATION, LLC, BROAN-NUTONE STORAGE SOLUTIONS LP, PACIFIC ZEPHYR RANGE HOOD, INC., NILES AUDIO CORPORATION, OPERATOR SPECIALTY COMPANY, INC., CLEANPAK INTERNATIONAL, INC., SPEAKERCRAFT, LLC, RANGAIRE LP, INC., RANGAIRE GP, INC., PANAMAX LLC, CES INTERNATIONAL LTD., NUTONE LLC, GATES THAT OPEN, LLC, INTERNATIONAL ELECTRONICS, LLC, NORDYNE INTERNATIONAL, INC., NORDYNE LLC, MAGENTA RESEARCH LTD., OMNIMOUNT SYSTEMS, INC., MAMMOTH-WEBCO, INC., CES GROUP, INC., SECURE WIRELESS, INC., LINEAR LLC, XANTECH LLC, NORTEK, INC., GEFEN, LLC, BROAN-MEXICO HOLDINGS, INC., ELAN HOME SYSTEMS, L.L.C., HC INSTALLATIONS, INC., TEMTROL, INC., AIGIS MECHTRONICS, INC., NORTEK INTERNATIONAL, INC., GOVERNAIR CORPORATION reassignment BROAN-NUTONE LLC TERMINATION AND RELEASE OF SECURITY IN PATENTS Assignors: BANK OF AMERICA, N.A.

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F12/00 Accessing, addressing or allocating within memory systems or architectures
    • G06F12/02 Addressing or allocation; Relocation
    • G06F12/08 Addressing or allocation; Relocation in hierarchically structured memory systems, e.g. virtual memory systems
    • G06F12/0802 Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches
    • G06F12/0806 Multiuser, multiprocessor or multiprocessing cache systems
    • G06F12/0815 Cache consistency protocols
    • G06F12/0831 Cache consistency protocols using a bus scheme, e.g. with bus monitoring or watching means
    • G06F12/0835 Cache consistency protocols using a bus scheme, e.g. with bus monitoring or watching means for main memory peripheral accesses (e.g. I/O or DMA)
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F12/00 Accessing, addressing or allocating within memory systems or architectures
    • G06F12/02 Addressing or allocation; Relocation
    • G06F12/08 Addressing or allocation; Relocation in hierarchically structured memory systems, e.g. virtual memory systems
    • G06F12/0802 Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches
    • G06F12/0806 Multiuser, multiprocessor or multiprocessing cache systems
    • G06F12/0815 Cache consistency protocols
    • G06F12/0817 Cache consistency protocols using directory methods
    • G06F12/0822 Copy directories
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F2212/00 Indexing scheme relating to accessing, addressing or allocation within memory systems or architectures
    • G06F2212/30 Providing cache or TLB in specific location of a processing system
    • G06F2212/302 In image processor or graphics adapter

Definitions

  • This application generally relates to data processing systems and, more particularly, to systems in which multiple processing devices may access the same shared data stored in memory.
  • a processor has one or more caches to provide fast access to data (including instructions) stored in relatively slow (by comparison to the cache) external main memory.
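  • In an effort to maintain coherency, other devices on the system (e.g., a graphics processing unit, or GPU) may include some type of coherency or “snoop” logic to determine if a copy of data from a desired memory location is held in the processor cache by sending commands (snoop requests) to a processor cache directory.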
  • This snoop logic is used to determine if desired data is contained in the processor cache and if it is the most recent (modified) copy, typically by querying the processor cache directory. If so, in order to work with the latest copy of the data, the device may request ownership of the modified copy stored in a processor cache line.
  • devices requesting data do not know ahead of time whether the data is in a processor cache. As a result, each device must query (snoop) the processor cache directory for every memory location that it wishes to access from main memory to make sure that proper data coherency is maintained, which can be very expensive in terms of both command latency and microprocessor bus bandwidth.
  • Embodiments of the present invention generally provide methods and apparatus that may be utilized to maintain coherency of data accessed by a remote device that may reside in a cache of a processor.
  • One embodiment provides a method of maintaining coherency of data accessed by a remote device.
  • the method generally includes maintaining, on the remote device, a remote cache directory indicative of memory locations residing in a cache on a processor which shares access to some portion of a memory device and a castout buffer indicating cache lines that have been or will be castout from the processor cache.
  • Memory requests issued at the remote device may be routed to the memory device or the processor cache, depending on information contained in the remote cache directory and castout buffer.
  • Another embodiment provides a method of maintaining coherency of data accessed by a remote device.
  • the method generally includes maintaining, on the remote device, a remote cache directory indicative of memory locations residing in a cache on a processor which shares access to some portion of a memory device.
  • a memory request issued at the remote device may be routed to the processor cache if an address targeted by the memory request matches an entry in the remote cache directory.
  • An entry in an outstanding transaction buffer residing on the remote device may be created, the entry containing the address targeted by the memory request routed to the processor cache.
  • Another embodiment provides a device configured to access data stored in memory and cacheable by a processor.
  • the device generally includes one or more processing cores, a remote cache directory indicative of contents of a cache residing on the processor, a castout buffer indicating cache lines that have been or will be castout from the processor cache, and coherency logic.
  • the coherency logic is generally configured to receive cache coherency information indicative of changes to the contents of the processor cache sent by the processor in bus transactions and update the cache directory and castout buffer based on the cache coherency information.
  • the processor generally includes a cache for storing data accessed from external memory, a cache directory with entries indicating which memory locations are stored in cache lines of the cache and corresponding coherency states thereof, and control logic configured to detect internal bus transactions indicating the allocation and de-allocation of cache lines and, in response, generate bus transactions, each containing cache coherency information indicating a cache line that has been allocated or de-allocated.
  • the remote device generally includes a remote cache directory indicative of contents of the cache residing on the processor, a castout buffer indicating cache lines that have been or will be castout from the processor cache, and coherency logic configured to update the remote cache directory, based on cache coherency information contained in the external bus transactions generated by the processor control logic, to reflect allocated and de-allocated cache lines of the processor cache.
  • FIG. 1 illustrates an exemplary system in accordance with embodiments of the present invention
  • FIG. 2 illustrates an exemplary coherency (snoop) logic configuration, in accordance with embodiments of the present invention
  • FIG. 3 is a flow diagram of exemplary operations for maintaining a remote cache directory and castout buffer, in accordance with embodiments of the present invention
  • FIGS. 4A and 4B illustrate exemplary bits/signals used for enhanced bus transactions used to maintain a remote cache directory, in accordance with embodiments of the present invention
  • FIG. 5 is a flow diagram of exemplary operations for routing remote device memory access requests, in accordance with embodiments of the present invention.
  • FIGS. 6A-6C illustrate exemplary data path diagrams for remote device memory access requests, in accordance with embodiments of the present invention
  • FIG. 7 is a flow diagram of exemplary operations for routing remote device memory access requests, in accordance with embodiments of the present invention.
  • Embodiments of the present invention generally provide methods and apparatus that may be utilized to maintain coherency of data accessed by both a processor and a remote device.
  • various mechanisms such as a remote cache directory, castout buffer, and/or outstanding transaction buffer may be utilized by the remote device to track the state of processor cache lines that may hold data targeted by requests initiated by the remote device. Based on the content of these mechanisms, only those requests that target cache lines indicated to be valid in the processor cache may be routed to the processor, thus conserving bus bandwidth. Other requests targeting data that is not in the processor cache may be routed directly to memory, thus reducing overall latency.
  • The term cache coherency refers to the generally desirable property that accessing a copy of data (a cache line) from a cache gives the same value as the underlying data, even when the data was modified by a different process after the data was first cached. Maintaining cache coherency is important for consistent operation of multiprocessor systems in which one or more processors have a non-shared cache used to cache portions of a memory area shared by multiple processors.
  • The term virtual channel generally refers to a data path that carries request and/or response information between components. Each virtual channel typically utilizes a different buffer, with a virtual channel number indicating which buffer a packet transferred on that virtual channel will use.
  • Virtual channels are referred to as virtual because, while multiple virtual channels may utilize a single common physical interface (e.g., a bus), they appear and act as separate channels. Virtual channels may be implemented using various logic components (e.g., switches, multiplexors, etc.) utilized to route data, received over the common bus, from different sources to different destinations, in effect, as if there were separate physical channels between each source and destination.
  • An advantage to utilizing virtual channels is that various processes utilizing the data streamed by the virtual channels may operate in parallel which may improve system performance (e.g., while one process is receiving/sending data over the bus, another process may be manipulating data and not need the bus).
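The virtual channel idea described above can be illustrated with a minimal sketch. It assumes a single shared queue standing in for the physical bus and one receive buffer per channel; the class and method names (`SharedLink`, `send`, `deliver_all`, `receive`) are hypothetical and do not come from the application.

```python
from collections import deque


class SharedLink:
    """Hypothetical model: one physical interface, several virtual channels."""

    def __init__(self, num_channels):
        # Each virtual channel gets its own receive buffer.
        self.buffers = [deque() for _ in range(num_channels)]
        # Packets in flight on the single shared physical bus.
        self.wire = deque()

    def send(self, vc, payload):
        # The virtual channel number travels with the packet.
        self.wire.append((vc, payload))

    def deliver_all(self):
        # Demultiplex: the channel number selects the destination buffer.
        while self.wire:
            vc, payload = self.wire.popleft()
            self.buffers[vc].append(payload)

    def receive(self, vc):
        # Return the oldest packet on this channel, or None if it is empty.
        return self.buffers[vc].popleft() if self.buffers[vc] else None
```

Because each channel drains from its own buffer, a consumer of channel 1 is not blocked by packets queued ahead of it on channel 0, even though both shared the same wire.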
  • FIG. 1 schematically illustrates an exemplary multi-processor system 100 in which a processor (illustratively, a CPU 102 ) and a remote processor device (illustratively, a GPU 104 ) both access a shared main memory 138 .
  • main memory 138 is near the GPU 104 and is accessed by a memory controller 130 which, for some embodiments, is integrated with (i.e., located on) the GPU 104 .
  • the system 100 is merely one example of a type of system in which embodiments of the present invention may be utilized to maintain coherency of data accessed by multiple devices.
  • the CPU 102 and the GPU 104 communicate via a front side bus (FSB) 106 .
  • the CPU 102 illustratively includes a plurality of processor cores 108 , 110 , and 112 that perform tasks under the control of software.
  • the processor cores may each include any number of different types of function units including, but not limited to, arithmetic logic units (ALUs), floating point units (FPUs), and single instruction multiple data (SIMD) units. Examples of CPUs utilizing multiple processor cores include the PowerPC line of CPUs, available from IBM.
  • Each individual core may have a corresponding L1 cache 160 and may communicate over a common bus 116 that connects to a core bus interface 118 .
  • the individual cores may share an L2 (secondary) cache memory 114 .
  • the L2 cache 114 may include a cache array 111 , cache directory 115 , and cache controller 113 .
  • the L2 cache 114 may be an associative cache and the cache directory 115 may include entries indicating addresses of cache lines stored in each “way” of an associative set, as well as an indication of a coherency state of each line.
  • the L2 cache 114 may be operated in accordance with the MESI protocol (supporting Modified, Exclusive, Shared, and Invalid states), or some variant thereof.
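A set-associative directory with per-line MESI state might be modeled as in the sketch below. The 128-byte line size, the set-index/tag split, and all names (`State`, `CacheDirectory`, `install`, `lookup`) are assumptions made for illustration; the application does not specify them.

```python
from enum import Enum


class State(Enum):
    MODIFIED = "M"
    EXCLUSIVE = "E"
    SHARED = "S"
    INVALID = "I"


class CacheDirectory:
    """Hypothetical directory: one (tag, state) entry per way of each set."""

    def __init__(self, num_sets, num_ways, line_bytes=128):
        self.num_sets = num_sets
        self.line_bytes = line_bytes  # assumed line size
        self.entries = [[(None, State.INVALID)] * num_ways
                        for _ in range(num_sets)]

    def _split(self, address):
        # Assumed mapping: low line-number bits pick the set, the rest is the tag.
        line = address // self.line_bytes
        return line % self.num_sets, line // self.num_sets

    def install(self, address, way, state):
        set_index, tag = self._split(address)
        self.entries[set_index][way] = (tag, state)

    def lookup(self, address):
        # Return (way, state) on a hit in a non-invalid state, else None.
        set_index, tag = self._split(address)
        for way, (entry_tag, state) in enumerate(self.entries[set_index]):
            if entry_tag == tag and state is not State.INVALID:
                return way, state
        return None
```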
  • the core bus interface 118 communicates with the L2 cache memory 114 , and carries data transferred into and out of the CPU 102 via the FSB 106 , through a front-side bus interface 120 .
  • the GPU 104 also includes a front-side bus interface 124 that connects to the FSB 106 and that is used to pass information between the GPU 104 and the CPU 102 .
  • the GPU 104 is a device capable of processing large amounts of data at very high speed using sophisticated data structures and processing techniques. To do so, the GPU 104 includes at least one graphics core 128 that processes data obtained from the CPU 102 or from main memory 138 via the memory controller 130 .
  • the memory controller 130 connects to the graphics front-side bus interface 124 via a bus interface unit (BIU) 123 . Data passes between the graphics core 128 and the memory controller 130 over a wide parallel bus 132 .
  • the main memory 138 typically stores operating routines, application programs, and corresponding data that may be accessed by the CPU 102 and GPU 104 .
  • the GPU 104 may also include an I/O port 140 that connects to an I/O driver (master device) 142 .
  • the I/O driver 142 passes data to and from any number of external devices, such as a mouse, video joy stick, computer board, and display, via an I/O slave device 141 .
  • the I/O driver 142 properly formats data and passes data to and from the graphic front-side bus interface 124 . That data is then passed to or from the CPU 102 or is used in the GPU 104 , possibly being stored in the main memory 138 by way of the memory controller 130 .
  • the graphics cores 128 , memory controller 130 , and I/O driver 142 may all communicate with the BIU 123 that provides access to the FSB via the GPU's FSB interface 124 .
  • In systems such as system 100, in which one or more remote devices request access to data for memory locations that are cached by a central processor, the remote devices often utilize some type of coherency logic to monitor (snoop) the contents of the processor cache. Typically, this snoop logic interrogates the processor cache directory for entries for every memory location the remote device wishes to access. As a result, conventional cache snooping may result in substantial latency and consume a significant amount of processor bus bandwidth.
  • embodiments of the present invention may utilize coherency logic 127 on the remote device (in this example, the GPU 104 ), which may include a snoop filter 125 , a castout buffer 121 , and an outstanding transaction buffer 129 .
  • FIG. 2 illustrates a relational view of one system configuration utilizing these components to maintain coherency.
  • the coherency logic 127 may be generally configured to route requests received from a GPU core 128 (or I/O master) to the CPU 102 or directly to memory, depending on the information contained in the snoop filter 125 , castout buffer 121 , and outstanding transaction buffer 129 .
  • the castout buffer 121 may be used to track the addresses of cache lines for which data is expected to be returned (in some cases castout) by the CPU 102 .
  • the outstanding transaction buffer 129 may be used to track addresses targeted by “in-flight” requests routed from the GPU 104 to the CPU 102 , indicating data for these addresses may be expected.
  • the snoop filter 125 may maintain a remote cache directory 126 which provides, at the GPU 104 , an indication of entries in the L2 cache directory 115 on the CPU 102 . Accordingly, when a remote device attempts to access data in a memory location, the snoop filter 125 may check the remote cache directory 126 to determine if a modified copy of the data is cached at the CPU 102 without having to send bus commands to the CPU 102 . As a result, the snoop filter 125 may “filter out” requests to access data that is not cached in the CPU 102 and route those requests directly to memory 138 , via the memory controller 130 , thus reducing latency and conserving bus bandwidth.
  • the snoop filter 125 may operate in concert with a cache controller 113 which may generate enhanced bus transactions containing cache coherency information used by the snoop filter 125 to update the remote cache directory 126 to reflect changes to the CPU cache directory 115 .
  • the CPU 102 may include various components (that interface with the L2 cache controller and bus interface) to support system coherency and respond to requests received from the GPU 104 .
  • Such components may include memory agents 202 and 206 to route requests to and receive responses from, respectively, memory 138 , as well as a GPU agent 204 to route requests to and receive responses from the GPU cores 128 (or I/O masters).
  • These agents may communicate with the GPU 104 via virtual channels 210 established on the FSB.
  • the virtual channels 210 include “upbound” virtual channels 216 and 218 to handle requests and responses, respectively, from the GPU 104 and “downbound” virtual channels 212 and 214 to handle requests and responses, respectively, from the CPU 102 . Data paths through the virtual channels 210 for different transactions under different circumstances are described in detail below, with reference to FIGS. 6A-6C .
  • the snoop filter 125 may monitor requests issued from the CPU 102 in an effort to ensure the remote cache directory 126 mirrors the CPU cache directory 115 , and accurately reflects the contents and coherency state of the CPU cache 114 .
  • FIG. 3 illustrates exemplary operations 300 that may be performed (e.g., by the snoop filter 125 ) to update the remote cache directory 126 based on requests issued by the CPU 102 indicating a new cache line is being allocated in the L2 cache 114 .
  • the operations 300 begin, at step 302 , by receiving a (read allocation) request from the CPU 102 .
  • the request may be an enhanced bus transaction containing additional coherency information allowing the snoop filter to update the remote cache directory 126 , as described in the commonly owned U.S. patent application entitled “Enhanced Bus Transactions for Efficient Support of a Remote Cache Directory Copy” (Attorney Docket No. ROC920040036US1).
  • This information may include an indication that an allocation or de-allocation transaction occurred and, if so, a particular cache line (e.g., a “way” within an associative set) that is being replaced.
  • the information may also include an indication of whether an aging castout was or will be generated (i.e., resulting in modified data being written back to memory).
  • These bus transactions may be considered enhanced because this additional coherency information may be added to information already included in a bus transaction occurring naturally. For example, a cache line allocation may naturally precede a bus transaction to read requested data to fill the allocated cache line.
  • a valid bit of the old entry in the remote cache directory 126 (being replaced by the new entry) is examined. If the old entry is invalid, the new entry is allocated in the remote cache directory 126 , at step 306 . If the old entry is valid, however, a bit provided in the allocation request is examined to determine if the cached entry being replaced is to be castout, at step 308 . If so, the GPU 104 can expect this data to be transferred (castout) from the CPU, and the old entry is copied to the castout buffer 121 , at step 310 . Thus, when the GPU 104 requests data, the castout buffer 121 may be examined to determine if a castout is pending (as shown in FIG. 5 ).
  • a castout (or other transfer) of the cacheline may still be pending, if the cacheline was targeted by an outstanding read or flush issued by the GPU 104 .
  • the old entry (being replaced by the new allocation) may be compared against entries in the read/flush outstanding buffer, at step 312 .
  • a match indicates there is an outstanding read/flush request targeting the cacheline and, hence, the old entry is copied into the castout buffer 121 , at step 310 , prior to allocating the new entry in the remote directory (step 306 ).
  • a mismatch indicates there is no such outstanding request, and the new entry is allocated, without copying the old entry into the castout buffer 121 .
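The allocation flow of FIG. 3 described above can be sketched as follows. The data layout is an assumption for illustration only: the remote directory is a list of sets, each a list of `(tag, valid)` pairs, and the castout buffer and outstanding read/flush buffer are sets of `(set_index, tag)` pairs. The function name `handle_allocation` is hypothetical.

```python
def handle_allocation(directory, castout_buffer, outstanding,
                      set_index, way, new_tag, aging):
    """Update a remote directory when the CPU allocates a new cache line.

    directory:      list of sets, each a list of (tag, valid) pairs (assumed layout)
    castout_buffer: set of (set_index, tag) lines expected back from the CPU
    outstanding:    set of (set_index, tag) lines targeted by in-flight reads/flushes
    aging:          castout bit carried in the allocation request
    """
    old_tag, old_valid = directory[set_index][way]
    if old_valid:                                   # step 304: old entry valid?
        if aging:                                   # step 308: castout indicated
            castout_buffer.add((set_index, old_tag))        # step 310
        elif (set_index, old_tag) in outstanding:   # step 312: read/flush pending
            castout_buffer.add((set_index, old_tag))        # step 310
    directory[set_index][way] = (new_tag, True)     # step 306: allocate new entry
```

The old entry is saved to the castout buffer only when a transfer of that line can still arrive; otherwise it is simply overwritten.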
  • FIGS. 4A and 4B summarize the type of coherency information provided upon allocation and de-allocation, respectively.
  • the coherency information may include a valid bit (rc_way_alloc_v) indicating whether or not a new entry is being allocated, set_id bits (rc_way_alloc[0:N]) indicating the way of the cache line being allocated, and an aging bit (rc_aging) indicating whether an aging castout (e.g., of a modified cache line) is being issued. If the valid bit is inactive, the remaining bits may be ignored, since a new entry is not being allocated (e.g., a cache line for a targeted memory location already exists in L2 cache).
  • the coherency information may be sent with each such transaction, even when a new line is not being allocated, to avoid having separate transactions for transferring coherency information.
  • the GPU 104 may quickly check the valid bit to determine if a new cache line is being allocated.
  • the aging bit set indicates an aging castout is being issued, for example, since the coherency state of the aging L2 cache line is modified (M).
  • the aging bit cleared indicates that the entry being replaced is not being castout, for example, because the aging L2 entry was invalid (I), shared (S), or exclusive (E), and can be overwritten with this new allocation.
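How a receiver might interpret the allocation bits can be sketched as below. The field names follow the text (`rc_way_alloc_v`, `rc_way_alloc`, `rc_aging`), but the packed bit layout, the 3-bit way width, and the helper names `unpack` and `interpret` are assumptions for illustration; the application does not define an encoding.

```python
from dataclasses import dataclass

WAY_BITS = 3  # assumed width of rc_way_alloc (an 8-way associative cache)


@dataclass(frozen=True)
class AllocInfo:
    rc_way_alloc_v: bool  # valid: a new entry is being allocated
    rc_way_alloc: int     # way of the cache line being allocated
    rc_aging: bool        # an aging castout is being issued


def unpack(bits):
    """Assumed layout, MSB to LSB: valid | way[WAY_BITS] | aging."""
    aging = bool(bits & 1)
    way = (bits >> 1) & ((1 << WAY_BITS) - 1)
    valid = bool((bits >> (1 + WAY_BITS)) & 1)
    return AllocInfo(valid, way, aging)


def interpret(info):
    # If the valid bit is inactive, the remaining bits are ignored.
    if not info.rc_way_alloc_v:
        return ("no_allocation", None)
    # Aging set: the replaced line was modified, so a castout is expected.
    if info.rc_aging:
        return ("allocate_expect_castout", info.rc_way_alloc)
    # Aging cleared: the replaced line was I, S, or E and is simply overwritten.
    return ("allocate_overwrite", info.rc_way_alloc)
```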
  • the remote cache directory 126 may indicate more valid cache lines are in the L2 cache 114 than are indicated by the CPU cache directory 115 (e.g., the valid cache lines indicated by the remote cache directory may represent a superset of the actual valid cache lines). This is because cache lines in the L2 cache 114 may transition from Exclusive (E) or Shared (S) to Invalid (I) without any corresponding bus operations to signal these transitions. While this may result in occasional additional requests sent from the GPU 104 to the CPU 102 (the CPU 102 can respond that its copy is invalid), it is also a safe approach aimed at ensuring the CPU is always checked if the remote cache directory 126 indicates requested data is cached. As will be described in greater detail below, these requests may be “reflected” back to the GPU to be routed to memory.
  • When L2 cache lines are de-allocated (e.g., due to a write with kill), enhanced bus transactions containing coherency information related to the de-allocation may also be generated.
  • This coherency information may include an indication that an entry is being de-allocated and the set_id (way) indicating which cache line within an associative set is being de-allocated.
  • This information may be generated by “push snoop logic” in the L2 cache 114 and carried in a set of control bits/signals, as with the previously described coherency information transmitted upon cache line allocation.
  • This coherency information will be used by the GPU snoop filter 125 to correctly invalidate the corresponding entry in the (L2 superset) remote cache directory 126 . As illustrated in FIG. 4B , the coherency information related to the de-allocation may be carried in similar bits/signals (valid and set_id) to those related to allocation shown in FIG. 4A . As the de-allocation assumes a castout, there may be no need for an aging bit.
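The de-allocation handling described above reduces to clearing one entry's valid bit. The sketch below assumes the remote directory is a list of sets, each a list of `(tag, valid)` pairs; that layout and the name `handle_deallocation` are hypothetical.

```python
def handle_deallocation(directory, set_index, dealloc_valid, way):
    """Invalidate the remote directory entry named by a de-allocation.

    directory:     list of sets, each a list of (tag, valid) pairs (assumed layout)
    dealloc_valid: valid bit of the de-allocation information
    way:           set_id (way) of the line being de-allocated
    """
    if not dealloc_valid:
        return  # nothing is being de-allocated; ignore the way bits
    tag, _ = directory[set_index][way]
    directory[set_index][way] = (tag, False)
```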
  • FIG. 5 is a flow diagram of exemplary operations 500 for routing remote device memory access requests based on information maintained in the remote cache (snoop filter) directory 126 and castout buffer 121 , in accordance with embodiments of the present invention. While the operations are described with reference to requests issued by a GPU (core), it should be understood the same or similar operations may be performed to route requests from any requesting entity.
  • the operations 500 begin, at step 502 , by receiving a request from the GPU 104 .
  • the snoop filter directory 126 is checked in an effort to determine if a cache line containing data targeted by the request is in the L2 cache 114 of the CPU 102 .
  • A hit (an entry with a matching address and a valid state) indicates a targeted cache line is in the L2 cache 114 , while a miss indicates one is not.
  • the castout buffer is checked, at step 516 , for an indication a castout of a targeted cache line is pending. If a castout is pending, there is a risk that stale data might be read from memory if the request is issued before the modified data is written back to memory, so the GPU waits for the pending castout, at step 520 .
  • the request is routed to memory, at step 518 .
  • the request may be issued directly against memory, without having to send any time consuming snoop requests to the CPU.
  • This scenario is illustrated in the exemplary data path diagram of FIG. 6A , in which various events are enumerated (1-4).
  • a GPU core issues a request (1).
  • the request misses in the snoop filter directory 126 and castout buffer 121 (2), indicating a targeted cache line does not presently reside in the L2 cache 114 . Accordingly, the request is routed to memory, via the memory controller 130 (3).
  • the memory controller 130 returns the requested data to the GPU core (4).
  • a check of the snoop filter directory, at step 504 , resulting in a hit indicates a cache line containing data targeted by the request is in the L2 cache 114 .
  • the coherency logic 127 may send a request to tell the CPU 102 to invalidate its cached copy of the targeted memory location (if the copy was not modified) or cast out its copy (if it was modified). To track these pending operations, and handle subsequent accesses targeting the same memory locations, a copy of the targeted address is stored in the read/write outstanding buffer 129 , at step 506 . At step 508 , a request to invalidate/castout its copy is routed to the CPU 102 .
  • the CPU may respond with data (if castout) or at least some type of response indicating the request was processed. Therefore, at step 510 , the GPU 104 may receive response data or a reflected read (described in greater detail below). At step 512 , the entry from the read/write outstanding buffer 129 may be removed.
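The routing decision of FIG. 5 can be sketched as below. The remote directory, castout buffer, and outstanding buffer are modeled as plain sets of cache line addresses, which is an assumption for brevity, as are the function names and the string return values.

```python
def route_request(line, remote_directory, castout_buffer, outstanding):
    """Decide where a GPU request for cache line `line` should go.

    All three structures are modeled as sets of line addresses (assumed).
    """
    if line in remote_directory:      # step 504: hit in the snoop filter directory
        outstanding.add(line)         # step 506: track the in-flight request
        return "cpu"                  # step 508: invalidate/castout request to CPU
    if line in castout_buffer:        # step 516: castout pending for this line
        return "wait_for_castout"     # step 520: avoid reading stale memory
    return "memory"                   # step 518: no snoop needed


def complete_cpu_request(line, outstanding):
    # Step 512: a response (data or a reflected read) arrived; stop tracking.
    outstanding.discard(line)
```

Only the first branch consumes bus bandwidth to the CPU; the other two are resolved entirely on the remote device.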
  • Data paths for requests that hit in the snoop filter directory 126 are illustrated in FIGS. 6B and 6C , in which various events are again enumerated.
  • FIG. 6B illustrates the routing of a request for data that is cached in the L2 114 in a valid state, and returned from the CPU directly to a requesting GPU core.
  • a GPU core issues a request (1).
  • the request hits in the snoop filter directory 126 , indicating a targeted cache line resides in the L2 cache 114 . Accordingly, the request is routed to the L2 114 (3).
  • the L2 114 logic may respond by sending a response with the requested data directly to the GPU core (4).
  • This approach may reduce latency by eliminating the need for the GPU core to generate a separate response to read the requested memory.
  • the data may be marked as dirty in the response, causing the GPU 104 to generate a write to memory.
  • the GPU 104 may access a special set of registers, referred to as a lock set, that does not require backing to memory (e.g., the GPU reads, but never writes to these registers).
  • FIG. 6C illustrates the routing of a request for data that results in a hit with the remote cache directory 126 but the data is not cached in the L2 in a valid state.
  • the L2 cache may return NULL data, causing reflection logic 208 in the CPU 102 to respond with what may be referred to as “reflected” read (or write) requests that are, in effect, requests reflected back to the GPU 104 to be routed to the memory controller 130 for execution against memory (e.g., on behalf of the requesting GPU core 128 ).
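The two outcomes of a request that hits in the remote directory (data returned from the L2, or a reflected read) might be modeled as in this sketch. The L2 cache and main memory are plain dictionaries keyed by line address, and `None` stands in for the NULL data case; all names here are hypothetical.

```python
def cpu_handle_read(line, l2_cache):
    """CPU side: answer from the L2 if the line is valid, else reflect."""
    data = l2_cache.get(line)  # None models the NULL data case
    if data is not None:
        return ("data", data)
    # The line was silently invalidated: reflect the request back to the GPU.
    return ("reflected_read", line)


def gpu_read(line, l2_cache, memory):
    """GPU side: use CPU data directly, or execute the reflected read."""
    kind, value = cpu_handle_read(line, l2_cache)
    if kind == "data":
        return value
    # Reflected read: route to the memory controller on the core's behalf.
    return memory[value]
```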
  • FIG. 7 is a flow diagram of exemplary operations 700 for updating the snoop filter directory 126 , castout buffer 121 , and/or read/write outstanding buffer 129 , in response to certain requests received from the CPU 102 .
  • the operations 700 begin, at step 702 , by receiving such a request from the CPU 102 .
  • requests that cause a change to these coherency mechanisms may include a write with kill, or a reflected read or write.
  • the entry that resulted in the hit is invalidated, at step 706 .
  • the castout buffer 121 may be checked in parallel, at step 708 , with the remote cache directory 126 .
  • a hit also results in the corresponding entry being invalidated, at step 706 . If the request received from the CPU is a reflected read or write, as determined at step 710 , the corresponding entry is removed from the outstanding transaction buffer 129 , at step 712 .
  • Removing the entry (that was created when the coherency logic routed the request resulting in the reflected read/write request to the L2, per step 506 of FIG. 5 ) is done because the request is no longer “in flight.”
  • the request is then routed to memory, at step 714 .
  • Coherency support structures e.g., a remote cache directory, castout buffer, and outstanding transaction buffer
  • a remote cache directory e.g., a remote cache directory, castout buffer, and outstanding transaction buffer
  • the mechanisms may be checked at the remote device to determine whether to route a memory request to the L2 cache or directly to memory, which may result in significant reductions in latency.
  • These mechanisms may be updated by monitoring memory access requests issued by the processor, as well as the remote device, avoiding unnecessary snoop requests.

Abstract

Methods and apparatus that may be utilized to maintain coherency of data accessed by both a processor and a remote device are provided. Various mechanisms, such as a remote cache directory, castout buffer, and/or outstanding transaction buffer may be utilized by the remote device to track the state of processor cache lines that may hold data targeted by requests initiated by the remote device. Based on the content of these mechanisms, requests targeting data that is not in the processor cache may be routed directly to memory, thus reducing overall latency.

Description

    CROSS-REFERENCE TO RELATED APPLICATIONS
  • This application is a continuation application of co-pending U.S. patent application Ser. No. 10/961,749 filed Oct. 8, 2004, which is herein incorporated by reference.
  • BACKGROUND OF THE INVENTION
  • 1. Field of the Invention
  • This application generally relates to data processing systems and, more particularly, to systems in which multiple processing devices may access the same shared data stored in memory.
  • 2. Description of the Related Art
  • In a multiprocessor system, or any type of system that allows more than one device to request and update blocks of shared data concurrently, it is important that some mechanism exists to keep the data coherent (i.e., to ensure that each copy of data accessed by any device is the most current copy). In many such systems, a processor has one or more caches to provide fast access to data (including instructions) stored in relatively slow (by comparison to the cache) external main memory. In an effort to maintain coherency, other devices on the system (e.g., a graphics processing unit-GPU) may include some type of coherency or “snoop” logic to determine if a copy of data from a desired memory location is held in the processor cache by sending commands (snoop requests) to a processor cache directory.
  • This snoop logic is used to determine if desired data is contained in the processor cache and if it is the most recent (modified) copy, typically by querying the processor cache directory. If so, in order to work with the latest copy of the data, the device may request ownership of the modified copy stored in a processor cache line. In a conventional coherent system, devices requesting data do not know ahead of time whether the data is in a processor cache. As a result, each device must query (snoop) the processor cache directory for every memory location that it wishes to access from main memory to make sure that proper data coherency is maintained, which can be very expensive in terms of both command latency and microprocessor bus bandwidth.
  • Accordingly, what is needed is an efficient method and system which would reduce the amount of latency associated with interfacing with (snooping on) a processor cache.
  • SUMMARY OF THE INVENTION
  • Embodiments of the present invention generally provide methods and apparatus that may be utilized to maintain coherency of data accessed by a remote device that may reside in a cache of a processor.
  • One embodiment provides a method of maintaining coherency of data accessed by a remote device. The method generally includes maintaining, on the remote device, a remote cache directory indicative of memory locations residing in a cache on a processor which shares access to some portion of a memory device and a castout buffer indicating cache lines that have been or will be castout from the processor cache. Memory requests issued at the remote device may be routed to the memory device or the processor cache, depending on information contained in the remote cache directory and castout buffer.
  • Another embodiment provides a method of maintaining coherency of data accessed by a remote device. The method generally includes maintaining, on the remote device, a remote cache directory indicative of memory locations residing in a cache on a processor which shares access to some portion of a memory device. A memory request issued at the remote device may be routed to the processor cache if an address targeted by the memory request matches an entry in the remote cache directory. An entry in an outstanding transaction buffer residing on the remote device may be created, the entry containing the address targeted by the memory request routed to the processor cache.
  • Another embodiment provides a device configured to access data stored in memory and cacheable by a processor. The device generally includes one or more processing cores, a remote cache directory indicative of contents of a cache residing on the processor, a castout buffer indicating cache lines that have been or will be castout from the processor cache, and coherency logic. The coherency logic is generally configured to receive cache coherency information indicative of changes to the contents of the processor cache sent by the processor in bus transactions and update the cache directory and castout buffer based on the cache coherency information.
  • Another embodiment provides a coherent system generally including a processor and a remote device. The processor generally includes a cache for storing data accessed from external memory, a cache directory with entries indicating which memory locations are stored in cache lines of the cache and corresponding coherency states thereof, and control logic configured to detect internal bus transactions indicating the allocation and de-allocation of cache lines and, in response, generate bus transactions, each containing cache coherency information indicating a cache line that has been allocated or de-allocated. The remote device generally includes a remote cache directory indicative of contents of the cache residing on the processor, a castout buffer indicating cache lines that have been or will be castout from the processor cache, and coherency logic configured to update the remote cache directory, based on cache coherency information contained in the external bus transactions generated by the processor control logic, to reflect allocated and de-allocated cache lines of the processor cache.
  • BRIEF DESCRIPTION OF THE DRAWINGS
  • So that the manner in which the above recited features, advantages and objects of the present invention are attained and can be understood in detail, a more particular description of the invention, briefly summarized above, may be had by reference to the embodiments thereof which are illustrated in the appended drawings.
  • It is to be noted, however, that the appended drawings illustrate only typical embodiments of this invention and are therefore not to be considered limiting of its scope, for the invention may admit to other equally effective embodiments.
  • FIG. 1 illustrates an exemplary system in accordance with embodiments of the present invention;
  • FIG. 2 illustrates an exemplary coherency (snoop) logic configuration, in accordance with embodiments of the present invention;
  • FIG. 3 is a flow diagram of exemplary operations for maintaining a remote cache directory and castout buffer, in accordance with embodiments of the present invention;
  • FIGS. 4A and 4B illustrate exemplary bits/signals used for enhanced bus transactions used to maintain a remote cache directory, in accordance with embodiments of the present invention;
  • FIG. 5 is a flow diagram of exemplary operations for routing remote device memory access requests, in accordance with embodiments of the present invention;
  • FIGS. 6A-6C illustrate exemplary data path diagrams for remote device memory access requests, in accordance with embodiments of the present invention;
  • FIG. 7 is a flow diagram of exemplary operations for updating a remote cache directory, castout buffer, and outstanding transaction buffer, in accordance with embodiments of the present invention.
  • DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS
  • Embodiments of the present invention generally provide methods and apparatus that may be utilized to maintain coherency of data accessed by both a processor and a remote device. For some embodiments, various mechanisms, such as a remote cache directory, castout buffer, and/or outstanding transaction buffer may be utilized by the remote device to track the state of processor cache lines that may hold data targeted by requests initiated by the remote device. Based on the content of these mechanisms, only those requests that target cache lines indicated to be valid in the processor cache may be routed to the processor, thus conserving bus bandwidth. Other requests targeting data that is not in the processor cache may be routed directly to memory, thus reducing overall latency.
  • As used herein, the term cache coherency refers to the generally desirable property that accessing a copy of data (a cache line) from a cache gives the same value as the underlying data, even when the data was modified by a different process after the data was first cached. Maintaining cache coherency is important for consistent operation of multiprocessor systems in which one or more processors have a non-shared cache used to cache portions of a memory area shared by multiple processors. As used herein, the term virtual channel generally refers to a data path that carries request and/or response information between components. Each virtual channel typically utilizes a different buffer, with a virtual channel number indicating which buffer a packet transferred on that virtual channel will use. Virtual channels are referred to as virtual because, while multiple virtual channels may utilize a single common physical interface (e.g., a bus), they appear and act as separate channels. Virtual channels may be implemented using various logic components (e.g., switches, multiplexors, etc.) utilized to route data, received over the common bus, from different sources to different destinations, in effect, as if there were separate physical channels between each source and destination. An advantage to utilizing virtual channels is that various processes utilizing the data streamed by the virtual channels may operate in parallel, which may improve system performance (e.g., while one process is receiving/sending data over the bus, another process may be manipulating data and not need the bus).
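  • For illustration only, the virtual channel concept described above may be modeled in software roughly as follows. This is a minimal sketch, not part of the described hardware: the class name VirtualChannelBus and its methods are hypothetical, and the per-channel buffers are modeled as simple queues selected by virtual channel number.

```python
from collections import deque


class VirtualChannelBus:
    """Hypothetical model: one shared physical link, several virtual channels."""

    def __init__(self, num_channels):
        # One buffer per virtual channel; the channel number selects the buffer.
        self.buffers = [deque() for _ in range(num_channels)]

    def send(self, channel, payload):
        # Every packet crosses the same link but lands in its channel's buffer.
        self.buffers[channel].append(payload)

    def receive(self, channel):
        # A consumer drains only its own channel, independently of the others.
        buf = self.buffers[channel]
        return buf.popleft() if buf else None


bus = VirtualChannelBus(num_channels=4)
bus.send(0, "request A")
bus.send(1, "response B")
bus.send(0, "request C")
first_on_channel_1 = bus.receive(1)  # not blocked by traffic queued on channel 0
```

  • In this sketch, traffic queued on one channel does not delay a consumer of another channel, which is the property that allows the channels to appear and act as separate paths.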
  • In the following description, reference is made to embodiments of the invention. However, it should be understood that the invention is not limited to specific described embodiments. Instead, any combination of the following features and elements, whether related to different embodiments or not, is contemplated to implement and practice the invention. Furthermore, in various embodiments the invention provides numerous advantages over the prior art. However, although embodiments of the invention may achieve advantages over other possible solutions and/or over the prior art, whether or not a particular advantage is achieved by a given embodiment is not limiting of the invention. Thus, the following aspects, features, embodiments and advantages are merely illustrative and, unless explicitly present, are not considered elements or limitations of the appended claims.
  • An Exemplary System
  • FIG. 1 schematically illustrates an exemplary multi-processor system 100 in which a processor (illustratively, a CPU 102) and a remote processor device (illustratively, a GPU 104) both access a shared main memory 138. In the illustrated embodiment, main memory 138 is near the GPU 104 and is accessed by a memory controller 130 which, for some embodiments, is integrated with (i.e., located on) the GPU 104. The system 100 is merely one example of a type of system in which embodiments of the present invention may be utilized to maintain coherency of data accessed by multiple devices.
  • As shown, the CPU 102 and the GPU 104 communicate via a front side bus (FSB) 106. The CPU 102 illustratively includes a plurality of processor cores 108, 110, and 112 that perform tasks under the control of software. The processor cores may each include any number of different types of functional units including, but not limited to, arithmetic logic units (ALUs), floating point units (FPUs), and single instruction multiple data (SIMD) units. Examples of CPUs utilizing multiple processor cores include the Power PC line of CPUs, available from IBM. Each individual core may have a corresponding L1 cache 160 and may communicate over a common bus 116 that connects to a core bus interface 118. For some embodiments, the individual cores may share an L2 (secondary) cache memory 114.
  • As illustrated, the L2 cache 114 may include a cache array 111, cache directory 115, and cache controller 113. For some embodiments, the L2 cache 114 may be an associative cache and the cache directory 115 may include entries indicating addresses of cache lines stored in each “way” of an associative set, as well as an indication of a coherency state of each line. For some embodiments, the L2 cache 114 may be operated in accordance with the MESI protocol (supporting Modified, Exclusive, Shared, and Invalid states), or some variant thereof. The core bus interface 118 communicates with the L2 cache memory 114, and carries data transferred into and out of the CPU 102 via the FSB 106, through a front-side bus interface 120.
  • The GPU 104 also includes a front-side bus interface 124 that connects to the FSB 106 and that is used to pass information between the GPU 104 and the CPU 102. The GPU 104 is a device capable of processing large amounts of data at very high speed using sophisticated data structures and processing techniques. To do so, the GPU 104 includes at least one graphics core 128 that processes data obtained from the CPU 102 or from main memory 138 via the memory controller 130. The memory controller 130 connects to the graphics front-side bus interface 124 via a bus interface unit (BIU) 123. Data passes between the graphics core 128 and the memory controller 130 over a wide parallel bus 132. The main memory 138 typically stores operating routines, application programs, and corresponding data that may be accessed by the CPU 102 and GPU 104.
  • For some embodiments, the GPU 104 may also include an I/O port 140 that connects to an I/O driver (master device) 142. The I/O driver 142 passes data to and from any number of external devices, such as a mouse, video joy stick, computer board, and display, via an I/O slave device 141. The I/O driver 142 properly formats data and passes data to and from the graphics front-side bus interface 124. That data is then passed to or from the CPU 102 or is used in the GPU 104, possibly being stored in the main memory 138 by way of the memory controller 130. As illustrated, the graphics cores 128, memory controller 130, and I/O driver 142 may all communicate with the BIU 123 that provides access to the FSB via the GPU's FSB interface 124.
  • As previously described, in conventional multi-processor systems, such as system 100, in which one or more remote devices request access to data for memory locations that are cached by a central processor, the remote devices often utilize some type of coherency logic to monitor (snoop) the contents of the processor cache. Typically, this snoop logic interrogates the processor cache directory for entries for every memory location the remote device wishes to access. As a result, conventional cache snooping may result in substantial latency and consume a significant amount of processor bus bandwidth.
  • Snoop Filter Directory Mechanism
  • In an effort to reduce such latency and increase bus bandwidth, embodiments of the present invention may utilize coherency logic 127 on the remote device (in this example, the GPU 104), which may include a snoop filter 125, a castout buffer 121, and an outstanding transaction buffer 129. FIG. 2 illustrates a relational view of one system configuration utilizing these components to maintain coherency. As illustrated, the coherency logic 127 may be generally configured to route requests received from a GPU core 128 (or I/O master) to the CPU 102 or directly to memory, depending on the information contained in the snoop filter 125, castout buffer 121, and outstanding transaction buffer 129.
  • As will be described in greater detail below, the castout buffer 121 may be used to track the addresses of cache lines for which data is expected to be returned (in some cases castout) by the CPU 102. The outstanding transaction buffer 129 may be used to track addresses targeted by “in-flight” requests routed from the GPU 104 to the CPU 102, indicating data for these addresses may be expected.
  • As illustrated, the snoop filter 125 may maintain a remote cache directory 126 which provides, at the GPU 104, an indication of entries in the L2 cache directory 115 on the CPU 102. Accordingly, when a remote device attempts to access data in a memory location, the snoop filter 125 may check the remote cache directory 126 to determine if a modified copy of the data is cached at the CPU 102 without having to send bus commands to the CPU 102. As a result, the snoop filter 125 may “filter out” requests to access data that is not cached in the CPU 102 and route those requests directly to memory 138, via the memory controller 130, thus reducing latency and increasing bus bandwidth. As will be described in greater detail below, the snoop filter 125 may operate in concert with a cache controller 113 which may generate enhanced bus transactions containing cache coherency information used by the snoop filter 125 to update the remote cache directory 126 to reflect changes to the CPU cache directory 115.
  • As illustrated, the CPU 102 may include various components (that interface with the L2 cache controller and bus interface) to support system coherency and respond to requests received from the GPU 104. Such components may include memory agents 202 and 206 to route requests to and receive responses from, respectively, memory 138, as well as a GPU agent 204 to route requests to and receive responses from the GPU cores 128 (or I/O masters). These agents may communicate with the GPU 104 via virtual channels 210 established on the FSB. The virtual channels 210 include “upbound” virtual channels 216 and 218 to handle requests and responses, respectively, from the GPU 104 and “downbound” virtual channels 212 and 214 to handle requests and responses, respectively, from the CPU 102. Data paths through the virtual channels 210 for different transactions under different circumstances are described in detail below, with reference to FIGS. 6A-6C.
  • For some embodiments, the snoop filter 125 may monitor requests issued from the CPU 102 in an effort to ensure the remote cache directory 126 mirrors the CPU cache directory 115, and accurately reflects the contents and coherency state of the CPU cache 114. For example, FIG. 3 illustrates exemplary operations 300 that may be performed (e.g., by the snoop filter 125) to update the remote cache directory 126 based on requests issued by the CPU 102 indicating a new cache line is being allocated in the L2 cache 114.
  • The operations 300 begin, at step 302, by receiving a (read allocation) request from the CPU 102. In some cases, the request may be an enhanced bus transaction containing additional coherency information allowing the snoop filter to update the remote cache directory 126, as described in the commonly owned U.S. patent application entitled “Enhanced Bus Transactions for Efficient Support of a Remote Cache Directory Copy” (Attorney Docket No. ROC920040036US1). This information may include an indication that an allocation or de-allocation transaction occurred and, if so, a particular cache line (e.g., a “way” within an associative set) that is being replaced. The information may also include an indication of whether an aging castout was or will be generated (i.e., resulting in modified data being written back to memory). These bus transactions may be considered enhanced because this additional coherency information may be added to information already included in a bus transaction occurring naturally. For example, a cache line allocation may naturally precede a bus transaction to read requested data to fill the allocated cache line.
  • At step 304, a valid bit of the old entry in the remote cache directory 126 (being replaced by the new entry) is examined. If the old entry is invalid, the new entry is allocated in the remote cache directory 126, at step 306. If the old entry is valid, however, a bit provided in the allocation request is examined to determine if the cached entry being replaced is to be castout, at step 308. If so, the GPU 104 can expect this data to be transferred (castout) from the CPU, and the old entry is copied to the castout buffer 121, at step 310. Thus, when the GPU 104 requests data, the castout buffer 121 may be examined to determine if a castout is pending (as shown in FIG. 5).
  • Even if the aging bit is not set, a castout (or other transfer) of the cacheline may still be pending, if the cacheline was targeted by an outstanding read or flush issued by the GPU 104. To determine if such requests are pending, the old entry (being replaced by the new allocation) may be compared against entries in the read/flush outstanding buffer, at step 312. A match indicates there is an outstanding read/flush request targeting the cacheline and, hence, the old entry is copied into the castout buffer 121, at step 310, prior to allocating the new entry in the remote directory (step 306). A mismatch indicates there is no such outstanding request, and the new entry is allocated, without copying the old entry into the castout buffer 121.
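  • For illustration only, the allocation handling of FIG. 3 described above may be sketched in software as follows. This is a simplified model under stated assumptions, not the described hardware: the function name handle_allocation is hypothetical, the remote cache directory is modeled as a dictionary keyed by a (set index, way) slot, and the castout buffer and outstanding read/flush buffer are modeled as sets of addresses.

```python
def handle_allocation(remote_dir, castout_buffer, outstanding, slot, new_addr, aging):
    """Hypothetical model of FIG. 3: update the remote directory on an L2 allocation.

    remote_dir maps a slot (set index, way) to a tuple (address, valid).
    castout_buffer and outstanding are sets of line addresses.
    """
    old_addr, old_valid = remote_dir.get(slot, (None, False))  # step 304
    if old_valid:
        # Step 308: the aging bit signals a castout of the replaced line.
        # Step 312: otherwise, an outstanding read/flush from the remote
        # device targeting the old line also means a transfer is pending.
        if aging or old_addr in outstanding:
            castout_buffer.add(old_addr)  # step 310
    remote_dir[slot] = (new_addr, True)  # step 306


remote_dir = {(5, 2): (0x1000, True)}
castout_buffer, outstanding = set(), set()
handle_allocation(remote_dir, castout_buffer, outstanding, (5, 2), 0x2000, aging=True)
# The old address 0x1000 is now tracked in castout_buffer, and slot (5, 2)
# holds the newly allocated address 0x2000.
```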
  • As described in the above-referenced application, similar operations to those shown in FIG. 3 may be performed to update the remote cache directory and castout buffer based on de-allocation information provided by the CPU 102. As with the allocation information, de-allocation information may also be contained in enhanced bus transactions. FIGS. 4A and 4B summarize the type of coherency information provided upon allocation and de-allocation, respectively.
  • As illustrated in FIG. 4A, for some embodiments, the coherency information may include a valid bit (rc_way_alloc_v) indicating whether or not a new entry is being allocated, set_id bits (rc_way_alloc[0:N]) indicating the way of the cache line being allocated, and an aging bit (rc_aging) indicating whether an aging castout (e.g., of a modified cache line) is being issued. If the valid bit is inactive, the remaining bits may be ignored, since a new entry is not being allocated (e.g., a cache line for a targeted memory location already exists in L2 cache). In other words, the coherency information may be sent with each such transaction, even when a new line is not being allocated, to avoid having separate transactions for transferring coherency information. In such embodiments, the GPU 104 may quickly check the valid bit to determine if a new cache line is being allocated.
  • If the valid bit is set, the set_id bits may be examined to determine which cache line of an associative set is being allocated. For example, for a 4-way associative cache (N=1), a 2-bit set_id may indicate one of 4 available cache lines; for an 8-way associative cache (N=2), a 3-bit set_id may indicate one of 8 available cache lines; and so on. As an alternative, individual bits (or signals) for each of the ways of the set may be used which, in some cases, may provide improved timing.
  • The aging bit set indicates an aging castout is being issued, for example, since the coherency state of the aging L2 cache line is modified (M). The aging bit cleared indicates that the entry being replaced is not being castout, for example, because the aging L2 entry was invalid (I), shared (S), or exclusive (E), and can be overwritten with this new allocation.
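  • For illustration only, decoding of the allocation coherency information of FIG. 4A may be sketched as follows. The packed bit layout used here (valid bit above the set_id bits, aging bit in the least significant position) is purely an assumption for the example, as are the function names; the description above does not specify how the bits/signals are arranged.

```python
def set_id_bits(ways):
    """Number of set_id bits needed to name one way of a `ways`-way set."""
    if ways < 1 or ways & (ways - 1):
        raise ValueError("associativity must be a power of two")
    return ways.bit_length() - 1  # 4 ways -> 2 bits, 8 ways -> 3 bits


def decode_alloc_info(word, ways):
    """Unpack an assumed layout [valid | set_id | aging], aging in bit 0."""
    n = set_id_bits(ways)
    aging = bool(word & 1)
    way = (word >> 1) & ((1 << n) - 1)
    valid = bool((word >> (1 + n)) & 1)
    if not valid:
        # No new line is being allocated, so the remaining bits are ignored.
        return None
    return {"way": way, "aging": aging}


info = decode_alloc_info(0b1101, ways=4)  # valid=1, set_id=0b10, aging=1
```

  • Checking the valid field first mirrors the quick check described above: when it is inactive, the way and aging fields carry no meaning and are not used.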
  • It should be noted that, in some cases, the remote cache directory 126 may indicate more valid cache lines are in the L2 cache 114 than are indicated by the CPU cache directory 115 (e.g., the valid cache lines indicated by the remote cache directory may represent a superset of the actual valid cache lines). This is because cache lines in the L2 cache 114 may transition from Exclusive (E) or Shared (S) to Invalid (I) without any corresponding bus operations to signal these transitions. While this may result in occasional additional requests sent from the GPU 104 to the CPU 102 (the CPU 102 can respond that its copy is invalid), it is also a safe approach aimed at ensuring the CPU is always checked if the remote cache directory 126 indicates requested data is cached. As will be described in greater detail below, these requests may be “reflected” back to the GPU to be routed to memory.
  • When L2 cache lines are de-allocated (e.g., due to a write with kill), enhanced bus transactions containing coherency information related to the de-allocation may also be generated. This coherency information may include an indication that an entry is being de-allocated and the set_id (way) indicating which cache line within an associative set is being de-allocated. This information may be generated by “push snoop logic” in the L2 cache 114 and carried in a set of control bits/signals, as with the previously described coherency information transmitted upon cache line allocation. This coherency information will be used by the GPU snoop filter 125 to correctly invalidate the corresponding entry in the (L2 superset) remote cache directory 126. As illustrated in FIG. 4B, the coherency information related to the de-allocation may be carried in similar bits/signals (valid and set_id) to those related to allocation shown in FIG. 4A. As the de-allocation assumes a castout, there may be no need for an aging bit.
  • Routing Remote Device Memory Requests
  • FIG. 5 is a flow diagram of exemplary operations 500 for routing remote device memory access requests based on information maintained in the remote cache (snoop filter) directory 126 and castout buffer 121, in accordance with embodiments of the present invention. While the operations are described with reference to requests issued by a GPU (core), it should be understood the same or similar operations may be performed to route requests from any requesting entity.
  • The operations 500 begin, at step 502, by receiving a request from the GPU 104. At step 504, the snoop filter directory 126 is checked in an effort to determine if a cache line containing data targeted by the request is in the L2 cache 114 of the CPU 102. A hit (an entry with a matching address and valid state) indicates a targeted cache line is in the L2 cache 114, while a miss indicates one is not. However, even in the event of a miss, it is possible that a castout of a recently cached line is pending and modified data may be written back to memory. Therefore, the castout buffer is checked, at step 516, for an indication a castout of a targeted cache line is pending. If a castout is pending, there is a risk that stale data might be read from memory if the request is issued before the modified data is written back to memory, so the GPU waits for the pending castout, at step 520.
  • If there is no castout pending, the request is routed to memory, at step 518. In other words, by maintaining coherency information in the snoop filter directory 126 and castout buffer 121, the request may be issued directly against memory, without having to send any time-consuming snoop requests to the CPU. This scenario is illustrated in the exemplary data path diagram of FIG. 6A, in which various events are enumerated (1-4). First, a GPU core issues a request (1). Second, the request misses in the snoop filter directory 126 and castout buffer 121 (2), indicating a targeted cache line does not presently reside in the L2 cache 114. Accordingly, the request is routed to memory, via the memory controller 130 (3). Finally, the memory controller 130 returns the requested data to the GPU core (4).
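  • For illustration only, the routing decision of FIG. 5 described above may be sketched as follows. This is a simplified model: the function name route_request and the returned labels are hypothetical, and both the snoop filter directory and the castout buffer are modeled as sets of line addresses currently believed valid or pending.

```python
def route_request(addr, remote_dir, castout_buffer):
    """Hypothetical model of FIG. 5: decide where a remote-device request goes."""
    if addr in remote_dir:
        # Step 504 hit: the line may reside in the processor L2 cache.
        return "L2"
    if addr in castout_buffer:
        # Step 516 hit: modified data may not have reached memory yet,
        # so the request waits for the pending castout (step 520).
        return "WAIT_FOR_CASTOUT"
    # Step 518: no copy at the processor, so go directly to memory.
    return "MEMORY"


remote_dir = {0x1000}
castout_buffer = {0x2000}
decisions = [route_request(a, remote_dir, castout_buffer)
             for a in (0x1000, 0x2000, 0x3000)]
# decisions == ["L2", "WAIT_FOR_CASTOUT", "MEMORY"]
```

  • Only the first case places a transaction on the bus to the processor; the other two are resolved entirely at the remote device.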
  • Referring back to FIG. 5, a check of the snoop filter directory, at step 504, resulting in a hit indicates a cache line containing data targeted by the request is in the L2 cache 114. According to some embodiments of the present invention, the coherency logic 127 may send a request to tell the CPU 102 to invalidate its cached copy of the targeted memory location (if the copy was not modified) or cast out its copy (if it was modified). To track these pending operations, and handle subsequent accesses targeting the same memory locations, a copy of the targeted address is stored in the read/write outstanding buffer 129, at step 506. At step 508, a request to invalidate/castout its copy is routed to the CPU 102. Depending on the state of the targeted data, the CPU may respond with data (if castout) or at least some type of response indicating the request was processed. Therefore, at step 510, the GPU 104 may receive response data or a reflected read (described in greater detail below). At step 512, the entry from the read/write outstanding buffer 129 may be removed.
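  • For illustration only, the tracking of an in-flight request through steps 506-512 may be sketched as follows. This is a synchronous simplification under stated assumptions: the function name issue_to_cpu is hypothetical, the outstanding buffer is modeled as a set of addresses, and the processor is modeled as a callback that returns either data or a reflected request.

```python
def issue_to_cpu(addr, outstanding, send_to_cpu):
    """Hypothetical model of steps 506-512 for a request that hit the directory."""
    outstanding.add(addr)  # step 506: record the in-flight address
    try:
        # Steps 508/510: route the request and obtain the response
        # (castout data, or a reflected read to be run against memory).
        return send_to_cpu(addr)
    finally:
        outstanding.discard(addr)  # step 512: request is no longer in flight


outstanding = set()
seen_in_flight = []


def fake_cpu(addr):
    # Stand-in for the processor; records whether the address was tracked.
    seen_in_flight.append(addr in outstanding)
    return "reflected_read"


result = issue_to_cpu(0x3000, outstanding, fake_cpu)
```

  • While the callback runs, the address is present in the outstanding buffer, so a concurrent allocation check against that buffer would see the pending request.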
  • Data paths for requests that hit in the snoop filter directory 126 are illustrated in FIGS. 6B and 6C, in which various events are again enumerated. FIG. 6B illustrates the routing of a request for data that is cached in the L2 114 in a valid state, and returned from the CPU directly to a requesting GPU core. First, a GPU core issues a request (1). Second, the request hits in the snoop filter directory 126 (2), indicating a targeted cache line resides in the L2 cache 114. Accordingly, the request is routed to the L2 114 (3). For some embodiments, and in some instances, the L2 114 logic may respond by sending a response with the requested data directly to the GPU core (4).
  • This approach may reduce latency by eliminating the need for the GPU core to generate a separate request to read the requested memory. In some cases, if the data has been modified, it may be marked as dirty in the response, causing the GPU 104 to generate a write to memory. In some cases, however, the GPU 104 may access a special set of registers, referred to as a lock set, that does not require backing to memory (e.g., the GPU reads, but never writes to these registers). The concepts of utilizing such a lock set are described in detail in the commonly owned application, entitled “Direct Access of Cache Lock Set Data Without Backing Memory” (Attorney Docket No. ROC920040048US1), filed herewith.
  • FIG. 6C illustrates the routing of a request for data that results in a hit with the remote cache directory 126 but the data is not cached in the L2 in a valid state. In such cases, the L2 cache may return NULL data, causing reflection logic 208 in the CPU 102 to respond with what may be referred to as “reflected” read (or write) requests that are, in effect, requests reflected back to the GPU 104 to be routed to the memory controller 130 for execution against memory (e.g., on behalf of the requesting GPU core 128).
  • FIG. 7 is a flow diagram of exemplary operations 700 for updating the snoop filter directory 126, castout buffer 121, and/or read/write outstanding buffer 129, in response to certain requests received from the CPU 102. The operations 700 begin, at step 702, by receiving such a request from the CPU 102. As illustrated, for some embodiments, requests that cause a change to these coherency mechanisms may include a write with kill, or a reflected read or write.
  • If the request hits in the remote cache (snoop filter) directory 126, as determined at step 704, the entry that resulted in the hit is invalidated, at step 706. This is because a write with kill indicates the corresponding data in the L2 cache is being written out, and a reflected read or write request indicates the data in the L2 cache is no longer valid. As illustrated, the castout buffer 121 may be checked, at step 708, in parallel with the remote cache directory 126. A hit in the castout buffer 121 also results in the corresponding entry being invalidated, at step 706. If the request received from the CPU is a reflected read or write, as determined at step 710, the corresponding entry is removed from the outstanding transaction buffer 129, at step 712. The entry (which was created when the coherency logic routed the request resulting in the reflected read/write request to the L2, per step 506 of FIG. 5) is removed because the request is no longer “in flight.” The request is then routed to memory, at step 714.
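  • For illustration, the operations 700 may be sketched as follows. This is a minimal software model, not the described hardware: Python sets stand in for the remote cache directory 126, castout buffer 121, and outstanding transaction buffer 129, the two lookups run sequentially rather than in parallel, and the function name `handle_cpu_request` and the request-kind labels are hypothetical.

```python
def handle_cpu_request(kind, addr, directory, castout_buffer, outstanding):
    """Model of operations 700 for one request received from the CPU.

    `kind` is one of the hypothetical labels 'write_with_kill',
    'reflected_read', or 'reflected_write'. The three structures are sets of
    cache-line addresses. Returns the list of actions taken, in order.
    """
    steps = []

    # Steps 704 and 708: look up the directory and the castout buffer
    # (checked in parallel in the described embodiment, sequentially here).
    if addr in directory:
        directory.remove(addr)  # step 706: invalidate the directory entry
        steps.append("invalidate_directory")
    if addr in castout_buffer:
        castout_buffer.remove(addr)  # step 706: invalidate the castout entry
        steps.append("invalidate_castout")

    # Step 710: a reflected read or write means the request is no longer in flight.
    if kind in ("reflected_read", "reflected_write") and addr in outstanding:
        outstanding.remove(addr)  # step 712
        steps.append("remove_outstanding")

    # Step 714: the request is then routed to memory.
    steps.append("route_to_memory")
    return steps
```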
  • CONCLUSION
  • Coherency support structures (e.g., a remote cache directory, castout buffer, and outstanding transaction buffer) on a remote device may be used to indicate the contents of an L2 cache of a processor that shares memory with the remote device and to indicate the status of requests targeting data stored in the L2 cache. Accordingly, the mechanisms may be checked at the remote device to determine whether to route a memory request to the L2 cache or directly to memory, which may result in significant reductions in latency. These mechanisms may be updated by monitoring memory access requests issued by the processor, as well as the remote device, avoiding unnecessary snoop requests.
  • While the foregoing is directed to embodiments of the present invention, other and further embodiments of the invention may be devised without departing from the basic scope thereof, and the scope thereof is determined by the claims that follow.
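  • As a non-limiting illustration of the routing decision summarized above, the lookup at the remote device may be sketched as follows. The names `Route` and `route_request`, the use of Python sets for the coherency structures, and the choice to test the castout buffer before the remote cache directory are assumptions made for this sketch only.

```python
from enum import Enum


class Route(Enum):
    MEMORY = "memory"
    PROCESSOR_CACHE = "processor_cache"
    WAIT_FOR_CASTOUT = "wait_for_castout"


def route_request(addr, directory, castout_buffer, outstanding):
    """Decide where a memory request issued at the remote device is routed.

    `directory`, `castout_buffer`, and `outstanding` are sets of cache-line
    addresses modeling the remote cache directory, castout buffer, and
    outstanding transaction buffer.
    """
    if addr in castout_buffer:
        # The line has been or will be cast out: wait for the castout to occur.
        return Route.WAIT_FOR_CASTOUT
    if addr in directory:
        # The line resides in the processor cache: route the request there
        # and record it as in flight.
        outstanding.add(addr)
        return Route.PROCESSOR_CACHE
    # No match in either structure: go directly to memory without
    # sending a request to the processor.
    return Route.MEMORY
```

A request that misses both structures never reaches the processor, which is the source of the latency reduction described above.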

Claims (19)

1. A method of maintaining coherency of data accessed by a remote device, comprising:
maintaining, on the remote device, a remote cache directory indicative of memory locations residing in a cache on a processor which shares access to some portion of a memory device;
maintaining, on the remote device, a castout buffer indicating cache lines that have been or will be castout from the processor cache; and
routing memory requests issued at the remote device to the memory device or the processor cache, depending on information contained in the remote cache directory and castout buffer.
2. The method of claim 1, wherein maintaining the remote cache directory comprises:
receiving, by the remote device, a bus transaction initiated by the processor containing cache coherency information indicating a change to a cache directory residing on the processor; and
updating the remote cache directory, based on the cache coherency information, to reflect the change to the cache directory residing on the processor.
3. The method of claim 2, wherein maintaining, on the remote device, a buffer indicating cache lines that have been castout from the processor cache comprises copying an entry from the remote cache directory to the castout buffer if the cache coherency information indicates an aging castout is to occur at the processor.
4. The method of claim 2, wherein the cache coherency information comprises a set of bits indicating a way within an associative set.
5. The method of claim 1, wherein routing memory requests issued at the remote device to the memory device or the processor cache, depending on information contained in the remote cache directory and castout buffer comprises:
routing memory requests issued at the remote device to memory if an address targeted by the memory request does not match entries in either the remote cache directory or castout buffer.
6. The method of claim 5, further comprising waiting for a castout to occur if an address targeted by the memory request matches an entry in the castout buffer.
7. The method of claim 5, wherein routing memory requests issued at the remote device to the memory device or the processor cache, depending on information contained in the remote cache directory and castout buffer comprises:
routing memory requests issued at the remote device to the processor cache if an address targeted by the memory request matches an entry in the remote cache directory.
8. The method of claim 7, further comprising creating an entry in an outstanding transaction buffer containing an address targeted by the memory request routed to the processor cache.
9. The method of claim 8, further comprising removing the entry from the outstanding transaction buffer after receiving response data from the processor.
10. A method of maintaining coherency of data accessed by a remote device, comprising:
maintaining, on the remote device, a remote cache directory indicative of memory locations residing in a cache on a processor which shares access to some portion of a memory device wherein maintaining the remote cache directory comprises:
receiving, by the remote device, a bus transaction initiated by the processor containing cache coherency information indicating a change to a cache directory residing on the processor; and
updating the remote cache directory, based on the cache coherency information, to reflect the change to the cache directory residing on the processor;
routing a memory request issued at the remote device to the processor cache if an address targeted by the memory request matches an entry in the remote cache directory; and
creating an entry in an outstanding transaction buffer residing on the remote device, the entry containing the address targeted by the memory request routed to the processor cache.
11. The method of claim 10, wherein:
maintaining, on the remote device, a buffer indicating cache lines that have been castout from the processor cache comprises copying an entry from the remote cache directory to the castout buffer if the cache coherency information indicates an aging castout is to occur at the processor; and
copying an entry from the outstanding transaction buffer to the castout buffer in response to detecting a match between an address of a cache line being castout and the entry.
12. A device configured to access data stored in memory and cacheable by a processor, comprising:
one or more processing cores;
a remote cache directory indicative of contents of a cache residing on the processor;
a castout buffer indicating cache lines that have been or will be castout from the processor cache; and
coherency logic configured to receive cache coherency information indicative of changes to the contents of the processor cache sent by the processor in bus transactions and update the cache directory and castout buffer based on the cache coherency information.
13. The device of claim 12, wherein the coherency logic is configured to:
receive cache coherency information indicating a cache line that has been de-allocated by the processor; and
in response, invalidate a corresponding entry in at least one of the remote cache directory and the castout buffer.
14. The device of claim 12, wherein the coherency logic is further configured to:
receive, from the processing core, a request to access data associated with a memory location;
examine the remote cache directory for an entry matching an address targeted by the request with a valid coherency state;
examine the castout buffer for an entry matching the address targeted by the request with a valid coherency state; and
if an entry matching the address targeted by the request is not found in either the remote cache directory or castout buffer, route the request to a memory controller to access the requested data from memory without sending a request to the processor.
15. The device of claim 14, wherein:
the device further comprises a pending transaction buffer; and
the coherency logic is further configured to route a request to the processor if an entry matching the address targeted by the request is found in the remote cache directory and create an entry in the pending transaction buffer containing the address targeted by the request.
16. The device of claim 15, wherein the memory controller resides on the remote device.
17. A coherent system, comprising:
a processor having a cache for storing data accessed from external memory, a cache directory with entries indicating which memory locations are stored in cache lines of the cache and corresponding coherency states thereof, and control logic configured to detect internal bus transactions indicating the allocation and de-allocation of cache lines and, in response, generate bus transactions, each containing cache coherency information indicating a cache line that has been allocated or de-allocated; and
a remote device having a remote cache directory indicative of contents of the cache residing on the processor, a castout buffer indicating cache lines that have been or will be castout from the processor cache, and coherency logic configured to:
update the remote cache directory, based on cache coherency information contained in the external bus transactions generated by the processor control logic, to reflect allocated and de-allocated cache lines of the processor cache;
receive a memory access request issued by a graphics processing core;
search the remote cache directory and castout buffer for entries matching an address targeted by the request; and
if no matching entries are found, route the request to external memory without sending a request to the processor.
18. The system of claim 17, wherein the coherency logic is further configured to:
if a matching entry is found, route the request to the processor; and
create an entry in an outstanding transaction buffer containing the address targeted by the request.
19. The system of claim 18, wherein the coherency logic is further configured to:
copy an entry from the outstanding transaction buffer to the castout buffer, in response to receiving coherency information from the processor indicating a corresponding cache line has been or will be cast out from the cache.
US11/848,960 2004-10-08 2007-08-31 Snoop filter directory mechanism in coherency shared memory system Abandoned US20070294481A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US11/848,960 US20070294481A1 (en) 2004-10-08 2007-08-31 Snoop filter directory mechanism in coherency shared memory system

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US10/961,749 US7305524B2 (en) 2004-10-08 2004-10-08 Snoop filter directory mechanism in coherency shared memory system
US11/848,960 US20070294481A1 (en) 2004-10-08 2007-08-31 Snoop filter directory mechanism in coherency shared memory system

Related Parent Applications (1)

Application Number Title Priority Date Filing Date
US10/961,749 Continuation US7305524B2 (en) 2004-10-08 2004-10-08 Snoop filter directory mechanism in coherency shared memory system

Publications (1)

Publication Number Publication Date
US20070294481A1 (en) 2007-12-20

Family

ID=36146739

Family Applications (2)

Application Number Title Priority Date Filing Date
US10/961,749 Expired - Fee Related US7305524B2 (en) 2004-10-08 2004-10-08 Snoop filter directory mechanism in coherency shared memory system
US11/848,960 Abandoned US20070294481A1 (en) 2004-10-08 2007-08-31 Snoop filter directory mechanism in coherency shared memory system

Family Applications Before (1)

Application Number Title Priority Date Filing Date
US10/961,749 Expired - Fee Related US7305524B2 (en) 2004-10-08 2004-10-08 Snoop filter directory mechanism in coherency shared memory system

Country Status (1)

Country Link
US (2) US7305524B2 (en)

Cited By (24)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20080005485A1 (en) * 2006-06-29 2008-01-03 Gilbert Jeffrey D Exclusive ownership snoop filter
GB2460337A (en) * 2008-05-30 2009-12-02 Intel Corp Reducing back invalidation transactions from a snoop filter
US20100100682A1 (en) * 2008-10-22 2010-04-22 International Business Machines Corporation Victim Cache Replacement
US20100100683A1 (en) * 2008-10-22 2010-04-22 International Business Machines Corporation Victim Cache Prefetching
US20100153647A1 (en) * 2008-12-16 2010-06-17 International Business Machines Corporation Cache-To-Cache Cast-In
US20100153650A1 (en) * 2008-12-16 2010-06-17 International Business Machines Corporation Victim Cache Line Selection
US20100235576A1 (en) * 2008-12-16 2010-09-16 International Business Machines Corporation Handling Castout Cache Lines In A Victim Cache
US20100262784A1 (en) * 2009-04-09 2010-10-14 International Business Machines Corporation Empirically Based Dynamic Control of Acceptance of Victim Cache Lateral Castouts
US20100262778A1 (en) * 2009-04-09 2010-10-14 International Business Machines Corporation Empirically Based Dynamic Control of Transmission of Victim Cache Lateral Castouts
US20100262782A1 (en) * 2009-04-08 2010-10-14 International Business Machines Corporation Lateral Castout Target Selection
US20100262783A1 (en) * 2009-04-09 2010-10-14 International Business Machines Corporation Mode-Based Castout Destination Selection
US20110173392A1 (en) * 2010-01-08 2011-07-14 International Business Machines Corporation Evict on write, a management strategy for a prefetch unit and/or first level cache in a multiprocessor system with speculative execution
US20110202731A1 (en) * 2010-01-15 2011-08-18 International Business Machines Corporation Cache within a cache
US20110219187A1 (en) * 2010-01-15 2011-09-08 International Business Machines Corporation Cache directory lookup reader set encoding for partial cache line speculation support
WO2012024090A2 (en) * 2010-08-20 2012-02-23 Intel Corporation Extending a cache coherency snoop broadcast protocol with directory information
US20120069035A1 (en) * 2010-09-20 2012-03-22 Qualcomm Incorporated Inter-processor communication techniques in a multiple-processor computing platform
US8489819B2 (en) 2008-12-19 2013-07-16 International Business Machines Corporation Victim cache lateral castout targeting
US8489822B2 (en) 2010-11-23 2013-07-16 Intel Corporation Providing a directory cache for peripheral devices
CN104298621A (en) * 2008-11-13 2015-01-21 英特尔公司 Shared virtual memory
US8949540B2 (en) 2009-03-11 2015-02-03 International Business Machines Corporation Lateral castout (LCO) of victim cache line in data-invalid state
US9189403B2 (en) 2009-12-30 2015-11-17 International Business Machines Corporation Selective cache-to-cache lateral castouts
US9411731B2 (en) * 2012-11-21 2016-08-09 Amazon Technologies, Inc. System and method for managing transactions
US20160321179A1 (en) * 2015-04-30 2016-11-03 Arm Limited Enforcing data protection in an interconnect
CN108206937A (en) * 2016-12-20 2018-06-26 浙江宇视科技有限公司 A kind of method and apparatus for promoting intellectual analysis performance

Families Citing this family (28)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7305524B2 (en) * 2004-10-08 2007-12-04 International Business Machines Corporation Snoop filter directory mechanism in coherency shared memory system
US7502895B2 (en) * 2005-09-13 2009-03-10 Hewlett-Packard Development Company, L.P. Techniques for reducing castouts in a snoop filter
US9035959B2 (en) 2008-03-28 2015-05-19 Intel Corporation Technique to share information among different cache coherency domains
US9058272B1 (en) 2008-04-25 2015-06-16 Marvell International Ltd. Method and apparatus having a snoop filter decoupled from an associated cache and a buffer for replacement line addresses
US8347035B2 (en) * 2008-12-18 2013-01-01 Intel Corporation Posting weakly ordered transactions
US8195880B2 (en) * 2009-04-15 2012-06-05 International Business Machines Corporation Information handling system with immediate scheduling of load operations in a dual-bank cache with dual dispatch into write/read data flow
US8140756B2 (en) * 2009-04-15 2012-03-20 International Business Machines Corporation Information handling system with immediate scheduling of load operations and fine-grained access to cache memory
US10489293B2 (en) 2009-04-15 2019-11-26 International Business Machines Corporation Information handling system with immediate scheduling of load operations
US8140765B2 (en) * 2009-04-15 2012-03-20 International Business Machines Corporation Information handling system with immediate scheduling of load operations in a dual-bank cache with single dispatch into write/read data flow
US8615637B2 (en) * 2009-09-10 2013-12-24 Advanced Micro Devices, Inc. Systems and methods for processing memory requests in a multi-processor system using a probe engine
US9400695B2 (en) * 2010-02-26 2016-07-26 Microsoft Technology Licensing, Llc Low latency rendering of objects
US8364904B2 (en) 2010-06-21 2013-01-29 International Business Machines Corporation Horizontal cache persistence in a multi-compute node, symmetric multiprocessing computer
US8856456B2 (en) 2011-06-09 2014-10-07 Apple Inc. Systems, methods, and devices for cache block coherence
US9373182B2 (en) * 2012-08-17 2016-06-21 Intel Corporation Memory sharing via a unified memory architecture
US9405351B2 (en) 2012-12-17 2016-08-02 Intel Corporation Performing frequency coordination in a multiprocessor system
US9292468B2 (en) * 2012-12-17 2016-03-22 Intel Corporation Performing frequency coordination in a multiprocessor system based on response timing optimization
US20140281234A1 (en) * 2013-03-12 2014-09-18 Advanced Micro Devices, Inc. Serving memory requests in cache coherent heterogeneous systems
US10157133B2 (en) 2015-12-10 2018-12-18 Arm Limited Snoop filter for cache coherency in a data processing system
US9900260B2 (en) 2015-12-10 2018-02-20 Arm Limited Efficient support for variable width data channels in an interconnect network
US9990292B2 (en) * 2016-06-29 2018-06-05 Arm Limited Progressive fine to coarse grain snoop filter
CN107292809B (en) * 2016-07-22 2020-10-09 珠海医凯电子科技有限公司 Method for realizing ultrasonic signal filtering processing by GPU
US10042766B1 (en) 2017-02-02 2018-08-07 Arm Limited Data processing apparatus with snoop request address alignment and snoop response time alignment
US10795820B2 (en) * 2017-02-08 2020-10-06 Arm Limited Read transaction tracker lifetimes in a coherent interconnect system
CN111913892B (en) * 2019-05-09 2021-12-07 北京忆芯科技有限公司 Providing open channel storage devices using CMBs
CN112463652B (en) * 2020-11-20 2022-09-27 海光信息技术股份有限公司 Data processing method and device based on cache consistency, processing chip and server
CN112612726B (en) * 2020-12-08 2022-09-27 海光信息技术股份有限公司 Data storage method and device based on cache consistency, processing chip and server
US11947418B2 (en) 2022-06-22 2024-04-02 International Business Machines Corporation Remote access array
US20240020027A1 (en) * 2022-07-14 2024-01-18 Samsung Electronics Co., Ltd. Systems and methods for managing bias mode switching

Citations (22)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US4136386A (en) * 1977-10-06 1979-01-23 International Business Machines Corporation Backing store access coordination in a multi-processor system
US5113514A (en) * 1989-08-22 1992-05-12 Prime Computer, Inc. System bus for multiprocessor computer system
US5291442A (en) * 1990-10-31 1994-03-01 International Business Machines Corporation Method and apparatus for dynamic cache line sectoring in multiprocessor systems
US5581705A (en) * 1993-12-13 1996-12-03 Cray Research, Inc. Messaging facility with hardware tail pointer and software implemented head pointer message queue for distributed memory massively parallel processing system
US5841973A (en) * 1996-03-13 1998-11-24 Cray Research, Inc. Messaging in distributed memory multiprocessing system having shell circuitry for atomic control of message storage queue's tail pointer structure in local memory
US5875462A (en) * 1995-12-28 1999-02-23 Unisys Corporation Multi-processor data processing system with multiple second level caches mapable to all of addressable memory
US5890217A (en) * 1995-03-20 1999-03-30 Fujitsu Limited Coherence apparatus for cache of multiprocessor
US6067611A (en) * 1998-06-30 2000-05-23 International Business Machines Corporation Non-uniform memory access (NUMA) data processing system that buffers potential third node transactions to decrease communication latency
US6078992A (en) * 1997-12-05 2000-06-20 Intel Corporation Dirty line cache
US6092173A (en) * 1996-08-08 2000-07-18 Fujitsu Limited Multiprocessor, memory accessing method for multiprocessor, transmitter and receiver in data transfer system, data transfer system, and bus control method for data transfer system
US6124868A (en) * 1998-03-24 2000-09-26 Ati Technologies, Inc. Method and apparatus for multiple co-processor utilization of a ring buffer
US6363438B1 (en) * 1999-02-03 2002-03-26 Sun Microsystems, Inc. Method of controlling DMA command buffer for holding sequence of DMA commands with head and tail pointers
US6449699B2 (en) * 1999-03-29 2002-09-10 International Business Machines Corporation Apparatus and method for partitioned memory protection in cache coherent symmetric multiprocessor systems
US6725296B2 (en) * 2001-07-26 2004-04-20 International Business Machines Corporation Apparatus and method for managing work and completion queues using head and tail pointers
US20040117592A1 (en) * 2002-12-12 2004-06-17 International Business Machines Corporation Memory management for real-time applications
US20040162946A1 (en) * 2003-02-13 2004-08-19 International Business Machines Corporation Streaming data using locking cache
US6801208B2 (en) * 2000-12-27 2004-10-05 Intel Corporation System and method for cache sharing
US6801207B1 (en) * 1998-10-09 2004-10-05 Advanced Micro Devices, Inc. Multimedia processor employing a shared CPU-graphics cache
US6820143B2 (en) * 2002-12-17 2004-11-16 International Business Machines Corporation On-chip data transfer in multi-processor system
US6820174B2 (en) * 2002-01-18 2004-11-16 International Business Machines Corporation Multi-processor computer system using partition group directories to maintain cache coherence
US20040263519A1 (en) * 2003-06-30 2004-12-30 Microsoft Corporation System and method for parallel execution of data generation tasks
US7305524B2 (en) * 2004-10-08 2007-12-04 International Business Machines Corporation Snoop filter directory mechanism in coherency shared memory system

Cited By (58)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7581068B2 (en) * 2006-06-29 2009-08-25 Intel Corporation Exclusive ownership snoop filter
US20080005485A1 (en) * 2006-06-29 2008-01-03 Gilbert Jeffrey D Exclusive ownership snoop filter
GB2460337B (en) * 2008-05-30 2010-12-15 Intel Corp Reducing back invalidaton transactions from a snoop filter
GB2460337A (en) * 2008-05-30 2009-12-02 Intel Corp Reducing back invalidation transactions from a snoop filter
US20090300289A1 (en) * 2008-05-30 2009-12-03 Tsvika Kurts Reducing back invalidation transactions from a snoop filter
US8015365B2 (en) 2008-05-30 2011-09-06 Intel Corporation Reducing back invalidation transactions from a snoop filter
US20100100682A1 (en) * 2008-10-22 2010-04-22 International Business Machines Corporation Victim Cache Replacement
US20100100683A1 (en) * 2008-10-22 2010-04-22 International Business Machines Corporation Victim Cache Prefetching
US8209489B2 (en) 2008-10-22 2012-06-26 International Business Machines Corporation Victim cache prefetching
US8347037B2 (en) 2008-10-22 2013-01-01 International Business Machines Corporation Victim cache replacement
CN104298621A (en) * 2008-11-13 2015-01-21 英特尔公司 Shared virtual memory
US20100153650A1 (en) * 2008-12-16 2010-06-17 International Business Machines Corporation Victim Cache Line Selection
US20100235576A1 (en) * 2008-12-16 2010-09-16 International Business Machines Corporation Handling Castout Cache Lines In A Victim Cache
US8225045B2 (en) 2008-12-16 2012-07-17 International Business Machines Corporation Lateral cache-to-cache cast-in
US20100153647A1 (en) * 2008-12-16 2010-06-17 International Business Machines Corporation Cache-To-Cache Cast-In
US8117397B2 (en) 2008-12-16 2012-02-14 International Business Machines Corporation Victim cache line selection
US8499124B2 (en) 2008-12-16 2013-07-30 International Business Machines Corporation Handling castout cache lines in a victim cache
US8489819B2 (en) 2008-12-19 2013-07-16 International Business Machines Corporation Victim cache lateral castout targeting
US8949540B2 (en) 2009-03-11 2015-02-03 International Business Machines Corporation Lateral castout (LCO) of victim cache line in data-invalid state
US8285939B2 (en) 2009-04-08 2012-10-09 International Business Machines Corporation Lateral castout target selection
US20100262782A1 (en) * 2009-04-08 2010-10-14 International Business Machines Corporation Lateral Castout Target Selection
US20100262784A1 (en) * 2009-04-09 2010-10-14 International Business Machines Corporation Empirically Based Dynamic Control of Acceptance of Victim Cache Lateral Castouts
US20100262783A1 (en) * 2009-04-09 2010-10-14 International Business Machines Corporation Mode-Based Castout Destination Selection
US8347036B2 (en) 2009-04-09 2013-01-01 International Business Machines Corporation Empirically based dynamic control of transmission of victim cache lateral castouts
US8327073B2 (en) 2009-04-09 2012-12-04 International Business Machines Corporation Empirically based dynamic control of acceptance of victim cache lateral castouts
US8312220B2 (en) 2009-04-09 2012-11-13 International Business Machines Corporation Mode-based castout destination selection
US20100262778A1 (en) * 2009-04-09 2010-10-14 International Business Machines Corporation Empirically Based Dynamic Control of Transmission of Victim Cache Lateral Castouts
US9189403B2 (en) 2009-12-30 2015-11-17 International Business Machines Corporation Selective cache-to-cache lateral castouts
US20110173392A1 (en) * 2010-01-08 2011-07-14 International Business Machines Corporation Evict on write, a management strategy for a prefetch unit and/or first level cache in a multiprocessor system with speculative execution
US20110208894A1 (en) * 2010-01-08 2011-08-25 International Business Machines Corporation Physical aliasing for thread level speculation with a speculation blind cache
US9501333B2 (en) 2010-01-08 2016-11-22 International Business Machines Corporation Multiprocessor system with multiple concurrent modes of execution
US8838906B2 (en) 2010-01-08 2014-09-16 International Business Machines Corporation Evict on write, a management strategy for a prefetch unit and/or first level cache in a multiprocessor system with speculative execution
US8832415B2 (en) 2010-01-08 2014-09-09 International Business Machines Corporation Mapping virtual addresses to different physical addresses for value disambiguation for thread memory access requests
US20110219381A1 (en) * 2010-01-15 2011-09-08 International Business Machines Corporation Multiprocessor system with multiple concurrent modes of execution
US20110219191A1 (en) * 2010-01-15 2011-09-08 International Business Machines Corporation Reader set encoding for directory of shared cache memory in multiprocessor system
US8868837B2 (en) 2010-01-15 2014-10-21 International Business Machines Corporation Cache directory lookup reader set encoding for partial cache line speculation support
US8533399B2 (en) 2010-01-15 2013-09-10 International Business Machines Corporation Cache directory look-up re-use as conflict check mechanism for speculative memory requests
US8621478B2 (en) 2010-01-15 2013-12-31 International Business Machines Corporation Multiprocessor system with multiple concurrent modes of execution
US20110219187A1 (en) * 2010-01-15 2011-09-08 International Business Machines Corporation Cache directory lookup reader set encoding for partial cache line speculation support
US8751748B2 (en) 2010-01-15 2014-06-10 International Business Machines Corporation Reader set encoding for directory of shared cache memory in multiprocessor system
US20110219215A1 (en) * 2010-01-15 2011-09-08 International Business Machines Corporation Atomicity: a multi-pronged approach
US20110202731A1 (en) * 2010-01-15 2011-08-18 International Business Machines Corporation Cache within a cache
US8918592B2 (en) 2010-08-20 2014-12-23 Intel Corporation Extending a cache coherency snoop broadcast protocol with directory information
US8656115B2 (en) 2010-08-20 2014-02-18 Intel Corporation Extending a cache coherency snoop broadcast protocol with directory information
WO2012024090A3 (en) * 2010-08-20 2012-06-07 Intel Corporation Extending a cache coherency snoop broadcast protocol with directory information
WO2012024090A2 (en) * 2010-08-20 2012-02-23 Intel Corporation Extending a cache coherency snoop broadcast protocol with directory information
US9298629B2 (en) 2010-08-20 2016-03-29 Intel Corporation Extending a cache coherency snoop broadcast protocol with directory information
US20120069035A1 (en) * 2010-09-20 2012-03-22 Qualcomm Incorporated Inter-processor communication techniques in a multiple-processor computing platform
US8937622B2 (en) * 2010-09-20 2015-01-20 Qualcomm Incorporated Inter-processor communication techniques in a multiple-processor computing platform
US9645866B2 (en) 2010-09-20 2017-05-09 Qualcomm Incorporated Inter-processor communication techniques in a multiple-processor computing platform
US9626234B2 (en) 2010-09-20 2017-04-18 Qualcomm Incorporated Inter-processor communication techniques in a multiple-processor computing platform
US8489822B2 (en) 2010-11-23 2013-07-16 Intel Corporation Providing a directory cache for peripheral devices
US9411731B2 (en) * 2012-11-21 2016-08-09 Amazon Technologies, Inc. System and method for managing transactions
US10061700B1 (en) 2012-11-21 2018-08-28 Amazon Technologies, Inc. System and method for managing transactions
US10635589B2 (en) 2012-11-21 2020-04-28 Amazon Technologies, Inc. System and method for managing transactions
US20160321179A1 (en) * 2015-04-30 2016-11-03 Arm Limited Enforcing data protection in an interconnect
US10078589B2 (en) * 2015-04-30 2018-09-18 Arm Limited Enforcing data protection in an interconnect
CN108206937A (en) * 2016-12-20 2018-06-26 浙江宇视科技有限公司 Method and apparatus for improving intelligent analysis performance

Also Published As

Publication number Publication date
US7305524B2 (en) 2007-12-04
US20060080508A1 (en) 2006-04-13

Similar Documents

Publication Title
US7305524B2 (en) Snoop filter directory mechanism in coherency shared memory system
US7577794B2 (en) Low latency coherency protocol for a multi-chip multiprocessor system
KR100545951B1 (en) Distributed read and write caching implementation for optimized input/output applications
US5996048A (en) Inclusion vector architecture for a level two cache
US7546422B2 (en) Method and apparatus for the synchronization of distributed caches
US7032074B2 (en) Method and mechanism to use a cache to translate from a virtual bus to a physical bus
US20060080511A1 (en) Enhanced bus transactions for efficient support of a remote cache directory copy
US5325504A (en) Method and apparatus for incorporating cache line replacement and cache write policy information into tag directories in a cache system
US7237068B2 (en) Computer system employing bundled prefetching and null-data packet transmission
JP3434462B2 (en) Allocation release method and data processing system
JPH07253928A (en) Duplex cache snoop mechanism
US11687457B2 (en) Hardware coherence for memory controller
WO2001050274A1 (en) Cache line flush micro-architectural implementation method and system
KR100613817B1 (en) Method and apparatus for the utilization of distributed caches
US8332592B2 (en) Graphics processor with snoop filter
US20040148471A1 (en) Multiprocessing computer system employing capacity prefetching
US20060179173A1 (en) Method and system for cache utilization by prefetching for multiple DMA reads
EP0533373A2 (en) Computer system having cache memory
JPH06208507A (en) Cache memory system
JPH10232831A (en) Cache tag maintaining device
GB2401227A (en) Cache line flush instruction and method

Legal Events

Date Code Title Description
STCB Information on status: application discontinuation

Free format text: ABANDONED -- FAILURE TO RESPOND TO AN OFFICE ACTION

AS Assignment

Owner name: BANK OF AMERICA, N.A., NEW YORK

Free format text: SECURITY AGREEMENT;ASSIGNORS:NORTEK, INC.;AIGIS MECHTRONICS, INC.;BROAN-MEXICO HOLDINGS, INC.;AND OTHERS;REEL/FRAME:023750/0040

Effective date: 20091217

AS Assignment

Owner name: AVC GROUP, LLC, THE, CALIFORNIA

Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNOR:XANTECH LLC;REEL/FRAME:025940/0279

Effective date: 20110301

AS Assignment

Owner name: CORE BRANDS, LLC, CALIFORNIA

Free format text: MERGER AND CHANGE OF NAME;ASSIGNOR:AVC GROUP, LLC, THE;REEL/FRAME:029708/0262

Effective date: 20130101

AS Assignment

Owner name: CLEANPAK INTERNATIONAL, INC., MINNESOTA

Free format text: TERMINATION AND RELEASE OF SECURITY IN PATENTS;ASSIGNOR:BANK OF AMERICA, N.A.;REEL/FRAME:041327/0089

Effective date: 20160831

Owner name: ZEPHYR VENTILATION, LLC, CALIFORNIA

Free format text: TERMINATION AND RELEASE OF SECURITY IN PATENTS;ASSIGNOR:BANK OF AMERICA, N.A.;REEL/FRAME:041327/0089

Effective date: 20160831

Owner name: GEFEN, LLC, CALIFORNIA

Free format text: TERMINATION AND RELEASE OF SECURITY IN PATENTS;ASSIGNOR:BANK OF AMERICA, N.A.;REEL/FRAME:041327/0089

Effective date: 20160831

Owner name: GOVERNAIR CORPORATION, MINNESOTA

Free format text: TERMINATION AND RELEASE OF SECURITY IN PATENTS;ASSIGNOR:BANK OF AMERICA, N.A.;REEL/FRAME:041327/0089

Effective date: 20160831

Owner name: MAMMOTH-WEBCO, INC., MINNESOTA

Free format text: TERMINATION AND RELEASE OF SECURITY IN PATENTS;ASSIGNOR:BANK OF AMERICA, N.A.;REEL/FRAME:041327/0089

Effective date: 20160831

Owner name: XANTECH LLC, CALIFORNIA

Free format text: TERMINATION AND RELEASE OF SECURITY IN PATENTS;ASSIGNOR:BANK OF AMERICA, N.A.;REEL/FRAME:041327/0089

Effective date: 20160831

Owner name: INTERNATIONAL ELECTRONICS, LLC, CALIFORNIA

Free format text: TERMINATION AND RELEASE OF SECURITY IN PATENTS;ASSIGNOR:BANK OF AMERICA, N.A.;REEL/FRAME:041327/0089

Effective date: 20160831

Owner name: NUTONE LLC, WISCONSIN

Free format text: TERMINATION AND RELEASE OF SECURITY IN PATENTS;ASSIGNOR:BANK OF AMERICA, N.A.;REEL/FRAME:041327/0089

Effective date: 20160831

Owner name: RANGAIRE LP, INC., RHODE ISLAND

Free format text: TERMINATION AND RELEASE OF SECURITY IN PATENTS;ASSIGNOR:BANK OF AMERICA, N.A.;REEL/FRAME:041327/0089

Effective date: 20160831

Owner name: NORDYNE INTERNATIONAL, INC., FLORIDA

Free format text: TERMINATION AND RELEASE OF SECURITY IN PATENTS;ASSIGNOR:BANK OF AMERICA, N.A.;REEL/FRAME:041327/0089

Effective date: 20160831

Owner name: SPEAKERCRAFT, LLC, CALIFORNIA

Free format text: TERMINATION AND RELEASE OF SECURITY IN PATENTS;ASSIGNOR:BANK OF AMERICA, N.A.;REEL/FRAME:041327/0089

Effective date: 20160831

Owner name: LINEAR LLC, CALIFORNIA

Free format text: TERMINATION AND RELEASE OF SECURITY IN PATENTS;ASSIGNOR:BANK OF AMERICA, N.A.;REEL/FRAME:041327/0089

Effective date: 20160831

Owner name: NORTEK INTERNATIONAL, INC., RHODE ISLAND

Free format text: TERMINATION AND RELEASE OF SECURITY IN PATENTS;ASSIGNOR:BANK OF AMERICA, N.A.;REEL/FRAME:041327/0089

Effective date: 20160831

Owner name: OMNIMOUNT SYSTEMS, INC., MINNESOTA

Free format text: TERMINATION AND RELEASE OF SECURITY IN PATENTS;ASSIGNOR:BANK OF AMERICA, N.A.;REEL/FRAME:041327/0089

Effective date: 20160831

Owner name: PANAMAX LLC, CALIFORNIA

Free format text: TERMINATION AND RELEASE OF SECURITY IN PATENTS;ASSIGNOR:BANK OF AMERICA, N.A.;REEL/FRAME:041327/0089

Effective date: 20160831

Owner name: MAGENTA RESEARCH LTD., RHODE ISLAND

Free format text: TERMINATION AND RELEASE OF SECURITY IN PATENTS;ASSIGNOR:BANK OF AMERICA, N.A.;REEL/FRAME:041327/0089

Effective date: 20160831

Owner name: TEMTROL, INC., MINNESOTA

Free format text: TERMINATION AND RELEASE OF SECURITY IN PATENTS;ASSIGNOR:BANK OF AMERICA, N.A.;REEL/FRAME:041327/0089

Effective date: 20160831

Owner name: NORDYNE LLC, MISSOURI

Free format text: TERMINATION AND RELEASE OF SECURITY IN PATENTS;ASSIGNOR:BANK OF AMERICA, N.A.;REEL/FRAME:041327/0089

Effective date: 20160831

Owner name: HUNTAIR, INC., MINNESOTA

Free format text: TERMINATION AND RELEASE OF SECURITY IN PATENTS;ASSIGNOR:BANK OF AMERICA, N.A.;REEL/FRAME:041327/0089

Effective date: 20160831

Owner name: RANGAIRE GP, INC., RHODE ISLAND

Free format text: TERMINATION AND RELEASE OF SECURITY IN PATENTS;ASSIGNOR:BANK OF AMERICA, N.A.;REEL/FRAME:041327/0089

Effective date: 20160831

Owner name: CES GROUP, INC., MINNESOTA

Free format text: TERMINATION AND RELEASE OF SECURITY IN PATENTS;ASSIGNOR:BANK OF AMERICA, N.A.;REEL/FRAME:041327/0089

Effective date: 20160831

Owner name: NORTEK, INC., RHODE ISLAND

Free format text: TERMINATION AND RELEASE OF SECURITY IN PATENTS;ASSIGNOR:BANK OF AMERICA, N.A.;REEL/FRAME:041327/0089

Effective date: 20160831

Owner name: PACIFIC ZEPHYR RANGE HOOD, INC., CALIFORNIA

Free format text: TERMINATION AND RELEASE OF SECURITY IN PATENTS;ASSIGNOR:BANK OF AMERICA, N.A.;REEL/FRAME:041327/0089

Effective date: 20160831

Owner name: HC INSTALLATIONS, INC., MINNESOTA

Free format text: TERMINATION AND RELEASE OF SECURITY IN PATENTS;ASSIGNOR:BANK OF AMERICA, N.A.;REEL/FRAME:041327/0089

Effective date: 20160831

Owner name: GATES THAT OPEN, LLC, FLORIDA

Free format text: TERMINATION AND RELEASE OF SECURITY IN PATENTS;ASSIGNOR:BANK OF AMERICA, N.A.;REEL/FRAME:041327/0089

Effective date: 20160831

Owner name: OPERATOR SPECIALTY COMPANY, INC., MICHIGAN

Free format text: TERMINATION AND RELEASE OF SECURITY IN PATENTS;ASSIGNOR:BANK OF AMERICA, N.A.;REEL/FRAME:041327/0089

Effective date: 20160831

Owner name: BROAN-NUTONE STORAGE SOLUTIONS LP, WISCONSIN

Free format text: TERMINATION AND RELEASE OF SECURITY IN PATENTS;ASSIGNOR:BANK OF AMERICA, N.A.;REEL/FRAME:041327/0089

Effective date: 20160831

Owner name: AIGIS MECHTRONICS, INC., CALIFORNIA

Free format text: TERMINATION AND RELEASE OF SECURITY IN PATENTS;ASSIGNOR:BANK OF AMERICA, N.A.;REEL/FRAME:041327/0089

Effective date: 20160831

Owner name: BROAN-MEXICO HOLDINGS, INC., RHODE ISLAND

Free format text: TERMINATION AND RELEASE OF SECURITY IN PATENTS;ASSIGNOR:BANK OF AMERICA, N.A.;REEL/FRAME:041327/0089

Effective date: 20160831

Owner name: BROAN-NUTONE LLC, WISCONSIN

Free format text: TERMINATION AND RELEASE OF SECURITY IN PATENTS;ASSIGNOR:BANK OF AMERICA, N.A.;REEL/FRAME:041327/0089

Effective date: 20160831

Owner name: NILES AUDIO CORPORATION, CALIFORNIA

Free format text: TERMINATION AND RELEASE OF SECURITY IN PATENTS;ASSIGNOR:BANK OF AMERICA, N.A.;REEL/FRAME:041327/0089

Effective date: 20160831

Owner name: SECURE WIRELESS, INC., CALIFORNIA

Free format text: TERMINATION AND RELEASE OF SECURITY IN PATENTS;ASSIGNOR:BANK OF AMERICA, N.A.;REEL/FRAME:041327/0089

Effective date: 20160831

Owner name: LITE TOUCH, INC., RHODE ISLAND

Free format text: TERMINATION AND RELEASE OF SECURITY IN PATENTS;ASSIGNOR:BANK OF AMERICA, N.A.;REEL/FRAME:041327/0089

Effective date: 20160831

Owner name: ELAN HOME SYSTEMS, L.L.C., CALIFORNIA

Free format text: TERMINATION AND RELEASE OF SECURITY IN PATENTS;ASSIGNOR:BANK OF AMERICA, N.A.;REEL/FRAME:041327/0089

Effective date: 20160831

Owner name: CES INTERNATIONAL LTD., MINNESOTA

Free format text: TERMINATION AND RELEASE OF SECURITY IN PATENTS;ASSIGNOR:BANK OF AMERICA, N.A.;REEL/FRAME:041327/0089

Effective date: 20160831