WO2001016737A2 - Cache-coherent shared-memory cluster - Google Patents

Cache-coherent shared-memory cluster

Info

Publication number
WO2001016737A2
WO2001016737A2 (PCT/US2000/024147)
Authority
WO
WIPO (PCT)
Prior art keywords
cache
memory
shared
workstation
cea
Application number
PCT/US2000/024147
Other languages
French (fr)
Other versions
WO2001016737A3 (en)
Inventor
Lynn Parker West
Ted Scardamalia
Original Assignee
Times N Systems, Inc.
Application filed by Times N Systems, Inc.
Priority to AU74742/00A (AU7474200A)
Publication of WO2001016737A2
Publication of WO2001016737A3

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00 Arrangements for program control, e.g. control units
    • G06F 9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/46 Multiprogramming arrangements
    • G06F 9/52 Program synchronisation; Mutual exclusion, e.g. by means of semaphores
    • G06F 9/526 Mutual exclusion algorithms
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 12/00 Accessing, addressing or allocating within memory systems or architectures
    • G06F 12/02 Addressing or allocation; Relocation
    • G06F 12/0223 User address space allocation, e.g. contiguous or non contiguous base addressing
    • G06F 12/0284 Multiple user address space allocation, e.g. using different base addresses
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 12/00 Accessing, addressing or allocating within memory systems or architectures
    • G06F 12/02 Addressing or allocation; Relocation
    • G06F 12/08 Addressing or allocation; Relocation in hierarchically structured memory systems, e.g. virtual memory systems
    • G06F 12/0802 Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches
    • G06F 12/0806 Multiuser, multiprocessor or multiprocessing cache systems
    • G06F 12/0815 Cache consistency protocols
    • G06F 12/0817 Cache consistency protocols using directory methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 8/00 Arrangements for software engineering
    • G06F 8/40 Transformation of program code
    • G06F 8/41 Compilation
    • G06F 8/45 Exploiting coarse grain parallelism in compilation, i.e. parallelism between groups of instructions
    • G06F 8/457 Communication
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00 Arrangements for program control, e.g. control units
    • G06F 9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/46 Multiprogramming arrangements
    • G06F 9/50 Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F 9/5005 Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F 9/5011 Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resources being hardware resources other than CPUs, Servers and Terminals
    • G06F 9/5016 Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resources being hardware resources other than CPUs, Servers and Terminals the resource being the memory
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00 Arrangements for program control, e.g. control units
    • G06F 9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/46 Multiprogramming arrangements
    • G06F 9/54 Interprogram communication
    • G06F 9/544 Buffers; Shared memory; Pipes
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 12/00 Accessing, addressing or allocating within memory systems or architectures
    • G06F 12/02 Addressing or allocation; Relocation
    • G06F 12/08 Addressing or allocation; Relocation in hierarchically structured memory systems, e.g. virtual memory systems
    • G06F 12/0802 Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches
    • G06F 12/0806 Multiuser, multiprocessor or multiprocessing cache systems
    • G06F 12/0815 Cache consistency protocols
    • G06F 12/0837 Cache consistency protocols with software control, e.g. non-cacheable data
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 2209/00 Indexing scheme relating to G06F9/00
    • G06F 2209/52 Indexing scheme relating to G06F9/52
    • G06F 2209/523 Mode

Abstract

Methods, systems and devices are described for a cache-coherent shared-memory cluster. A system includes a multiplicity of workstations and a shared memory unit (SMU) interconnected and arranged such that memory accesses by a given workstation in a set of address ranges will be to its local, private memory and that memory accesses to a second set of address ranges will be to shared memory and arranged such that accesses to the shared memory unit are recorded and signaled by a cache-emulator adapter (CEA) capable of recognizing and responding to cache-coherence signals within the given workstation. The methods, systems and devices provide advantages because the speed and scalability of parallel processor systems is enhanced.

Description

CACHE-COHERENT SHARED-MEMORY CLUSTER
BACKGROUND OF THE INVENTION
1. Field of the Invention
The invention relates generally to the field of computer systems based on multiple processors and shared memory. More particularly, the invention relates to computer systems that utilize a cache-coherent shared-memory cluster.
2. Discussion of the Related Art
The clustering of workstations is a well-known art. In the most common cases, the clustering involves workstations that operate almost totally independently, utilizing the network only to share such services as a printer, license-limited applications, or shared files. In more-closely-coupled environments, some software packages (such as NQS) allow a cluster of workstations to share work. In such cases the work arrives, typically as batch jobs, at an entry point to the cluster where it is queued and dispatched to the workstations on the basis of load.
In both of these cases, and all other known cases of clustering, the operating system and cluster subsystem are built around the concept of message-passing. The term message-passing means that a given workstation operates on some portion of a job until communication (typically to send or receive data) with another workstation is necessary. Then, the first workstation prepares and communicates with the other workstation.
Another well-known art is that of clustering processors within a machine, usually called a Massively Parallel Processor or MPP, in which the techniques are essentially identical to those of clustered workstations. Usually, the bandwidth and latency of the interconnect network of an MPP are more highly optimized, but the system operation is the same. In the general case, the passing of a message is an extremely expensive operation; expensive in the sense that many CPU cycles in the sender and receiver are consumed by the process of sending, receiving, bracketing, verifying, and routing the message, CPU cycles that are therefore not available for other operations. A highly streamlined message-passing subsystem can typically require 10,000 to 20,000 CPU cycles or more.
There are specific cases wherein the passing of a message requires significantly less overhead. However, none of these specific cases is adaptable to a general-purpose computer system.
Message-passing parallel processor systems have been offered commercially for years but have failed to capture significant market share because of poor performance and difficulty of programming for typical parallel applications. Message-passing parallel processor systems do have some advantages. In particular, because they share no resources, message-passing parallel processor systems are easier to provide with high-availability features. What is needed is a better approach to parallel processor systems.
There are alternatives to the passing of messages for closely-coupled cluster work. One such alternative is the use of shared memory for inter-processor communication.
Shared-memory systems have been much more successful at capturing market share than message-passing systems because of the dramatically superior performance of shared-memory systems, up to about four-processor systems. In Search of Clusters, Gregory F. Pfister, 2nd ed. (January 1998), Prentice Hall Computer Books, ISBN 0138997098, describes a computing system with multiple processing nodes in which each processing node is provided with private, local memory and also has access to a range of memory which is shared with other processing nodes. The disclosure of this publication in its entirety is hereby expressly incorporated herein by reference for the purpose of indicating the background of the invention and illustrating the state of the art.
However, providing high availability for traditional shared-memory systems has proved to be an elusive goal. The nature of these systems, which share all code and all data, including that data which controls the shared operating systems, is incompatible with the separation normally required for high availability. What is needed is an approach to shared-memory systems that improves availability.
Although the use of shared memory for inter-processor communication is a well-known art, prior to the teachings of U.S. Ser. No. 09/273,430, filed March 19, 1999, entitled Shared Memory Apparatus and Method for Multiprocessing Systems, the processors shared a single copy of the operating system. The problem with such systems is that they cannot be efficiently scaled beyond four- to eight-way systems except in unusual circumstances. All known cases of said unusual circumstances are such that the systems are not good price-performance systems for general-purpose computing.
The entire contents of U.S. Patent Applications 09/273,430, filed March 19, 1999 and PCT/US00/01262, filed January 18, 2000 are hereby expressly incorporated by reference herein for all purposes. U.S. Ser. No. 09/273,430 improved upon the concept of shared memory by teaching the concept which will herein be referred to as a tight cluster. The concept of a tight cluster is that of individual computers, each with its own CPU(s), memory, I/O, and operating system, but for which collection of computers there is a portion of memory which is shared by all the computers and via which they can exchange information. U.S. Ser. No. 09/273,430 describes a system in which each processing node is provided with its own private copy of an operating system and in which the connection to shared memory is via a standard bus. The advantage of a tight cluster in comparison to an SMP is "scalability," which means that a much larger number of computers can be attached together via a tight cluster than via an SMP with little loss of processing efficiency. What is needed are improvements to the concept of the tight cluster.
What is also needed is an expansion of the concept of the tight cluster.
Another well-known art is the use of memory caches to improve performance. Caches provide such a significant performance boost that most modern computers use them. At the very top of the performance (and price) range all of memory is constructed using cache-memory technologies.
However, this is such an expensive approach that few manufacturers use it. All manufacturers of personal computers (PCs) and workstations use caches except for the very low end of the PC business where caches are omitted for price reasons and performance is, therefore, poor.
Caches, however, present a problem for shared-memory computing systems: the problem of coherence. As a particular processor reads or writes a word of shared memory, that word and usually a number of surrounding words are transferred to that particular processor's cache memory transparently by cache-memory hardware. That word and the surrounding words (if any) are transferred into a portion of the particular processor's cache memory that is called a cache line or cache block. If the transferred cache line is modified by the particular processor, the representation in the cache memory will become different from the value in shared memory. That cache line within that particular processor's cache memory is, at that point, called a "dirty" line. The particular processor with the dirty line, when accessing that memory address, will see the new (modified) value. Other processors accessing that memory address will see the old (unmodified) value in shared memory. This lack of coherence between such accesses will lead to incorrect results.
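As a hedged illustration of the coherence problem just described (this sketch is not from the patent; the one-line caches and field names are invented), the following C program models two processors with private write-back caches over a single shared word and shows how a "dirty" line leaves the second processor reading a stale value:

    #include <stdio.h>
    #include <stdint.h>

    /* Toy model: each processor has a one-line private write-back cache. */
    typedef struct { uint32_t data; int valid; int dirty; } cache_line_t;

    static uint32_t shared_mem = 100;   /* the word in shared memory */

    static uint32_t cache_load(cache_line_t *c) {
        if (!c->valid) { c->data = shared_mem; c->valid = 1; }  /* fill on miss */
        return c->data;
    }

    static void cache_store(cache_line_t *c, uint32_t v) {
        c->data = v; c->valid = 1; c->dirty = 1;  /* write-back: memory untouched */
    }

    int main(void) {
        cache_line_t a = {0}, b = {0};
        cache_load(&a); cache_load(&b);   /* both caches now hold 100 */
        cache_store(&a, 200);             /* A's line is now "dirty"  */
        printf("A sees %u, B sees %u, memory holds %u\n",
               cache_load(&a), cache_load(&b), shared_mem);  /* 200 vs. 100 */
        return 0;
    }

Running it prints "A sees 200, B sees 100, memory holds 100": exactly the incoherence that the coherence hardware discussed below is meant to prevent.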
Modern computers, workstations, and PCs which provide for multiple processors and shared memory, therefore, also provide high-speed, transparent cache coherence hardware to assure that if a line in one cache changes and another processor subsequently accesses a value which is in that address range, the new values will be transferred back to memory or at least to the requesting processor.
Caches can be maintained coherent by software, provided that sufficient cache-management instructions are provided by the manufacturer. However, in many cases, an adequate arsenal of such instructions is not provided.
Moreover, even in cases where the instruction set is adequate, the software overhead is so great that no examples are known of commercially successful machines which use software-managed coherence. Thus, the existing hardware and software cache coherency approaches are unsatisfactory. What is also needed, therefore, is a better approach to cache coherency.
SUMMARY OF THE INVENTION
A goal of the invention is to simultaneously satisfy the above-discussed requirements of improving and expanding the tight cluster concept which, in the case of the prior art, are not satisfied.
One embodiment of the invention is based on an apparatus, comprising: a central shared memory unit; a first processing node coupled to said central shared memory unit via a first interconnection; and a second processing node coupled to said central shared memory unit via a second interconnection.
Another embodiment of the invention is based on a method, comprising: recording memory accesses by a workstation to a shared memory unit (SMU) by a cache emulator adapter (CEA) capable of recognizing and responding to cache-coherence signals within the workstation.
Another embodiment of the invention is based on an electronic media, comprising: a computer program adapted to record memory accesses by a workstation to a shared memory unit (SMU) by a cache emulator adapter (CEA) capable of recognizing and responding to cache-coherence signals within the workstation.
Another embodiment of the invention is based on a system, comprising a multiplicity of workstations and a shared memory unit (SMU) interconnected and arranged such that memory accesses by a given workstation in a set of address ranges will be to its local, private memory and that memory accesses to a second set of address ranges will be to shared memory and arranged such that accesses to the shared memory unit are recorded and signaled by a cache emulator adapter (CEA) capable of recognizing and responding to cache-coherence signals within the given workstation.
Another embodiment of the invention is based on a system, comprising: multiple processing nodes and one or more shared memory nodes; each processing node containing some memory which is local and private to that node, and each shared memory node containing some other memory which is visible and usable by all nodes; and a local cache system which provides caching for a first memory address region, and one or more shared-memory cache systems which provide caching for one or more other address regions.
Another embodiment of the invention is based on a system, comprising multiple processing nodes; each processing node containing some memory which is local and private to that node, and a portion of other memory which is visible and usable by all nodes.
These, and other goals and embodiments of the invention will be better appreciated and understood when considered in conjunction with the following description and the accompanying drawings. It should be understood, however, that the following description, while indicating preferred embodiments of the invention and numerous specific details thereof, is given by way of illustration and not of limitation. Many changes and modifications may be made within the scope of the invention without departing from the spirit thereof, and the invention includes all such modifications.
BRIEF DESCRIPTION OF THE DRAWINGS
A clear conception of the advantages and features constituting the invention, and of the components and operation of model systems provided with the invention, will become more readily apparent by referring to the exemplary, and therefore nonlimiting, embodiments illustrated in the drawings accompanying and forming a part of this specification, wherein like reference characters (if they occur in more than one view) designate the same parts. It should be noted that the features illustrated in the drawings are not necessarily drawn to scale.
FIG. 1 illustrates a block schematic view of a share-as-needed system, representing an embodiment of the invention.
FIG. 2 illustrates a block schematic view of a cache and directory for the system of FIG. 1.
FIG. 3 illustrates a block schematic view of another cache and directory for the system of FIG. 1.
FIG. 4 illustrates a block schematic view of a share-as-needed system with distributed, shared memory, representing an embodiment of the invention.
FIG. 5 illustrates a block schematic view of a share-as-needed system distributed memory node, representing an embodiment of the invention.
DESCRIPTION OF PREFERRED EMBODIMENTS
The invention and the various features and advantageous details thereof are explained more fully with reference to the nonlimiting embodiments that are illustrated in the accompanying drawings and detailed in the following description of preferred embodiments. Descriptions of well-known components and processing techniques are omitted so as not to unnecessarily obscure the invention.
The teachings of U.S. Ser. No. 09/273,430 include a system which is a single entity: one large supercomputer. The invention is also applicable to a cluster of workstations, or even a network.
The invention is applicable to systems of the type of Pfister or the type of U.S. Ser. No. 09/273,430 in which each processing node has its own copy of an operating system. The invention is also applicable to other types of multiple processing node systems.
The context of the invention can include a tight cluster as described in U.S. Ser. No. 09/273,430. A tight cluster is defined as a cluster of workstations or an arrangement within a single, multiple-processor machine in which the processors are connected by a high-speed, low-latency interconnection, and in which some but not all memory is shared among the processors. Within the scope of a given processor, accesses to a first set of ranges of memory addresses will be to local, private memory but accesses to a second set of memory address ranges will be to shared memory. The significant advantage to a tight cluster in comparison to a message-passing cluster is that, assuming the environment has been appropriately established, the exchange of information involves a single
STORE instruction by the sending processor and a subsequent single LOAD instruction by the receiving processor.
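To make the contrast with message-passing concrete, here is a minimal C sketch (not from the patent; the statically allocated region stands in for whatever mapping the cluster environment establishes) of an exchange that costs one STORE and one LOAD:

    #include <stdint.h>
    #include <stdio.h>

    /* Stand-in for a shared-memory region already mapped into both nodes. */
    static volatile uint32_t shared_region[1024];

    static void sender_store(uint32_t slot, uint32_t v) {
        shared_region[slot] = v;          /* the single STORE by the sender */
    }

    static uint32_t receiver_load(uint32_t slot) {
        return shared_region[slot];       /* the single LOAD by the receiver */
    }

    int main(void) {
        sender_store(0, 42);
        printf("received %u\n", receiver_load(0));
        return 0;
    }

Everything expensive (mapping the region, agreeing on slot 0) happens once at setup, which is why the instruction-count comparison in the next paragraph holds.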
The establishment of the environment, taught by U.S. Ser. No. 09/273,430 and more fully by companion disclosures (U.S. Provisional Application Ser. No. 60/220,794, filed July 26, 2000; U.S. Provisional Application Ser. No. 60/220,748, filed July 26, 2000; WSGR 15245-712; WSGR 15245-713; WSGR 15245-715; WSGR 15245-716; WSGR 15245-717; WSGR 15245-718; WSGR 15245-719; and WSGR 15245-720, the entire contents of all of which are hereby expressly incorporated herein by reference for all purposes), can be performed in such a way as to require relatively little system overhead, and to be done once for many, many information exchanges. Therefore, a comparison of 10,000 instructions for message-passing to a pair of instructions for tight-clustering is valid.
The invention can be embodied as a system in which the shared memory is distributed within the workstations of the cluster. The shared memory can be distributed evenly or unevenly. The invention can include a computer system in which there are multiple processing nodes, for which each node is provided with some local (private) memory and for which some memory is provided which is visible to all processing nodes. The invention can include means for keeping the shared memory coherent with caches in the processing nodes. In U.S. Ser. No. 09/273,430, a system is described in which a multiplicity of processing nodes, each with private, local memory, have equal access to a shared pool of memory via a standard bus. In that application, the preferred embodiment utilizes PCI adapters, resident on the PCI bus of each processing node, as the means to access the shared memory. This invention teaches multiple mechanisms based on that system (herein called a Processor Team) for keeping the caches within the processing nodes coherent with the shared memory.
The invention can be described in terms of a set-associative cache using the AMESI protocol. The AMESI protocol, for reference, stands for Absent, Modified, Exclusive, Shared and Invalid. Absent and Invalid are reported differently by the cache upon interrogation, but the action taken is the same, so the description hereafter will refer only to MESI.
In a MESI system, if a cache interrogation results in Invalid, then the cache does not have the data and the interrogator reacts appropriately to that response. If a cache interrogation results in Exclusive, then the cache has the only valid copy of the data and the interrogator reacts appropriately to that response. If a cache interrogation results in Shared, then the cache has one valid copy of the data and the interrogator reacts appropriately to that response. If a cache interrogation results in Modified, then the cache has the only valid copy of the data and that copy is different from the copy in memory; the interrogator reacts appropriately to that response. The interrogators and responses are also well understood in the art. There are two different kinds of interrogators: (1) the CPU needing access to data at that particular cache-line address; and (2) agents from other CPUs which, after finding the data is not present in their local cache, search all caches for the data. It should be mentioned that item (2) above is slightly simplified: sometimes an agent is generated for some other cause.
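The MESI responses above can be summarized in a small C sketch (an illustrative simplification, not the patent's implementation; the reaction strings are placeholders):

    #include <stdio.h>

    typedef enum { INVALID, SHARED, EXCLUSIVE, MODIFIED } mesi_t;

    /* What an interrogator learns from each possible cache response. */
    static const char *interrogate(mesi_t state) {
        switch (state) {
        case INVALID:   return "miss: fetch the line from memory or another cache";
        case SHARED:    return "hit: one valid copy here; other copies may exist";
        case EXCLUSIVE: return "hit: the only valid copy, identical to memory";
        case MODIFIED:  return "hit: the only valid copy, different from memory";
        }
        return "unreachable";
    }

    int main(void) {
        for (int s = INVALID; s <= MODIFIED; s++)
            printf("state %d -> %s\n", s, interrogate((mesi_t)s));
        return 0;
    }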
For a local CPU access, the interrogator action is rather simple if the data is present: get the data or modify the data. If the cache line is marked S and the CPU access is a Store, then an agent is created to assure other caches are kept coherent. Similar results, well understood in the prior art, occur for other CPU actions and for the various agent actions, relative to each state the addressed cache line may hold in a given cache.
Similarly, cache directories are well understood in the prior art. A standard example of a directory is one in which, if a cache line is not present in the local cache, the directory can be accessed for that cache-line address and will yield, via a bit pattern, which other caches may have copies of the cache line, the state in each of the other caches, and other similar information.
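A directory entry of the kind just described might be sketched in C as follows (a hedged example; the field widths, the 16-cache limit, and the helper name are invented for illustration):

    #include <stdint.h>
    #include <stdio.h>

    #define MAX_CACHES 16

    typedef struct {
        uint16_t present;  /* bit i set => cache i may hold a copy of the line */
        uint8_t  state;    /* recorded state of the line (e.g., a MESI code)   */
    } dir_entry_t;

    static void list_holders(const dir_entry_t *e, uint64_t line_addr) {
        printf("line %#llx:", (unsigned long long)line_addr);
        for (int i = 0; i < MAX_CACHES; i++)
            if (e->present & (1u << i))
                printf(" cache%d", i);
        printf(" (state %u)\n", (unsigned)e->state);
    }

    int main(void) {
        dir_entry_t e = { 0x0005, 2 };   /* caches 0 and 2 may hold the line */
        list_holders(&e, 0x1000);
        return 0;
    }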
The present invention contemplates such an environment in which each free-standing computer is provided with very high-speed, low-latency communication means to a central shared-memory unit (SMU) which contains the memory shared among the collection of workstations which comprise the tight cluster. Each free-standing computer (e.g., workstation) is also provided with a specific interconnection to the SMU, said interface being the novel teaching of this invention.
One possible interface to the SMU would be a memory-alias interface. Such an interface would be designed to appear to the programmer (and to the processor) to be a memory adapter (usually called a memory card). Such a card would not in reality be a memory card, but would be responsive to accesses (LOADS and STORES) across a range of memory addresses, said range being designated a priori as the range in which shared memory resides. Within said range, the memory card would translate accesses into requests and responses to the shared memory unit, where the actual shared data would be stored and retrieved. Such a memory-alias card or adapter is of value and workable only if there is no cache or if software-managed cache coherence were used.
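The address-aliasing behavior could look like the following C sketch (again illustrative only: the base address, the range size, and the SMU stub are assumptions, and a real card would do this in hardware):

    #include <stdint.h>
    #include <stdio.h>

    #define SHARED_BASE 0x80000000u   /* a-priori designated shared range */
    #define SHARED_SIZE 0x10000000u

    static uint32_t local_mem[256];

    /* Stub standing in for a request/response over the link to the SMU. */
    static uint32_t smu_read(uint32_t offset) { return offset ^ 0xABCDu; }

    static uint32_t alias_load(uint32_t addr) {
        if (addr - SHARED_BASE < SHARED_SIZE)      /* inside the shared range? */
            return smu_read(addr - SHARED_BASE);   /* forward to the SMU       */
        return local_mem[addr % 256];              /* otherwise, local memory  */
    }

    int main(void) {
        printf("%#x\n", alias_load(SHARED_BASE + 4));
        return 0;
    }

STORES would be forwarded symmetrically. As the paragraph above notes, this only stays correct if the forwarded range is never cached, which is what motivates the cache emulation adapter described next.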
The present invention describes a different adapter within the workstation PC which not only provides an interface to the interconnection to the SMU but which also provides hardware cache coherence for accesses to shared memory. Most PCs and workstations which provide for multiple processors (SMPs) provide a bus to which each data cache has access and via which each "snoops" the bus activity of other processors. If a processor requires a memory access which may involve cache coherence issues, that access is first pre-signaled on the snooped bus. Any other cache controller for which the operation may cause loss of coherence signals such potential loss to the originating source, and operations which preserve coherence are then executed. This invention relates to such PCs and workstations. There are many such coherence maintenance schemes and this invention does not teach nor describe such schemes in detail; they are well known to those skilled in the art. The schemes involve some kind of positive response (ACK) if cache coherence actions are required, and a negative response (NAK) if the particular cache has no requirement for coherence actions. The present invention can be applied to any such scheme and for each will be designed to use the protocols of the particular scheme in use by the workstation.
The invention can include an interface within the PC or workstation which emulates a cache including cache-coherence control mechanisms, and fully emulates those data and control signals on the snooped bus, but which is not a cache nor is it the usual SMP companion processor found behind such caches, but rather is an interconnect interface to the communication link to the SMU. The invented adapter is hereafter called the cache emulation adapter (CEA).
In a system having some private, local memory in each of a multiplicity of processing nodes (PRNs) and some shared memory in one or more shared-memory nodes (SMNs), the invention can include assuring that caches within the processing nodes are kept coherent with the shared memory. The system within which the invention operates can include a multiplicity of workstations (which may be PCs) and a shared memory unit (SMU), in which the addressable scope of memory within each workstation is subdivided such that one or more address ranges are local memory within each workstation, not addressable by other workstations, and one or more other address ranges are the memory which is shared by all workstations.
In one embodiment of the invention, the SMU provides means for signaling to each CEA when a shared-memory cache line has been accessed by any CEA. All CEAs therefore keep a hashed, partial directory indicating which shared-memory cache lines may need coherence operations. For most memory accesses, the originating CPU can determine coherence status within its own cache and cache controls, so no external snooping action is required. When cache coherence (snooping action) is required, and if the address range is to private memory, the CEA will respond with the NAK response. For shared-memory accesses by the CPU (or CPUs) within a given workstation which require potential coherence action, the CEA will detect the request, will hold off (some protocols require it to signal HOLD OFF to the CPU, which it will do if appropriate), and will signal the SMU, which will, in turn, flush or invalidate the cache line in the owning cache (if any) and send the cache line to the requesting workstation.
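One way to picture this embodiment's CEA snoop decision is the following C sketch (under stated assumptions: the hashed, partial directory is modelled as a bit array where a set bit means the line may need coherence action, and the 64-byte line size and address split are invented):

    #include <stdint.h>
    #include <stdio.h>

    #define SHARED_BASE 0x80000000u
    #define HASH_BITS   4096u

    static uint8_t maybe_needs_action[HASH_BITS / 8];  /* hashed, partial directory */

    static unsigned hash_line(uint64_t line) {
        return (unsigned)((line * 2654435761u) % HASH_BITS);
    }

    /* Called when the SMU signals that some CEA touched this shared line. */
    static void note_shared_access(uint64_t line) {
        unsigned h = hash_line(line);
        maybe_needs_action[h / 8] |= (uint8_t)(1u << (h % 8));
    }

    static const char *snoop(uint64_t addr) {
        if (addr < SHARED_BASE) return "NAK";              /* private range     */
        unsigned h = hash_line(addr >> 6);                 /* 64-byte lines     */
        if (!(maybe_needs_action[h / 8] & (1u << (h % 8))))
            return "NAK";                                  /* definitely clean  */
        return "HOLD OFF; signal SMU to flush/invalidate"; /* may need action   */
    }

    int main(void) {
        note_shared_access((SHARED_BASE + 0x40) >> 6);
        printf("%s\n", snoop(SHARED_BASE + 0x40));  /* shared, possibly dirty */
        printf("%s\n", snoop(0x1000));              /* private: NAK           */
        return 0;
    }

Because the directory is partial, a set bit can cause a harmless extra CEA-SMU transaction, but a clear bit safely avoids one; that trade-off is what the directory-less embodiment below gives up.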
In another embodiment, the CEA may not be provided with a hashed, partial directory. In this case, all snoops to global memory will require CEA-SMU transactions to invalidate the needed cache line or to determine that it is clean.
In another embodiment, each CEA can be provided with a full directory of all shared lines cached in all workstations and can obtain the required cache line directly, notifying other CEAs to update ownership information for the cache line.
For any of the embodiments above, the CEA can be provided with caching to improve speed of response to shared memory at any workstation. The cache would be much larger than the typical CPU cache and would cache much larger blocks: memory pages, for example.
In another embodiment of the invention, the shared memory is accessed in units hereafter referred to as pages. The operating system in use is then configured to mark the pages as non-cacheable. Then, when a particular PRN accesses a shared page, the data from that page is retrieved from shared memory, and, being non-cacheable, is not placed into the cache of the PRN. Similarly, when data is written to that page, it is not written into the cache of that PRN by virtue of being marked non-cacheable. Therefore, any other processor subsequently accessing that data will see the same data, achieving the goal of keeping all caches coherent with respect to shared data.
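A toy C model of this non-cacheable-page scheme follows (hypothetical structures throughout; real systems set an uncacheable attribute in the page tables, which the OS configuration step above stands for):

    #include <stdint.h>
    #include <stdbool.h>
    #include <stdio.h>

    #define PAGE_SHIFT 12
    #define NPAGES     16

    static bool     page_cacheable[NPAGES];               /* set by the "OS"  */
    static uint32_t memory[(NPAGES << PAGE_SHIFT) >> 2];  /* word-addressable */

    static uint32_t load_word(uint32_t addr) {
        uint32_t page = addr >> PAGE_SHIFT;
        if (page_cacheable[page]) {
            /* normal path: look up / fill the CPU cache (omitted here) */
        }
        return memory[addr >> 2];   /* non-cacheable: straight to memory */
    }

    int main(void) {
        page_cacheable[0] = true;    /* a private, cacheable page    */
        page_cacheable[1] = false;   /* a shared, non-cacheable page */
        memory[(1u << PAGE_SHIFT) >> 2] = 7;
        printf("%u\n", load_word(1u << PAGE_SHIFT));
        return 0;
    }

Since loads and stores to the shared page always go to memory, every processor sees the same data, at the cost of losing cache hits on shared accesses.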
In another embodiment, which can improve overall system performance for certain data access patterns, the same methodology is used, but the memory access means includes a shared-memory cache holding the data more locally (with less latency) than the shared memory. Figure 1 shows such a memory access means, and shows the processing nodes and shared-memory node for a share-as-needed embodiment. A plurality of processing nodes 101 are coupled to a shared-memory node 102 with processing node to shared-memory node links 103.
Figure 2 shows a portion of a system similar to the system of figure 1. Figure 2 shows a processing node 201 coupled to a shared-memory node 202 with a processing node to shared-memory node link 203. The processing node 201 includes a cache 204 for the caching of shared-memory data. This cache 204 is not the same cache as is used for local memory, but rather is kept separate. The normal local-memory cache can be present without departing from the teachings of this invention.
In figure 2, the processing node 201 also includes a directory 205. The directory 205 indicates the contents of cache 204. The preferred embodiment of this invention uses the MESI protocol and a set-associative cache structure. Of course, the invention will work with other protocols and/or structures.
Figure 3 shows a preferred embodiment of this invention. In this embodiment, each processing node 301 is provided with a cache 304 for shared-memory accesses. A single directory 305 is provided at the shared-memory node 302 for resolving cache coherence issues. The processing node 301 is coupled to the shared-memory node 302 with a processing node to shared-memory node link 303. In this preferred embodiment, if the processing node 301 accesses a shared-memory location and the cache 304 at that node contains the data in a state compatible with the type of access, the access is resolved at the processing node 301 and operation continues. If the cache 304 at that node does not contain the data or if it is in a state incompatible with the access, then a request agent is transferred to directory 305. Directory 305 then updates its entry for that cache line and takes the other action consistent with the original access type at the original processing node 301.
These consistent actions are common to directory-based coherence schemes as taught in the prior art, and include actions such as invalidating entries in other caches, changing the state of entries in other caches from "Exclusive" to "Shared", or removing Modified data from one cache and delivering that data to another cache.
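The following C sketch illustrates how a home directory such as directory 305 could carry out these consistent actions; the state encoding, the sharer bitmask and the messaging stubs are assumptions for illustration, not the disclosed protocol.

    #include <stdint.h>

    #define MAX_NODES 32

    typedef enum { UNOWNED, SHARED_BY, EXCLUSIVE_AT, MODIFIED_AT } home_state_t;

    typedef struct {
        home_state_t state;
        uint32_t     sharers;   /* one bit per processing node */
    } home_entry_t;

    /* Messaging stubs for the processing-node/shared-memory-node links. */
    static void send_invalidate(int n, uint64_t l) { (void)n; (void)l; }
    static void send_downgrade(int n, uint64_t l)  { (void)n; (void)l; }
    static void fetch_modified(int n, uint64_t l)  { (void)n; (void)l; }

    static int first_holder(uint32_t sharers) { return __builtin_ctz(sharers); }

    /* Read request: pull Modified data home, downgrade the holder
     * from "Exclusive"/Modified to "Shared", then add the requester. */
    static void handle_read(home_entry_t *e, int req, uint64_t line)
    {
        if (e->state == MODIFIED_AT) {
            fetch_modified(first_holder(e->sharers), line);
            send_downgrade(first_holder(e->sharers), line);
        } else if (e->state == EXCLUSIVE_AT) {
            send_downgrade(first_holder(e->sharers), line);
        }
        e->sharers |= 1u << req;
        e->state    = SHARED_BY;
    }

    /* Write request: invalidate every other cached copy (fetching
     * Modified data home first), then grant exclusive ownership. */
    static void handle_write(home_entry_t *e, int req, uint64_t line)
    {
        if (e->state == MODIFIED_AT && e->sharers != (1u << req))
            fetch_modified(first_holder(e->sharers), line);

        for (int n = 0; n < MAX_NODES; n++)
            if (n != req && (e->sharers & (1u << n)))
                send_invalidate(n, line);

        e->state   = EXCLUSIVE_AT;
        e->sharers = 1u << req;
    }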
The invention includes the concept of a "parallel" cache, independent of another cache. In the prior art, level-1 caches are taught, as are level-1 and level-2 combinations, and higher levels. For all known forms of these, the data in any given cache level is a superset or a subset of the data in lower-level caches.
The invention can include a parallel cache system. The first cache system may consist of a level-1 cache only, or of several levels of cache. The second cache system in this invention has data orthogonal to the first cache system. In the preferred embodiment, the first cache system caches data in a first address range, whereas the second cache system caches data in a second address range. More than two parallel cache systems may be provided, each responsive to a separate address range.
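A minimal C sketch of routing each access to the parallel cache system that owns its address range follows; the two ranges and the cache identifiers are illustrative assumptions. Because exactly one route matches any address, the data held by the parallel cache systems stays orthogonal by construction.

    #include <stdint.h>
    #include <stddef.h>

    typedef struct {
        uint64_t base, limit;   /* half-open range [base, limit) */
        int      cache_id;      /* which parallel cache system   */
    } range_route_t;

    static const range_route_t routes[] = {
        { 0x00000000u, 0x40000000u, 0 },  /* first system: private memory */
        { 0x40000000u, 0x80000000u, 1 },  /* second system: shared memory */
    };

    /* Select the parallel cache system responsible for an address. */
    static int route_access(uint64_t addr)
    {
        for (size_t i = 0; i < sizeof routes / sizeof routes[0]; i++)
            if (addr >= routes[i].base && addr < routes[i].limit)
                return routes[i].cache_id;
        return -1;   /* address outside every cached range */
    }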
Figure 4 shows a share-as-needed system in which that portion of memory which is shared is distributed across a multiplicity of nodes rather than residing in a single, separate shared-memory node. A plurality of nodes 401 are interconnected with a node-node line 402.
Figure 5 shows a representative node that is similar to the nodes shown in figure 4. In figure 5, a processing node 501 is provided with local memory 502 and with a shared-memory partition 506. In addition, the node 501 is provided with a processor 504, a local-memory cache 503 and a shared-memory cache and directory 505. The node 501 also includes a shared-memory interconnect 507 and shared-memory links 508.
The caches operate as described previously. The directory 505 is actually a portion of the system-wide shared-memory directory and communicates with the other portions of said directory, which are distributed within the other nodes. Such a distributed directory is taught in the prior art.
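For illustration, a C sketch of one way to distribute such a directory follows, assuming the common prior-art choice of interleaving directory slices across nodes by line address; the node count and line size are hypothetical.

    #include <stdint.h>

    #define NUM_NODES  8    /* illustrative cluster size */
    #define LINE_BYTES 64   /* illustrative line size    */

    /* Interleave directory slices across the nodes by line address;
     * each node is the "home" for the lines its slice tracks. */
    static inline int home_node(uint64_t shared_addr)
    {
        return (int)((shared_addr / LINE_BYTES) % NUM_NODES);
    }

    /* A node resolves lines it homes in its own directory portion 505
     * and sends a message over the shared-memory links 508 otherwise. */
    static int slice_is_local(uint64_t shared_addr, int my_node)
    {
        return home_node(shared_addr) == my_node;
    }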
NUMA machines teach the concept of distributed memory. In NUMA machines, however, all memory is accessible by all processors in the system, and there is a single copy of the operating system running across all processors. The invention can include a system in which each node has some memory that is private and local to that node, on which a local copy of the operating system runs, together with some shared memory that is visible to all processors; in such a system, this invention teaches the concept of distributing the shared memory across the nodes. While not being limited to any particular performance indicator or diagnostic identifier, preferred embodiments of the invention can be identified one at a time by testing for the substantially highest performance. The test for the substantially highest performance can be carried out without undue experimentation by the use of a simple and conventional benchmark (speed) experiment.
The term substantially, as used herein, is defined as at least approaching a given state (e.g., preferably within 10% of, more preferably within 1% of, and most preferably within 0.1% of). The term coupled, as used herein, is defined as connected, although not necessarily directly, and not necessarily mechanically. The term means, as used herein, is defined as hardware, firmware and/or software for achieving a result. The term program or phrase computer program, as used herein, is defined as a sequence of instructions designed for execution on a computer system. A program may include a subroutine, a function, a procedure, an object method, an object implementation, an executable application, an applet, a servlet, a source code, an object code, and/or other sequence of instructions designed for execution on a computer system.
Practical Applications of the Invention
A practical application of the invention that has value within the technological arts is waveform transformation. Further, the invention is useful in conjunction with data input and transformation (such as are used for the purpose of speech recognition), or in conjunction with transforming the appearance of a display (such as are used for the purpose of video games), or the like. There are virtually innumerable uses for the invention, all of which need not be detailed here.
Advantages of the Invention
A system, representing an embodiment of the invention, can be cost effective and advantageous for at least the following reasons. The invention improves the speed of parallel computing systems. The invention improves the scalability of parallel computing systems.
All the disclosed embodiments of the invention described herein can be realized and practiced without undue experimentation. Although the best mode of carrying out the invention contemplated by the inventors is disclosed above, practice of the invention is not limited thereto. Accordingly, it will be appreciated by those skilled in the art that the invention may be practiced otherwise than as specifically described herein. For example, although the cache-coherent shared-memory cluster described herein can be a separate module, it will be manifest that the cache-coherent shared-memory cluster may be integrated into the system with which it is associated. Furthermore, all the disclosed elements and features of each disclosed embodiment can be combined with, or substituted for, the disclosed elements and features of every other disclosed embodiment except where such elements or features are mutually exclusive. It will be manifest that various additions, modifications and rearrangements of the features of the invention may be made without deviating from the spirit and scope of the underlying inventive concept. It is intended that the scope of the invention as defined by the appended claims and their equivalents cover all such additions, modifications, and rearrangements. The appended claims are not to be interpreted as including means-plus-function limitations, unless such a limitation is explicitly recited in a given claim using the phrase "means for." Expedient embodiments of the invention are differentiated by the appended subclaims.


CLAIMS
What is claimed is:
1. An apparatus, comprising: a central shared memory unit; a first processing node coupled to said central shared memory unit via a first interconnection; and a second processing node coupled to said central shared memory unit via a second interconnection.
2. The apparatus of claim 1, wherein said first interconnection includes a first interface and said second interconnection includes a second interface.
3. The apparatus of claim 2, wherein said first interface includes a first cache emulation adapter and said second interface includes a second cache emulation adapter.
4. A computer system, comprising the apparatus of claim 1.
5. An apparatus, comprising a first cache system and a second cache system, wherein the second cache system has data orthogonal to the first cache system.
6. A computer system, comprising the apparatus of claim 5.
7. A method, comprising: recording memory accesses by a workstation to a shared memory unit (SMU) by a cache emulator adapter (CEA) capable of recognizing and responding to cache-coherence signals within the workstation.
8. The method of claim 7, further comprising signaling to another CEA when a shared-memory cache line has been accessed.
9. The method of claim 8, further comprising invalidating a cache line by a CEA-SMU transaction.
10. The method of claim 8, further comprising determining that a cache line is clean by a CEA-SMU transaction.
11. An electronic media, comprising: a computer program adapted to record memory accesses by a workstation to a shared memory unit (SMU) by a cache emulator adapter (CEA) capable of recognizing and responding to cache-coherence signals within the workstation.
12. A computer program comprising computer program means adapted to perform the step of recording memory accesses by a workstation to a shared memory unit (SMU) by a cache emulator adapter (CEA) capable of recognizing and responding to cache-coherence signals within the workstation when said computer program is run on a computer.
13. A computer program as claimed in claim 12, embodied on a computer-readable medium.
14. A system, comprising a multiplicity of workstations and a shared memory unit (SMU) interconnected and arranged such that memory accesses by a given workstation in a first set of address ranges will be to its local, private memory and that memory accesses to a second set of address ranges will be to shared memory, and arranged such that accesses to the shared memory unit are recorded and signaled by a cache emulator adapter (CEA) capable of recognizing and responding to cache-coherence signals within the given workstation.
15. The system of claim 14, further comprising means for signaling between the CEA units and the SMU sufficient to satisfy cache coherence operations across the system when said cache coherence operations involve shared memory.
16. The system of claim 14, in which the SMU of said system includes a directory for keeping track of which workstation has which cache line.
17. The system of claim 14, in which said SMU cache directory includes means for keeping track of the ownership status of each workstation-owned cache line (READ SHARED, READ EXCLUSIVE, WRITE).
18. The system of claim 14, in which each CEA includes a directory of which workstation has which cache line.
19. The system of claim 14, in which said CEA cache directory includes means for keeping track of the ownership status of each workstation-owned cache line (READ SHARED, READ EXCLUSIVE, WRITE).
20. The system of claim 14, in which said CEA includes caching of shared-memory accesses.
21. A system, comprising: multiple processing nodes and one or more shared memory nodes; each processing node containing some memory which is local and private to that node, and each shared memory node containing some other memory which is visible and usable by all nodes; and a local cache system which provides caching for a first memory address region, and one or more shared-memory cache systems which provide caching for one or more other address regions.
22. A system, comprising multiple processing nodes; each processing node containing some memory which is local and private to that node, and a portion of other memory which is visible and usable by all nodes.
23. The system of claim 22, in which a local cache system provides the caching function for a first memory address region, and one or more shared-memory cache systems provide the caching function for one or more other memory address regions.
