ACTIVE DYNAMIC RANDOM ACCESS MEMORY
BACKGROUND OF THE INVENTION
1. FIELD OF THE INVENTION
This invention relates to computer systems, and more specifically to DRAM memory architectures.
All SPARC trademarks are used under license and are trademarks or registered trademarks of SPARC International in the United States and other countries. Products bearing SPARC trademarks are based upon an architecture developed by Sun Microsystems, Inc.
2. BACKGROUND ART
Computers and computer networks have become mainstays in business and education environments. The increasing affordability of computer systems and the ease and accessibility of the Internet have fueled the growth of computers in home environments as well. Advances in integrated circuit design, processing and manufacturing continue to push the performance (e.g., speed and capacity) of computer systems to new levels to meet the growing needs of the computing community. Unfortunately, there is a widening performance gap between processors and memory that serves as a bottleneck to overall system performance.
When applications are executed by the computer, the application instructions (also referred to as "code") and data are loaded into DRAM modules
where the instructions and data may be accessed by the processor(s) as needed. Processor wait states or no-ops (null instructions) must be inserted into the processor execution stream to accommodate the latency between when a processor first requests data from memory and when the requested data is returned from memory. These added wait states reduce the average number of instructions that may be executed per unit of time regardless of the processor speed. Memory access latency is, at least in part, a product of the physical separation between the processor and the memory device. Memory access is constrained by the bandwidth of the system bus, and is further limited by the need for extra interface circuitry on the processor and memory device, such as signal level translation circuits, to support off-chip bus communication.
In present computer systems, the execution unit of a computer typically comprises one or more processors (also referred to as microprocessors) coupled to one or more DRAM (dynamic random access memory) modules via address and data busses. A processor contains instruction decoders and logic units for carrying out operations on data in accordance with the instructions of a program. These operations include reading data from memory, writing data to memory, and processing data (e.g., with Boolean logic and arithmetic circuits). The DRAM modules contain addressable memory cells for storage of application data for use by the processor.
The processors and DRAM modules comprise separate integrated circuit chips (ICs) connected directly or indirectly to a common electronic substrate, such as a circuit board (e.g., a "motherboard"). The processor IC is commonly inserted into a socket in the motherboard. A DRAM module is commonly provided as a SIMM (single in-line memory module) or DIMM (dual in-line memory module), or a variation thereof, that supports one or more DRAM ICs
on a small circuit board that is inserted, via an edge connector, into the motherboard. A SIMM module provides identical I/O on both sides of the edge connector, whereas a DIMM module provides separate I/O on each side of the edge connector. Computers typically support multiple memory modules for greater capacity.
Each DRAM IC comprises an array of memory cells. Each memory cell stores one bit of data (i.e., a "0" or a "1"). A DRAM IC may be one-dimensional, i.e., configured with one bit per address, or the DRAM IC may be configured with multiple bits per address. Typically, no more than sixteen bits of a data word are stored in a single DRAM IC. For example, a DRAM IC with sixteen megabits of storage may be configured as 16M x 1 (2^24 one-bit values), 8M x 2 (2^23 two-bit values), 4M x 4 (2^22 four-bit values), 2M x 8 (2^21 eight-bit values), or 1M x 16 (2^20 sixteen-bit values). A SIMM might contain eight one-bit wide DRAM ICs (plus an additional one-bit wide DRAM IC for parity in some systems). In this case, each byte of data is split among the DRAM ICs (e.g., with one bit stored in each).
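The width/depth trade-off described above is a simple arithmetic constraint: depth times width always equals the total cell count. The following sketch (Python, illustrative only; not part of any claimed apparatus) checks the configurations listed:

```python
# A sixteen-megabit DRAM IC holds 2**24 cells, organized as depth x width.
TOTAL_BITS = 2**24

def depth_for_width(width_bits):
    """Number of addressable locations for a given data-word width."""
    return TOTAL_BITS // width_bits

# The five configurations from the text: 1-, 2-, 4-, 8- and 16-bit widths.
configs = {w: depth_for_width(w) for w in (1, 2, 4, 8, 16)}
```

Each entry reproduces one configuration from the text, e.g. a one-bit width yields 2^24 locations and a sixteen-bit width yields 2^20 locations.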
SIMM and DIMM access operations are asynchronous in nature, and a separate data address is provided for each access operation. Each access operation may take several bus cycles to complete, even when performed within the same memory page. Synchronous memory technologies, such as synchronous DRAM (SDRAM) and Rambus DRAM (RDRAM), have been developed to improve memory performance for sequential access. In SDRAM, memory access is synchronized with a bus clock, and a configuration register is provided to specify how many consecutive words to access. The standard delay is incurred in accessing the first addressed word, but the specified number of consecutive data words are accessed automatically by the SDRAM module in
consecutive clock cycles. The configuration register may be set to specify access for one data word (e.g., one or two bytes), two data words, four data words, etc., up to a full page of data at a time.
For example, with the configuration register set to access four words, a memory access operation at location X might take five bus clock cycles, but locations X+1, X+2 and X+3 would be output on the following first, second and third bus clock cycles, assuming the access was performed within the bounds of a single page. This type of access is denoted as 5/1/1/1, referring to an initial latency of five bus clock cycles for the first data word, and a delay of one bus clock cycle for each of the following three words. In contrast, a SIMM or DIMM might have performance on the order of 5/3/3/3 for four consecutive data words. For truly random access (i.e., non-consecutive data words), the initial latency value (e.g., 5) would still apply to each data word.
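The timing notation above reduces to straightforward arithmetic. A short sketch (Python, illustrative only) computes total bus cycles for the three cases discussed:

```python
def burst_cycles(initial_latency, per_word_delay, num_words):
    """Total bus cycles to read num_words consecutive words within one page:
    the initial latency for the first word, then the per-word delay for each
    subsequent word."""
    return initial_latency + per_word_delay * (num_words - 1)

sdram_burst = burst_cycles(5, 1, 4)   # 5/1/1/1 access
simm_burst  = burst_cycles(5, 3, 4)   # 5/3/3/3 access
random_read = 5 * 4                   # non-consecutive: full latency per word
```

The 5/1/1/1 burst completes four words in eight cycles, versus fourteen for 5/3/3/3 and twenty for four truly random accesses.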
Even with SDRAM, the minimum data access time is limited by the bus speed, i.e., the period of the bus clock cycle. For example, a common bus speed in personal computers is 100 MHz, whereas processor clock speeds are now in the range of 0.5-1 GHz. It is therefore not uncommon for the processor to outpace memory performance by a factor of five or more.
Prior art computer systems have employed mechanisms for hiding, to some extent, the performance limitations of a separate DRAM memory. These mechanisms include complicated caching schemes and access scheduling algorithms.
Caching schemes place one or more levels of small, high-speed memory (referred to as cache memory or the "cache") between the processor and DRAM.
The cache stores a subset of data from the DRAM memory, which the processor is able to access at the speed of the cache. When the desired data is not within the cache (referred to as a cache "miss"), a memory management system must fetch the data into the cache from DRAM, after which the processor may access the data from the cache. If the cache is already full of data, it may also be necessary to write a portion of the cached data back to DRAM before new data may be transferred from DRAM to the cache. Cache performance is dependent on the locality of data in a currently executing program. Dispersed data results in a greater frequency of cache misses, diminishing the performance of the cache. Further, the complexity of the system is increased by the memory management unit needed to control cache operations.
Access scheduling algorithms attempt to anticipate memory access operations and perform prefetching of data to minimize the amount of time a processor must wait for data. Memory access operations may be queued and scheduled out of order to optimize access. Access scheduling may be performed at compile time by an optimizing compiler and/or at runtime by a scheduling mechanism in the processor.
Prefetching is effective for applications which perform memory access operations intermittently, and which have sufficient operations independent of the memory access to occupy the processor while prefetching is performed. However, if there are multiple memory access operations within a short period, each successive prefetching operation may cause delays in subsequent prefetching operations, diminishing the effectiveness of prefetching. Further, where operations are conditioned on data in memory, the processor may still experience wait states if there are insufficient independent operations to perform. Also, where data access is conditioned on certain operations,
prefetching may not be feasible. Thus, as with caching schemes, the performance of access scheduling algorithms is program dependent and only partially effective. Further, the implementation of access scheduling introduces further complexity into the system.
SUMMARY OF THE INVENTION
An active dynamic random access memory (DRAM) architecture is described. The active DRAM device is configured with a standard DRAM interface, an array of memory cells, a processor, local program memory and high speed interconnect. The processor comprises, for example, a vector unit that supports chaining operations in accordance with program instructions stored in the local program memory. By integrating a processor with the memory in the DRAM device, data processing operations, such as graphics or vector processing, may be carried out within the DRAM device without the performance constraints entailed in off-chip bus communication. A host processor accesses data from its respective active DRAM devices in a conventional manner via the standard DRAM interface.
In an embodiment, multiple DRAM devices are coupled via the high speed interconnect to implement a parallel processing architecture using a distributed shared memory (DSM) scheme. The network provided by the high speed interconnect may include active DRAM devices of multiple host processors.
BRIEF DESCRIPTION OF THE DRAWINGS
Figure 1 is a block diagram of a general purpose computer system wherein an active DRAM device may be implemented in accordance with an embodiment of the invention.
Figure 2 is a block diagram of an active DRAM device in accordance with an embodiment of the invention.
Figure 3A is a block diagram of a high-speed interconnect used for DSM communication in an active DRAM device in accordance with an embodiment of the invention.
Figure 3B is a block diagram of an S-connect node for use in the interconnect of Figure 3A.
Figure 4 is a block diagram of an embodiment of a processor for use in an active DRAM device in accordance with an embodiment of the invention.
Figure 5A is a block diagram of a vector processing apparatus in accordance with an embodiment of the invention.
Figure 5B is a block diagram of a vector processing apparatus with chaining in accordance with an embodiment of the invention.
Figure 6 is a block diagram of a DSM system implemented with multiple active DRAM devices in accordance with an embodiment of the invention.
DETAILED DESCRIPTION OF THE INVENTION
An active dynamic random access memory (DRAM) architecture is described. In the following description, numerous specific details are set forth in order to provide a more thorough description of the present invention. It will be apparent, however, to one skilled in the art, that the present invention may be practiced without these specific details. In other instances, well-known features have not been described in detail so as not to obscure the invention.
An embodiment of the invention implements a processor on-chip with
DRAM memory. The on-chip processor comprises a vector unit for applications such as graphics processing. The DRAM memory is made dual ported so that the on-chip processor may access the data via an internal high-speed port while a host computing system accesses the data from off-chip via a conventional DRAM interface. The resulting DRAM device appears to the host computing system as a conventional DRAM device, but provides internal processing capabilities with the speed and performance of on-chip signaling.
In one embodiment, a high-bandwidth serial interconnect is provided for distributed shared memory (DSM) communication with other similar active DRAM devices. Using DSM, a parallel processing architecture is achieved that provides an inexpensive, scalable supercomputing model in a conventional computer system.
Embodiment of Host Computer System
The computer systems described below are for purposes of example only. An embodiment of the invention may be implemented in any type of stand-alone or distributed computer system or processing environment, including implementations in network computers (NCs) and embedded devices (e.g., web phones, smart appliances, etc.).
An embodiment of the invention can be implemented, for example, as a replacement for, or an addition to, main memory of a processing system, such as the general purpose host computer 100 illustrated in Figure 1. A keyboard 110 and mouse 111 are coupled to a system bus 118. The keyboard and mouse are for introducing user input to the computer system and communicating that user input to processor 113. Other suitable input devices may be used in addition to, or in place of, the mouse 111 and keyboard 110. I/O (input/output) unit 119 coupled to system bus 118 represents such I/O elements as a printer, A/V (audio/video) I/O, etc.
Computer 100 includes a video memory 114, main memory 115 (such as one or more active DRAM devices) and mass storage 112, all coupled to system bus 118 along with keyboard 110, mouse 111 and processor 113. The mass storage 112 may include both fixed and removable media, such as magnetic, optical or magneto-optical storage systems or any other available mass storage technology. Bus 118 may contain, for example, thirty-two address lines for addressing video memory 114 or main memory 115. The system bus 118 also includes, for example, a 64-bit data bus for transferring data between and among the components, such as processor 113, main memory 115, video memory 114 and mass storage 112. Alternatively, multiplexed data/address lines may be used instead of separate data and address lines.
In one embodiment of the invention, the processor 113 is a SPARC microprocessor from Sun Microsystems, Inc., a microprocessor manufactured by Motorola, such as the 680X0 processor, or a microprocessor manufactured by Intel, such as the 80X86 or Pentium processor. However, any other suitable microprocessor or microcomputer may be utilized. Main memory 115 comprises one or more active DRAM devices in accordance with an embodiment of the invention. Video memory 114 is a dual-ported video random access memory. One port of the video memory 114 is coupled to video amplifier 116. The video amplifier 116 is used to drive the cathode ray tube (CRT) raster monitor 117. Video amplifier 116 is well known in the art and may be implemented by any suitable apparatus. This circuitry converts pixel data stored in video memory 114 to a raster signal suitable for use by monitor 117. Monitor 117 is a type of monitor suitable for displaying graphic images. Alternatively, the video memory could be used to drive a flat panel or liquid crystal display (LCD), or any other suitable data presentation device.
Computer 100 may also include a communication interface 120 coupled to bus 118. Communication interface 120 provides a two-way data communication coupling via a network link 121 to a local network 122. For example, if communication interface 120 is an integrated services digital network (ISDN) card or a modem, communication interface 120 provides a data communication connection to the corresponding type of telephone line, which comprises part of network link 121. If communication interface 120 is a local area network (LAN) card, communication interface 120 provides a data communication connection via network link 121 to a compatible LAN. Communication interface 120 could also be a cable modem or wireless interface. In any such implementation, communication interface 120 sends and receives electrical, electromagnetic or optical signals which carry digital data streams representing various types of information.
Network link 121 typically provides data communication through one or more networks to other data devices. For example, network link 121 may provide a connection through local network 122 to local server computer 123 or to data equipment operated by an Internet Service Provider (ISP) 124. ISP 124 in turn provides data communication services through the world wide packet data communication network now commonly referred to as the "Internet" 125. Local network 122 and Internet 125 both use electrical, electromagnetic or optical signals which carry digital data streams. The signals through the various networks and the signals on network link 121 and through communication interface 120, which carry the digital data to and from computer 100, are exemplary forms of carrier waves transporting the information.
Computer 100 can send messages and receive data, including program code, through the network(s), network link 121, and communication interface 120. In the Internet example, remote server computer 126 might transmit a requested code for an application program through Internet 125, ISP 124, local network 122 and communication interface 120. The received code may be executed by processor 113 as it is received, and/or stored in mass storage 112, or other non-volatile storage for later execution. In this manner, computer 100 may obtain application code in the form of a carrier wave.
Application code may be embodied in any form of computer program product. A computer program product comprises a medium configured to store or transport computer readable code or data, or in which computer readable code or data may be embedded. Some examples of computer program products are CD-ROM disks, ROM cards, floppy disks, magnetic tapes, computer hard drives, servers on a network, and carrier waves.
Embodiment of an Active DRAM Device
An embodiment of the invention is implemented as a single active DRAM IC device, with an internal processor and DRAM memory sharing a common semiconductor substrate. Prior art DRAM ICs typically store only a portion of a full processor data word (i.e., an "operand"), with the complete processor data word being interleaved between multiple DRAM ICs. In accordance with an embodiment of the invention, the DRAM memory within the active DRAM device is configured to store complete operands. For example, for a double-precision floating point data type, sixty-four bits are stored within the active DRAM device. This permits the internal processor to operate on data within the memory space of its respective active DRAM device without the bandwidth limitations of an off-chip system bus or related off-chip I/O circuitry.
Figure 2 is a block diagram illustrating an active DRAM device in accordance with an embodiment of the invention. The device comprises processor 205, program memory 201, interconnect 203 and two-port DRAM memory 206, all of which are coupled to internal data bus 411 and address bus 412. In addition to the data and address busses, the device may include a control bus for the communication of bus control signals in accordance with the implemented bus communication protocol (e.g., synchronization, bus request and handshaking signals, etc.). In other embodiments, more than one pair of internal data and address busses may be used (e.g., interconnect 203 and processor 205 may communicate over a separate bus or other signaling mechanism).
DRAM memory 206 comprises a block of conventional high density DRAM cells. The manner in which the DRAM cells are configured may differ for
various embodiments. In one embodiment, DRAM memory 206 comprises multiple banks of DRAM cells. Banks facilitate the interleaving of access operations for faster memory access times and better collision performance (e.g., the internal processor 205 may be accessing one bank of data while a host processor accesses another bank of data via the conventional DRAM interface 207).
For example, in a 16-megabit DRAM memory block, one possible configuration is to have two banks of memory cells, each bank having thirty-two rows of 256-bit data. The column width may vary among embodiments depending on the bus width of data bus 411. For example, data bus 411 may be sixty-four bits wide to accommodate access of a full double-precision floating point value from DRAM memory 206 in a single read or write cycle. Alternatively, data bus 411 may be eight or sixteen bits wide, in which case multiple read or write cycles (depending on the number of bytes used to represent the data type of the requested operand) may be utilized to access a complete data value in DRAM memory 206.
The memory address is typically (though not necessarily) divisible into a bank address, a row address and a column address. The bank address is used to select the desired bank of storage cells, and the row and column addresses are used to select the row and column of storage cells within the specified bank. The row address may be used to precharge the desired row of storage cells prior to reading of data. Data is read from or written to the subset of storage cells of the selected row specified by the column address.
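The bank/row/column decomposition described above amounts to extracting bit fields from a flat address. The following sketch (Python, illustrative only) assumes hypothetical field widths consistent with the two-bank, thirty-two-row example: one bank bit, five row bits, and five column bits, none of which are specified by the text:

```python
# Hypothetical field widths (assumptions for illustration only).
BANK_BITS, ROW_BITS, COL_BITS = 1, 5, 5

def split_address(addr):
    """Decompose a flat memory address into (bank, row, column) bit fields,
    with the column in the low-order bits and the bank in the high-order bits."""
    col  = addr & ((1 << COL_BITS) - 1)
    row  = (addr >> COL_BITS) & ((1 << ROW_BITS) - 1)
    bank = (addr >> (COL_BITS + ROW_BITS)) & ((1 << BANK_BITS) - 1)
    return bank, row, col
```

The bank field selects the bank to access, the row field selects (and may precharge) a row, and the column field selects the subset of cells within that row.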
A further component of the active DRAM device is a conventional DRAM interface 207 coupled to a second port of DRAM memory 206 for off-chip
communication with a host processor. DRAM interface 207 may be, for example, a conventional SDRAM or Rambus interface as is known in the art. In an alternate embodiment, DRAM memory 206 may have a single port that is shared between the internal processor 205 and the DRAM interface 207. For the single-port embodiment, a collision mechanism may be employed to stall the instruction execution pipeline of one or the other of the internal processor 205 and the external host processor (not shown) in the event that both processors wish to access the single port of DRAM memory 206 at the same time (e.g., concurrent write requests). As described above, DRAM memory 206 may be configured in multiple banks such that concurrent memory access to different banks of memory do not result in a collision.
Program memory 201 may be either additional DRAM memory or static RAM (SRAM) memory, for example. In one embodiment, program memory 201 is addressable by processor 205, but is separate from the addressable memory space accessible via conventional DRAM interface 207. In other embodiments, program memory 201 may be part of the memory space accessible to DRAM interface 207. Program memory 201 is used to store the program instructions embodying routines or applications executed by internal processor 205. Additionally, program memory 201 may be used as a workspace or temporary storage for data (e.g., operands, intermediate results, etc.) processed by processor 205. A portion of program memory 201 may serve as a register file for processor 205.
A non-volatile memory resource 202 is provided in tandem with program memory 201 in the form of flash SRAM, EEPROM or other substantially nonvolatile memory elements. Memory 202 is configured to provide firmware for processor 205 for system support such as configuration parameters and routines
for start-up, communications (e.g., DSM routines for interacting with interconnect 203) and memory management.
Interconnect 203 provides a high-bandwidth connection mechanism for internal processor 205 to communicate off-chip with other devices. In one embodiment of the invention, interconnect 203 provides one or more channels for supporting a distributed shared memory (DSM) environment comprising, for example, multiple active DRAM devices. (It will be obvious to one skilled in the art that processor and/or memory elements other than active DRAM devices may also be coupled into the DSM environment via connection to interconnect 203.)
Interconnect 203 and processor 205 may be configured to support communications (e.g., DSM communications) via message passing or shared memory as is known in the art. One suitable interconnect mechanism in accordance with an embodiment of the invention is a conventional S-connect node used for packet-based communication. Figures 3A and 3B are block diagrams of an S-connect node.
S-connect node 300 may comprise a crossbar switch (6 x 6: six packet sources and six packet drains) with four 1.3 GHz serial ports (204) and two 66 MHz 16-bit parallel ports (301). The parallel and serial ports are bi-directional and full duplex, and each has a data bandwidth of approximately 220 megabytes per second. As shown in Figure 3B, S-connect node 300 further comprises a pool of buffers (304) for incoming and outgoing packets, transceivers (XCVRs 1-4) for each serial port, a distributed phase-locked loop (PLL 305) circuit and one or more routing tables and queues (303).
In Figure 3B, XCVRs 1-4, as well as parallel ports A and B, are coupled to router/crossbar switch 302. Router/crossbar switch 302 receives an input data packet from a serial or parallel port acting as a packet source or input and accesses a routing table in routing tables/queues 303 to determine an appropriate serial or parallel port to act as a packet drain or destination. If the given port selected as the packet drain is idle, router/crossbar switch 302 may transfer the packet directly to the given port. Otherwise, the packet may be temporarily stored in buffer pool 304. A pointer to the given buffer is stored in routing tables/queues 303 within a queue or linked list associated with the given port.
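The routing policy just described (deliver directly to an idle drain port, otherwise buffer the packet and queue a pointer for that port) can be sketched as follows (Python, illustrative only; a software toy model, not the S-connect hardware):

```python
from collections import deque

class RoutingNode:
    """Toy model of the drain-port policy: an idle port receives the packet
    directly; a busy port has the packet queued for later delivery."""
    def __init__(self, ports):
        self.idle = {p: True for p in ports}       # drain-port availability
        self.queues = {p: deque() for p in ports}  # per-port pending packets
        self.delivered = []                        # (port, packet) log

    def route(self, packet, drain):
        if self.idle[drain]:
            self.idle[drain] = False               # port becomes busy
            self.delivered.append((drain, packet)) # direct transfer
        else:
            self.queues[drain].append(packet)      # buffered, queued per port
```

A packet arriving for a busy port waits in that port's queue, analogous to the buffer-pool pointer kept in routing tables/queues 303.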
PLL 305 is coupled to each of the serial ports to receive incoming packet streams. Each such packet stream is timed in accordance with the clock of its source interconnect circuit. PLL 305 is configured with multiple phase detectors to determine the phase error between each incoming packet stream and the clock signal (CLK) generated by PLL 305. The sum of the phase errors from all of the serial inputs is used to adjust the clock signal of PLL 305 to provide synchronization with other interconnect nodes in the system.
An embodiment of the invention configures an S-connect node with both parallel ports 301 coupled to processor 205 (e.g., via busses 411 and 412), and with the serial ports (serial links 204) providing mechanisms for I/O communication with other devices or communication nodes. In one embodiment, interconnect 203 is implemented with an S-connect macrocell comprising approximately fifty thousand logic gates or less. It will be obvious to one skilled in the art that other forms of interconnect circuits may be utilized in place of, or in addition to, S-connect nodes to provide a high-bandwidth I/O communication mechanism for DSM applications.
Referring back to Figure 2, processor 205 provides the main mechanism for executing instructions associated with application programs, operating systems and firmware routines, for example. Execution is carried out by a process of fetching individual instructions from program memory 201 or Flash SRAM/EEPROM 202, decoding those instructions, and exerting control over the components of the processor to implement the desired function(s) of each instruction. The manner in which processor 205 carries out this execution process is dependent on the individual components of the given processor and the component interaction provided by the defined instruction set for the given processor architecture.
Processor 205 is described below with reference to Figure 4, which illustrates an example vector processor architecture in accordance with one embodiment of the invention. It will be apparent that processor 205 may similarly implement any known scalar, vector or other form of processor architecture in accordance with further embodiments of the invention.
In Figure 4, processor 205 comprises arithmetic logic units (ALUs) 400A-B, optional register file 404, program counter (PC) register 408, memory address register 407, instruction register 406, instruction decoder 405 (also referred to as the control unit) and address multiplexer (MUX) 409. Instruction decoder 405 issues control signals to each of the other elements of processor 205 via control bus 410. Data is exchanged between elements 404-408 and memory 201, 202 or 206 via data bus 411. Multiplexer 409 is provided to drive address bus 412 from either program counter register 408 or memory address register 407.
Program counter register 408, memory address register 407 and instruction register 406 are special function registers used by processor 205. If the processor architecture implements a stack memory structure, processor 205 may also include a stack pointer register (not shown) to store the address for the top element of the stack. Program counter register 408 is used to store the address of the next instruction for the current process, and is updated during the execution of each instruction.
Updating program counter register 408 typically consists of incrementing the current value of program counter register 408 to point to the next instruction. However, for branching instructions, program counter register 408 may be directly loaded with a new address as a jump destination, or an offset may be added to the current value. In the execution of a conditional branching instruction, condition codes generated by arithmetic logic units 400 A-B may be used in the determination of which update scheme is applied to program counter register 408.
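The three update schemes above (fall-through increment, direct load for a jump, and offset addition for a taken conditional branch) can be summarized in a short sketch (Python, illustrative only; the four-byte instruction width is an assumption, not specified by the text):

```python
INSTR_SIZE = 4  # assumed instruction width in bytes (hypothetical)

def next_pc(pc, op, target=None, offset=0, condition=False):
    """Select the program counter update scheme for the current instruction."""
    if op == "jump":
        return target                 # PC loaded directly with a new address
    if op == "branch" and condition:  # condition codes from the ALUs decide
        return pc + offset            # offset added to the current PC value
    return pc + INSTR_SIZE            # default: advance to the next instruction
```

A not-taken conditional branch falls through to the increment case, which is how the condition codes determine which update scheme is applied.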
When it is time to fetch the next instruction, program counter register 408 drives the stored PC value onto address bus 412 via MUX 409. Control bus 410 from instruction decoder 405 controls the loading of program counter register 408 and the selection mechanism of MUX 409 to drive the PC value onto address bus 412.
Memory address register 407 is used to hold the target memory address during execution of memory access instructions, such as "load" and "store." In variations of the load and store instructions, the memory address may be loaded into memory address register 407 from one of the registers in register file 404, from the results of an ALU operation, or from an address bit-field extracted
from an instruction (e.g., via masking and shifting) based on the implemented instruction set. When the memory access is initiated, memory address register 407 drives its output signal onto address bus 412 via MUX 409.
Incoming data (including instructions) loaded via data bus 411 from memory 201, 202 or 206, or other components external to processor 205, may be stored in one of the registers of processor 205, including register file 404. The address asserted on address bus 412 by program counter register 408 or memory address register 407 determines the origination of the incoming data. Outgoing data is output from one of the registers of processor 205 and driven onto data bus 411. The address asserted on address bus 412 determines the outgoing data's destination address, in DRAM memory 206 for example. Instruction decoder 405 enables loading of incoming data via control bus 410.
When an instruction is fetched from memory (e.g., from program memory 201), the instruction is loaded into instruction register 406 (enabled by instruction decoder 405 via control bus 410) where the instruction may be accessed by instruction decoder 405. In one embodiment, processor 205 is configured as a dual-issue processor, meaning that two instructions are fetched per fetch cycle. Both instructions are placed into instruction register 406 concurrently for decoding and execution in parallel or pipelined fashion. Instruction decoder 405 may further comprise a state machine for managing instruction fetch and execution cycles based on decoded instructions.
Arithmetic logic units 400A-B provide the data processing or calculating capabilities of processor 205. Arithmetic logic units 400A-B comprise, for example, double-input/single-output hardware for performing functions such as integer arithmetic (add, subtract), boolean operations (bitwise AND, OR, NOT
(complement)) and bit shifts (left, right, rotate). To improve the performance of the processor, arithmetic logic units 400A-B may further comprise hardware for implementing more complex functions, such as integer multiplication and division, floating-point operations, specific mathematical functions (e.g., square root, sine, cosine, log, etc.) and vector processing or graphical functions, for example. Control signals from control bus 410 control multiplexers or other selection mechanisms within arithmetic logic units 400A-B for selecting the desired function. In a multi-issue instruction architecture, arithmetic logic units 400A-B may comprise multiple ALUs in parallel for simultaneous execution of instructions. In a single issue architecture, a single ALU may be implemented.
As shown in Figure 4, one ALU embodiment comprises vector operations unit 401, vector addressing unit 402 and a scalar operations unit 403. Vector operations unit 401 comprises pipelined vector function hardware (e.g., adders, multipliers, etc.) as will be more fully described later in this specification. Vector addressing unit 402 may be implemented to provide automatic generation of memory addresses for elements of a vector based on a specified address for an initial vector element and the vector stride, where the vector stride represents the distance in memory address space between consecutive elements of a vector. Scalar operations unit 403 comprises scalar function hardware as previously described for performing single-value integer and floating point operations.
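The stride-based address generation described for vector addressing unit 402 can be sketched as follows. This is an illustrative behavioral model only; the function name and signature are assumptions, not part of the specification.

```python
# Hypothetical sketch of stride-based vector address generation, as
# described for vector addressing unit 402. Given the address of the
# initial vector element and the vector stride (the distance in memory
# address space between consecutive elements), the unit produces the
# address of each element automatically.

def element_addresses(base_addr, stride, length):
    """Return the memory address of each element of a vector."""
    return [base_addr + i * stride for i in range(length)]
```

For example, a four-element vector starting at address 0x1000 with a stride of 8 occupies addresses 0x1000, 0x1008, 0x1010 and 0x1018.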
The inputs and outputs of arithmetic logic units 400A-B are accessed via optional register file 404 (or directly from local memory 201, 202 or 206). The specific registers of register file 404 that are used for input and output are selected by instruction decoder 405 based on operand address fields in the decoded instructions. Additional output signals (not shown) may include condition codes such as "zero," "negative," "carry" and "overflow." These
condition codes may be used in implementing conditional branching operations during updating of program counter register 408.
Register file 404 comprises a set of fast, multi-port registers for holding data to be processed by arithmetic logic units 400A-B, or otherwise frequently used data. The registers within register file 404 are directly accessible to software through instructions that contain source and destination operand register address fields. Register file 404 may comprise multiple integer registers capable of holding 32-bit or 64-bit data words, as well as multiple floating point registers, each capable of storing a double-precision floating point data word. A vector processor embodiment may also contain multiple vector registers or may be configured with a vector access mode to access multiple integer or floating point registers as a single vector register.
The concepts described herein may be applied to register files and registers (i.e., data words) of any size. Further, because memory access to memories 201, 202 and 206 is local, processor 205 may operate directly on data in memories 201, 202 or 206 without the use of register file 404 as an intermediate storage resource. Alternatively, a region of local memory (e.g., a specified address range of program memory 201) may be treated as a register file.
It will be obvious to one skilled in the art that other processor architectures may be implemented within an active DRAM device without departing from the scope of the invention.
Vector Processing with Chaining
In the field of supercomputing, vector processing is used in a massively parallel processing environment to improve computer processing performance for vector-based applications such as graphics or science applications. In accordance with an embodiment of the invention, an active DRAM device provides an inexpensive, scalable mechanism for implementing a supercomputing system using distributed shared memory (DSM). For enhanced vector processing performance, embodiments of the invention may be implemented with a vector processor that supports chaining of vector functions. Vector processing and chaining are described below with reference to Figures 5A-B.
In a vector processing system, data operands are accessed as data vectors (i.e., an array of data words). Vectors may comprise individual elements of most data types. For example, vector elements may comprise double precision floating point values, single precision floating point values, integers, or pixel values (such as RGB values). A pipelined vector processor performs equivalent operations on all corresponding elements of the data vectors. For example, vector addition of first and second input vectors involves adding the values of the first elements of each input vector to obtain a value for the first element of an output vector, then adding the values of the second elements of the input vectors to obtain a value for a second element of the output vector, etc.
Pipelining entails dividing an operation into sequential sections and placing registers for intermediate values between each section. The system can then operate with improved throughput at a speed dependent upon the delay of the slowest pipeline section, rather than the delay of the entire operation. An example of a pipelined function is a 64-bit adder with pipeline stages for adding values in eight-bit increments (e.g., from least significant byte to most significant
byte). Such an adder would need eight pipeline stages to complete addition of 64-bit operands, but would have a pipeline unit delay equivalent to a single eight-bit adder.
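The eight-stage byte-wise adder described above can be modeled behaviorally as follows. This is a software sketch of the hardware concept, not an implementation of the claimed device; each loop iteration stands in for one pipeline stage.

```python
# Behavioral model of a 64-bit adder pipelined in eight-bit increments,
# from least significant byte to most significant byte. Each "stage"
# adds one byte of each operand plus the carry from the previous stage.

def pipelined_add64(a, b):
    carry, result = 0, 0
    for stage in range(8):                    # one byte per pipeline stage
        byte_a = (a >> (8 * stage)) & 0xFF
        byte_b = (b >> (8 * stage)) & 0xFF
        s = byte_a + byte_b + carry
        result |= (s & 0xFF) << (8 * stage)   # stage output byte
        carry = s >> 8                        # carry feeds the next stage
    return result & 0xFFFFFFFFFFFFFFFF       # wrap on 64-bit overflow
```

Although eight stage delays are needed to complete one addition, the delay of any single stage is only that of an eight-bit adder, which sets the clock rate of the pipeline.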
Once a first input pair of vector elements has passed through the initial pipeline stage of a vector function, a second pair of vector elements may be input into the initial pipeline stage while the first pair of vector elements is processed by the second pipeline stage. After an initial latency period during which data vectors are constructed with data from memories 201, 202 or 206, and the first vector elements are propagated through the vector function pipeline, a new output vector element is generated from the vector function each cycle.
Figure 5A illustrates a vector operation carried out on input vectors A and B to generate output vector C. For example, an n-element vector multiplication operation would be represented as:
A * B = C = [ (A0 * B0), (A1 * B1), (A2 * B2), ..., (An * Bn) ]
Each of vectors A-C has an associated stride value specifying the distance between consecutive vector elements in memory address space. Vectors A and B are provided as inputs 501A and 501B, respectively, to vector function 500. Output 502 of vector function 500 is directed to vector C. Vector function 500 comprises multiple pipeline stages or units each having a propagation delay no greater than the period of the processor clock. The latency of vector function 500 is equivalent to the number of pipeline stages multiplied by the period of the processor clock, plus the delay for constructing vectors A-C, if any. Vectors A-C may be stored as elements in respective vector registers, or the vectors may be
accessed as needed directly from memory using the address of the first element of a respective array and an offset based on the vector's stride multiplied by the given element index.
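The direct-from-memory access just described, using a first-element address and a stride-scaled offset, can be sketched as follows. Memory is modeled here as a flat word array, and all names are illustrative assumptions.

```python
# Sketch of strided vector access: element i of a vector is fetched
# from memory at (first-element address + i * stride), so no separate
# vector register is required.

def read_vector(memory, base, stride, n):
    return [memory[base + i * stride] for i in range(n)]

def vector_multiply(memory, a_base, a_stride, b_base, b_stride, n):
    """Element-wise product A * B = C for two strided vectors."""
    A = read_vector(memory, a_base, a_stride, n)
    B = read_vector(memory, b_base, b_stride, n)
    return [x * y for x, y in zip(A, B)]
```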
Vector chaining is a technique applied to more complex vector operations having multiple arithmetic operations, such as:
(A * B) + D = E = [ ((A0 * B0) + D0), ((A1 * B1) + D1), ..., ((An * Bn) + Dn) ]
which may be also represented in terms of intermediate result vector C above as:
C + D = E = [ (C0 + D0), (C1 + D1), ..., (Cn + Dn) ]
The above relation could be calculated by first generating vector C from vectors A and B, and then, in a separate vector processing operation, generating vector E from vectors C and D. However, with chaining, it is unnecessary to wait for vector C to be complete before initiating the second vector operation. Specifically, as soon as the first element of vector C is generated by the multiplication pipeline, that element may be input into the addition pipeline with the first element of vector D. The resulting latency is equivalent to the sum of the latency of the multiplication operation and the propagation delay of the addition pipeline. The addition pipeline incurs no delay for creating vectors from memory, and throughput remains at an uninterrupted rate of one vector element per cycle. Chaining may be implemented with other vector operations as well, and is not limited to two such operations.
Figure 5B illustrates a chained vector function with vector function 500 performing the first operation (i.e., multiplication) and vector function 505 performing the second operation (i.e., addition). Vector C is generated from output 502 of vector function 500 as described for Figure 5A, but rather than being collected in its entirety before proceeding, output 502 is provided as input 503B to vector function 505. Vector D is provided to input 503A of vector function 505, synchronized with the arrival of the corresponding element of vector C from output 502. Function 505 generates vector E one element per cycle via output 504.
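The chaining behavior of Figure 5B can be sketched with generators, where each generator stands in for a pipelined function unit producing one element per cycle. This is a conceptual model under assumed names, not the hardware itself.

```python
# Behavioral sketch of vector chaining: the multiply pipeline yields one
# element of intermediate vector C per cycle, and each element is
# consumed immediately by the add pipeline together with the matching
# element of D, without waiting for vector C to be complete.

def multiply_pipe(A, B):
    for a, b in zip(A, B):
        yield a * b                 # one element of C per "cycle"

def add_pipe(C_stream, D):
    for c, d in zip(C_stream, D):
        yield c + d                 # one element of E per "cycle"

A, B, D = [1, 2, 3], [4, 5, 6], [10, 20, 30]
E = list(add_pipe(multiply_pipe(A, B), D))    # (A * B) + D = E
```

Here E is produced as [14, 30, 48]; each element of E becomes available one addition-pipeline delay after the corresponding element of C, rather than after the entire multiplication completes.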
Pipelined vector processors have a potential for high execution rates. For example, a dual-issue processor that achieves a processor speed of 400 MHz through pipelining is theoretically capable of up to 800 MFLOPs (million floating-point operations per second). Using the high-bandwidth interconnect circuits to scale the memory system with further active DRAM devices adds further to the processing potential. For example, a four gigabyte distributed shared memory (DSM) system implemented with sixteen-megabit active DRAM devices (i.e., sixteen megabits of DRAM memory 206) comprises 2048 active DRAM devices in parallel for a theoretical potential of 1.6 TeraFLOPs (10^12 floating-point operations per second) for strong supercomputing performance. In accordance with an embodiment of the invention, a DSM configuration using active DRAM devices is described below.
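The throughput figures above follow from simple arithmetic, which can be checked as follows (all values are the assumed figures from the example, not measured results).

```python
# Back-of-envelope check of the example figures:
# 4 GB of shared memory built from devices with 16 Mbit of DRAM each,
# each device's dual-issue processor running at 400 MHz.

devices = (4 * 2**30 * 8) // (16 * 2**20)   # 4 GB in bits / 16 Mbit per device
mflops_per_device = 2 * 400                  # dual-issue at 400 MHz
total_gflops = devices * mflops_per_device / 1000
```

This yields 2048 devices and roughly 1638 GFLOPs, i.e., approximately 1.6 TeraFLOPs of theoretical peak.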
Distributed Shared Memory (DSM)
In accordance with one or more embodiments of the invention, the processing and memory resources of multiple active DRAM devices are joined in a scalable architecture to enable parallel processing of applications over a
distributed shared memory space. Each added active DRAM device adds to the available shared memory space, and provides another internal processor capable of executing one or more threads or processes of an application.
The high-bandwidth interconnect 203 previously described provides the hardware mechanism by which the internal processor of one active DRAM device may communicate with another active DRAM device to access its DRAM memory resources. Data packets may be used to transmit data between devices under the direction of the router within each interconnect. Multiple devices can be coupled in a passive network via the serial links 204, with each device acting as a network node with its own router.
The software implementation of the shared memory space may be managed at the application level, using messaging or remote method invocations under direct control of the application programmer, or it may be managed by an underlying DSM system. If implemented at the application level, the programmer manages how data sets are partitioned throughout the distributed memory, and explicitly programs the transmission of messages, remote method invocations (RMI) or remote procedure calls (RPC) between the devices to permit remote memory access where needed. While the explicit nature of the application-level design may provide certain transmission efficiencies, it may be burdensome to the application programmer to handle the necessary messaging and partitioning details of distributed memory use.
In a DSM system, applications are written assuming a single shared virtual memory. The distributed nature of the shared memory is hidden from the application in the memory access routines or libraries of the DSM system, eliminating the need for the application programmer to monitor data set
partitions or write message passing code. An application may perform a simple load or store, for example, and the DSM system will determine whether the requested data location is local or remote. In the event that the data location is local, standard memory access is performed. However, in the event that the data location is remote, the DSM system transparently initiates a memory access request of the remote device using RMI, RPC, or any other suitable messaging protocol. The DSM routines at the remote device will respond to the memory access request independent of any application processes at the remote device.
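The local/remote dispatch just described can be sketched as follows. All names (LOCAL_BASE, remote_load, and so on) are hypothetical placeholders; in particular, remote_load stands in for an RMI, RPC, or other messaging-protocol request to the device owning the address.

```python
# Conceptual sketch of a DSM load: check whether the requested address
# falls in the local address range; if so, perform a standard local
# access, otherwise transparently forward the request to the remote
# device that owns the address.

LOCAL_BASE, LOCAL_SIZE = 0x0000, 0x1000
local_memory = {}

def remote_load(addr):
    # Placeholder for an RMI/RPC message to the owning remote device;
    # the remote DSM routines would service this independently of any
    # application processes running there.
    return 0

def dsm_load(addr):
    if LOCAL_BASE <= addr < LOCAL_BASE + LOCAL_SIZE:
        return local_memory.get(addr, 0)     # standard local access
    return remote_load(addr)                 # transparent remote access
```

The application simply issues the load; whether the data is local or remote is resolved entirely inside the DSM access routine.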
In accordance with one or more embodiments of the invention, multiple active DRAM devices may be coupled together in a shared memory configuration using application-level message passing or DSM techniques. Preferably, DSM hardware support is provided (e.g., within interconnect 203) that is transparent to the instruction set architecture of processor 205. For example, a memory reference is translated into either a local reference or a remote reference that is handled by the DSM hardware. Alternatively, DSM routines may be implemented as firmware in non-volatile memory 202, as software in program memory 201, as hardware routines built into the instruction set architecture of the internal processor 205, or as some combination of the above.
Figure 6 is a block diagram of a shared memory system comprising multiple active DRAM devices in accordance with an embodiment of the invention. The system comprises one or more host processors 600 and two or more active DRAM devices 602. In the configuration shown, host processors A and B are coupled to system busses 601A and 601B, respectively. Also coupled to system bus 601A via respective DRAM interfaces are multiple active DRAM devices A1-AN. Active DRAM devices A1-AN each have high-bandwidth
interconnect (e.g., serial links 604 and 605) coupled to a passive network 603 to exchange messages, such as for DSM communications or application-level messaging. Thus, multiple active DRAM devices A1-AN are configured to implement a distributed shared memory system. Host processor A may access data from active DRAM devices A1-AN via system bus 601A and the respective conventional DRAM interfaces of the devices.
The interconnect system described herein can support serial communications between devices up to 10 meters away over standard serial cable and up to 100 meters away over optical cables. Thus it is possible to scale a DSM system to include other active DRAM devices within a single computer system, to other active DRAM devices in add-on circuit boxes or rack-mount systems, and to other devices in other computer systems within the same building, for example. In Figure 6, for example, the system is scaled to include a second host processor (host processor B) coupled to a further active DRAM device B1 via a second system bus 601B. Active DRAM device B1 is coupled into network 603 via high-bandwidth interconnect 606. Thus, the memory resources of active DRAM device B1 become part of the shared memory of the DSM system. Network 603 could be used to interconnect other DSM devices as well, including devices other than active DRAM devices.
Thus, an active DRAM architecture has been described in conjunction with one or more specific embodiments. The invention is defined by the claims and their full scope of equivalents.