WO1994008295A1

WO1994008295A1 - Method and apparatus for memory interleave reduction

Info

Publication number: WO1994008295A1
Application number: PCT/US1993/009275
Authority: WO
Inventors: Richard E. Morley; Douglas H. Currie, Jr.; Gabor L. Szakacs
Original assignee: Flavors Technology Inc.
Priority date: 1992-10-01
Filing date: 1993-09-29
Publication date: 1994-04-14
Also published as: AU5294193A

Abstract

A method and apparatus is provided for allocating a memory resource that includes a plurality of memory banks (914, 918) to a plurality of slots (916, 920) of a backplane of a bus (24) that operates in accordance with a sequence of frames, each frame having a plurality of bus cycles (4A), such that each slot (914, 918) receives at least one grant of access (4B) to each of the memory banks (916, 920) of each memory resource within each frame. Apparatus is provided for achieving a variety of memory interleaving configurations, as well as a variety of memory interleave reduction factors. An embodiment of a memory board (26, 27) with a fixed memory interleave configuration and a fixed memory interleave reduction factor can be used in tandem with an embodiment of a memory board (26, 27) with a software selectable memory interleave configuration and interleave reduction factor.

Description

METHOD AND APPARATUS FOR MEMORY INTERLEAVE REDUCTION

CROSS-REFERENCES TO RELATED APPLICATIONS This application is a continuation-in-part application of Application Serial No. 07/351,185, filed May 12, 1989.

FIELD OF THE INVENTION This invention relates to synchronous buses for computer systems, and in particular to a high-speed synchronous bus adapted for use with real-time multiprocessing computer systems, and to a memory interleaving method and apparatus for use with the high-speed synchronous bus.

BACKGROUND OF THE INVENTION

Computer buses may be classified as synchronous or asynchronous. There are several classes of asynchronous buses, including: full asynchronous, wherein data transfer timing, data and command signals are not centrally synchronized; partial asynchronous, wherein the clock signal is centrally generated, but data and command signals are not; and "token" based systems, wherein command and control of the bus is determined by the possession of a token. In cases where the transfer timing is not centrally synchronized, each initiating device generates the bus transfer timing while it is in control of the bus using a clock that is local to the device. The timing of data transfer can also be controlled in part by the accessed device using an interlocked handshake. This scheme works well when one processor controls the bus for many data cycles because the clock signals are generated at the same point on the bus as the data, address, and control signals. Consequently, all these signals reach their destination with a known phase relationship that does not depend on the physical position of each board on the bus. An asynchronous scheme does not work well in systems which tend to interleave transfers among many processors, due to problems with unknown phase relationships.

Synchronous buses employ a central clock signal generator which times all data transfer operations. Using a synchronous bus makes it possible to approach optimum data transfer rates, while minimizing associated overhead. Synchronous buses allow "split transaction" operations that interleave two or more data transfers. In a split transaction system, the initiating processor will use one bus cycle to request a transfer to or from a responding processor. The responding processor will use a later cycle to return data and/or transfer completion status to the initiating processor. Between the request and return cycles, the bus may be used by other processors to perform other transfers. In this mode of operation a transfer uses no more than two bus cycles, not necessarily consecutively, regardless of the access time of the responding processor.

Synchronous buses are generally limited in performance by the maximum clock rate supported by the bus. This rate is determined by worst-case conditions in transfer timing. Asynchronous buses do not employ a clock, and instead are limited in overall performance only by the efficiency of the modules that plug into them.

One of the key factors in determining the maximum clock rate of a synchronous bus is the bus propagation delay. In a typical bus with a clock distribution scheme in which each module receives the clock signal at the same time, the worst- case propagation delay, or a multiple thereof, must be added to any component delays when calculating the minimum clock cycle time. Thus, regardless of the quickness of the system components, the ultimate limitations on bus transfer rates are the spatial and electrical properties of the bus backplane.

Large multiprocessing systems have a mix of performance requirements not presently available in any standard system bus. The bus requires very high performance due to the large number of processors accessing it. For example, the bus must be able to switch efficiently between many processors, yet minimize switching overhead. The bus must also be long enough to accommodate the insertion of enough modules to support large-scale parallelism, without reducing the maximum clock rate supported by the bus.

SUMMARY OF THE INVENTION A high-speed synchronous bus is provided for a computer system that allows efficient communication among a plurality of processors or other data devices coupled to the bus.

The bus provides systematic bus access, without arbitration, to data devices connected to the bus, and employs a central bus grant scheme for pre-allocated access to the bus by each data device. Data devices cannot seize or hold the bus. Means are provided for insuring that the bus is always in a read or write mode for a time which is long relative to the cycle time of the bus. Bus timing includes, for each bus frame, a number of time slots allocated to respective hardware connection slots for connecting data devices. During a bus frame, access to memory is synchronized.

According to another aspect of the invention, two clock signals are propagated in opposing directions on the bus to control similarly traveling data signals. Also included are means for achieving a fast bus cycle time, typically in the range of 20 - 60 nanoseconds per cycle, which is less than the cycle time of the computer system. A backplane is provided of a length which can be driven reliably at the cycle time of the bus, and which is modularly expandable. In a preferred embodiment, a first clock signal is reflected at an end of the bus to produce a second clock signal in a known phase relationship with the first clock signal, resulting in two signals traveling in opposing directions. The two oppositely traveling clock signals control similarly traveling data signals. Each memory board communicates with the bus at full bandwidth. Each processor uses only a portion of the bandwidth. All the processors taken together utilize the entire bus bandwidth.

In the preferred embodiment, the bus allocator includes bus control signal drivers in communicating relationship with the bus, adapted to provide signals to the bus; grant list hardware adapted to generate bus grants in a particular order, in cooperation with the bus control signal drivers; means for generating bus timing signals, in communicating relationship with the bus grant list hardware and the bus control signal drivers; and means for generating a clock signal, in communicating relationship with the bus timing generator. The bus grant list hardware grants memory access sequencing so that each of the processors obtains access to successive memory banks on successive bus accesses.

The bus of the invention eliminates system arbitration and includes a reduced number of bus lines with respect to a request-driven, randomly interleaved synchronous bus. The bus of the invention is able to run .at data rates comparable to those of an asynchronous bus, while enjoying the arbitration switching rate of a synchronous bus.

The bus of the invention provides the very high performance required by the large number of processors accessing it, i.e., it can switch efficiently between many processors, while minimizing switching overhead. The bus is long enough to accommodate the insertion of enough modules to support large-scale parallelism, without slowing down the bus.

Both the bus and the memory are used most of the time. Memory refreshes are executed at a rate that results in high reliability and low power consumption. The bus of the invention allows boards to be engaged and disengaged without requiring a system power-down, and allows software controllable synchronization of the bus frame with an A/C power source, or video raster timing generator. Also, a method and apparatus is provided for allocating a memory resource that includes a plurality of memory banks to a plurality of slots of a backplane of a bus that operates in accordance with a sequence of frames, each frame having a plurality of bus cycles, such that each slot receives at least one grant of access to each of the memory banks of each memory resource within each frame. Apparatus is provided for achieving a variety of memory interleaving configurations, as well as a variety of memory interleave reduction factors. An embodiment of a memory board with a fixed memory interleave configuration and a fixed memory interleave reduction factor can be used in tandem with an embodiment of a memory board with a software selectable memory interleave configuration and interleave reduction factor.

DESCRIPTION OF THE DRAWING

The invention will be more fully understood by reading the following detailed description, supported by the accompanying figures, in which:

Fig. IA is a block diagram of a multiprocessor computer system including the bus of the invention;

Fig. IB is a schematic diagram of the various lines included in the wide high-speed global bus of the invention;

Fig. 2 is a block diagram of the bus allocation (BOSS) board of Fig. 1; Fig. 3 is a block diagram of the global bus grant list hardware;

Figs. 4A through 4H are timing diagrams that illustrate operations of the system;

Fig. 5 is a timing diagram illustrating read and write timing;

Fig. 6 is a diagram of signal line access timing; Fig. 6A is a diagram showing the timing relationship of the grant lines with the other bus signal lines; Fig. 7 is a schematic diagram illustrating the direct mounting of electronic components on the backplane of the bus;

Fig. 8 is a block diagram of an embodiment of a memory interleave apparatus of a memory board of the invention;

Fig. 9 is a block diagram of an embodiment of a memory interleave apparatus of a processor board of the invention;

Fig. 10 is a listing of 8-way interleaved allocation of a plurality of memory banks to a plurality of slots; Fig. 11 is a listing of 8-way interleaved allocation, at reduction factor four, of a plurality of memory banks to a plurality of slots;

Fig. 12 is a block diagram of an embodiment of a software-selectable memory interleave apparatus of a memory board of the invention; and

Fig. 13 is a block diagram of an embodiment of a software-selectable memory interleave apparatus of a processor board of the invention.

DETAILED DESCRIPTION OF THE INVENTION A glossary of selected terms is set forth at the end of the detailed description.

With reference to Fig. IA, a representative computer system 8 including the bus of the present invention is shown.

The computer system 8 comprises a combination of input/output (I/O) boards 18 and 22, global memory boards 26 and 27, processor boards 30 and 31, and a combination I/O and global bus allocation (BOSS) board 12, each attached to a high-speed global bus 24. The bus allocation portion of the BOSS board 12 and the high-speed global bus 24 taken together are referred to as the bus of the invention. The system 8 also typically includes a work station 10, a host computer 16, and a high resolution graphics display 20. Additional memory, I/O and processor boards may be included in a particular system, and more than one host computer and more than one graphics display may be connected to the system. In a preferred embodiment, there are a maximum of forty slots on a global bus 24, each slot being capable of accepting one board. For identification purposes, each board has an ID PROM which is readable by the BOSS board 12 using separate backplane wires. Each global memory board, or other board that looks like memory to the bus (such as an I/O board) , is addressed according to the slot it is plugged into. Alternatively, the address may be changed at run-time as needed. Ideally, slot space is allocated to each board in a NuBus-like manner, (i.e., backplane wiring encodes the physical slot number so that when a board is plugged in, the board's location is known to the backplane and to the other boards) leaving the bulk of the address space open for re¬ assignment. The computer system 8 includes a number of independent user processors located on processor boards 30 and 31. Each board 30 and 31 may contain several processors, and are preferably based on standard microprocessor chips such as the Motorola 68030 or the Motorola 88000. Each processor preferably has 4 megabytes of local memory. The illustrated system 8 can have up to 128 processors, mounted four to a processor board, although it is not limited to this number.

The bus 24 is made available to each of the multiprocessor boards 30, 31 in the system by using a rotating grant scheme. According to the invention, there are no bus requests, only bus grants. A bus grant is allocated to each slot in turn on each clock cycle, regardless of the number of boards actually connected to the global bus 24. Thus, a high-speed, rotating bus allocation scheme is implemented that is independent of the number of data devices in communicating relationship with the backplane of the bus. Thus, in an embodiment including 128 processors, each multiprocessor board is granted a cycle every 1.28 microseconds (32 clock cycles, where each clock cycle is approximately 40 nanoseconds) . Using a fixed bus allocation scheme eliminates bus arbitration overhead, and provides predictable bus access latency and predictable throughput.

The sequence in which the processors obtain access to the bus 24 is stored in high speed RAM on the BOSS board 12. When a multiprocessor board obtains access, it can execute a read or write cycle as determined by the bus allocator on the BOSS board 12. Because the BOSS 12 determines the type of each transfer, the bus 24 is guaranteed to be performing only blocks of reads or blocks of writes at any one time. Thus, only a single data bus is necessary (32 lines) to operate at full capacity (one transfer per clock cycle) during each write burst and each read burst. Read/write transitions occur on a time scale large with respect to one clock cycle. In this embodiment, a delay of 8 clock cycles is required to make transitions between read and write modes, each write phase persists for 130,000 clock cycles, and each read phase persists for 270,000 clock cycles.

A frame is a period during which every processor is allowed to access memory. Each frame includes a read phase, a read/write transition phase, and a write phase. The BOSS regulates the cycle of read/write transitions, which is synchronized to a frame period of approximately 16 milliseconds.

There is a dead-time between frames during which the bus is kept idle to accommodate changes in the direction of data flow, and in some embodiments, to allow synchronization to external devices. It is possible to control the amount of dead time via software. For example, in graphics applications, it is advantageous to synchronize the frame rate to a video raster signal. For applications involving stepper motors, as in machine control, it is useful to synchronize with the 60 cycle alternating current source. If the frame rate is not synchronized to an external device, the minimum dead-time in this embodiment is 8 clock cycles. The main database for the computer system 8 resides in global memory located on the global memory boards 26 and 27. In the present embodiment, the global memory is arranged in units of 32 megabytes per global memory board. Each global memory board is divided into eight memory banks, wherein each memory bank includes multiple memory chips. The bus cycles at 40 nanoseconds, and the chips cycle more slowly at 200 nanoseconds, so the memory banks are accessed in sequence, but no more often than once every eight bus cycles, or every 320 nanoseconds. This process, referred to as eight-way interleaving, gives the memory chips time to return to equilibrium. Eight-way interleaving also permits a memory cycle to start on each bus cycle. The additional 120 nanoseconds provides time to perform error scrubbing and locked updates.

Bank ordering is enforced by the BOSS. A multiprocessor board is given a grant for a particular bank and must wait for the correct bank to come up in the interleaving sequence. A processor board which follows the standard sequence thus does not have to wait. Even if the standard sequence is not followed, each multiprocessor board is guaranteed access to the appropriate memory bank at least once every frame.

Upon receiving a read function signal from a processor, a memory board can respond in a variety of ways. Two response lines running from memory to processor encode responses such as "busy", "transfer accepted", "non-present memory", and "unsupported function". A third response line indicates "uncorrectable data error".

Memory interleave sequencing is determined before the BOSS allows any reads or writes. Memory interleaving is optimized by arranging data addresses so as to maximize the utilization of memory.

To refresh the dynamic RAMs of the global memory, additional cycles not used by the multiprocessor boards are needed, so three refresh cycles are inserted after every 128 granted cycles. This refreshes three of the eight banks approximately each five microseconds, which is close to the optimum rate, while allowing each microprocessor on the boards one access to each bank.

To maximize global bus utilization and to provide data coherency, the BOSS board 12 sequences bus grants according to a grant list providing each user processor access to successive memory banks on successive bus accesses. During any single clock period on the global bus, one of the memory banks is guaranteed to be available for reading or writing. While many banks may in fact be available, the system enforces a rule that on any given bus cycle, only a specific bank may be accessed, insuring predictable performance. (See the Appendix for an example of a grant list.)

The grant list for a system with the maximum number of memory boards and processor boards gives each user processor access to one memory bank. Thus, the grant list must be repeated eight times to give each user processor access to each memory bank. The length of the grant list multiplied by the number of times it must be repeated to give each user processor access to each bank is called the grant list epsilon. Epsilon is the minimum number of bus cycles in the read or write phase of a frame. The epsilon insures that each phase begins on the same bank for each user processor and that each user processor gets an access to every bank. If each user processor is partitioned into virtual processors, as in the present embodiment, herein referred to as cells, epsilon must be multiplied by the number of cells per user processor to insure that every cell gets an access to each memory bank.

Any two user processors can not access the same bank at the same time. For example, if user processor A reads its first data from bank 0 at the beginning of a frame, user processor B can not read its first data from bank 0 because bank 0 is already busy. Instead, user processor B reads data from bank 1, user processor C reads from bank 2, and so on. To ensure that each user processor gets a turn at every bank, - li ¬ the grant list is incrementally rotated each time it is repeated.

To insure data coherency, each cell must read from global memory in ascending global address order, write to global memory in ascending order, and never miss an opportunity to read or write. Also, each user processor must initiate reading and writing at the same memory bank.

Fig. 5 illustrates how bus activity associated with reading from and writing to global memory is coordinated with processing activity. Each cell must see the same data in global memory within each frame. When a cell computes a new value, the new value is stored in a separate local memory buffer so that cells reading the variable from global memory will not see the new value until the next frame. Processing always starts with cell 0. Each numbered tile in the upper row represents the time a user processor dedicates to processing the data stored in local memory associated with each cell. Each tile in the lower row represents bus activity corresponding to reads from and writes to global memory, as allocated by the BOSS board 12.

While cell 0 is processing, data to be processed by cells 1 and 2 are read from global memory into the local memory of their respective user processors, which may be the same processor. While cell 1 is processing, data to be processed by cells 2 and 3 are read from global memory into the local memory of their respective user processors, and so on for the rest of the cells up to n, the maximum number of cells. Each cell's data is ready by the time that cell begins processing, except for cell zero. After all the cell's data are read from global memory, the BOSS initiates a transition from the read to the write mode. Starting again with cell 0, data from the local memory associated with each cell is written to global memory. The time spent reading from global memory plus the time spent writing to global memory for each cell is equal to the cell's processing time. Therefore, the results of cell processing are always ready by the time it needs to write them to global memory, except for the very last cell, whose results cannot be written to global memory until the next frame. Thus, there is a cell (0) at the beginning of each frame which cannot read from global memory during that frame, and there is a cell (n+1) at the end of a frame which cannot write to global memory during that frame.

In a preferred embodiment, 128 cells are supported, numbered 1 to 128, and there are two additional cells numbered 0 and 129 which are used for debugging and other system functions.

The period of the frame is set so that it has a duration of l/60th of a second, or about 16 milliseconds. There are therefore about 400,000 bus cycles in one frame. These bus cycles are equally divided among the 128 processors, resulting in 3,120 memory reads or writes for each processor per frame. These 3,120 reads and writes can be shared among the 128 processing cells, so that each cell can have 24 bus cycles per frame. Since four bytes can be transferred on each bus cycle, each cell is able to import or export a total of 96 bytes per frame.

Each cell has a guaranteed number of reads before processing and a guaranteed number of writes after processing, during each frame. The system is synchronized so that all cells appear to read at the start of a frame and write at the end of the frame.

Referring again to Fig. IA, the I/O boards 18 and 22 function as dual-port global memory boards which contain high-speed parallel interfaces to the external devices 16 and 20, respectively. When data is passed from an external device to the global memory on an I/O board, it becomes immediately available to all of the processors in the system via the high-speed global bus 24. The I/O boards 18, 22 are directly connected to their respective external devices, and are each managed by a separate on-board I/O processor without requiring the use of cycles from either the global bus 24 or a processor connected to the global bus 24.

The work station 10 loads programs into the system 8 through the I/O portion of the BOSS board 12 via the high speed parallel bus 14. The host computer 16 provides data to and stores data from the system 8 through I/O board 18 coupled by parallel bus 15. Using an I/O board as a buffer, and under control of the BOSS board 12, the host computer 16 can load a program into the system, write data to the global memory and retrieve results therefrom, and can copy programs out of the global memory and store them for later execution. The graphics display 20 is coupled to the system 8 by means of I/O board 22 and bus 17, and permits high resolution graphics to be presented by the system. Referring to Fig. IB, to accomplish data transfer, the computer system employs a wide high-speed global bus 24 having, in a preferred embodiment, forty address lines, thirty-two data lines, two utility lines, three response lines, three function lines and six grant lines. The global bus 24 is preferably designed as a flat backplane whose length is limited to the length of conductors that can be driven reliably at the full bus speed, which in the present embodiment is 36 inches.

In order to minimize the propagation effects of a typically three foot long backplane, the global bus 24 uses two clock signals RCK and LCK, one propagating in each direction, as shown in Fig. IA. Backplanes of length greater than three feet are also contemplated, and would also benefit from the use of two clock signals. There is only one clock generator, which resides on the BOSS board 12.

The right-traveling clock signal (RCK) is driven by a differential ECL such as a MC10H105 "OR/NOR gate" on the BOSS board which is plugged into the highest numbered slot at the left end of the backplane. The left-traveling clock signal (LCK) is driven by a repeater 28 mounted at the right end of the backplane. The two clock signals have a known phase relationship because the returning clock signal is generated by buffering the outgoing clock signal (LCK) and then sending it back as the returning clock signal (RCK) .

The right-going clock signal clocks right-going (read) data signals, and the left-going clock signal clocks left- going (write) data signals. Signal directions are determined a priori because all processor boards are plugged into the right side of the bus, while BOSS, memory, and I/O boards are plugged into the left. For example, referring to Fig. IB, grant data always travels left-to-right, and address data always travels right-to-left. Using clock signals which travel with the data signals allows data rates as fast as those of an asynchronous bus, while enjoying the arbitration switching rate of a synchronous bus. Each board that is plugged into the backplane receives

TTL level clock signals via a clock signal receiver such as a MC10H125 receiver. Clock signal lines use differential ECL to reduce skew at each receiver. All other signals (e.g., data, address, etc.) that are synchronized to the 40 nanosecond clock signal are driven and received by DS3893 BTL "Turbotransceivers" to meet the timing requirements. These transceivers provide each board with a TTL level interface to the bus signals with separate tri-state receiver outputs and driver inputs. All drivers are preferably inserted in sockets that are directly mounted on the backplane to reduce stub length and thereby reduce reflections and capacitive loading. The backplane can be properly terminated irrespective of the number of boards that are plugged therein. Referring to Fig. 7, a preferred means for reducing stub lengths is to surface mount a device 400 to a pin grid array (PGA) carrier 402 that is mounted on the backplane 404. A daughter board can connect to the backplane 404 via a connector 406. Bussed signals are carried by lines that all reside in one buried layer that is sandwiched between two ground planes. These lines are driven by the BTL devices, and have a controlled impedance and are terminated with the impedance at each end of the backplane. Slower signals are interfaced with BTL drivers/receivers on the daughter boards. Trapezoidal drivers with limited slew rates are used to reduce the effects of the longer stubs on these lines. Stub lengths should still be kept to a minimum.

Each set of four signals shares both an active high driver enable and active low receiver enable. Most boards will ground (activate) all the receive enables, but they are wired separately for flexibility of design.

Using a fixed bus allocation scheme minimizes the number of wires and bus drivers needed to control global bus access, and reduces backplane and global bus interface complexity. The BOSS board 12 is shown in Fig. 2. A local BOSS bus

120 is coupled to a CPU 122, a ROM 124, a dynamic RAM 126 and a floating point unit (FPU) 128. The BOSS bus 120 is also coupled to a serial interface 130, a time of day clock 132 and a parallel I/O 134. A global memory 135, which is part of global memory for the system, is coupled to the BOSS bus 120 and is also coupled via buffers 136 to the global bus 24. The global memory 135 is also coupled via the local BOSS bus 120 and a memory access interface (MAI) 138 by which the system can be connected to a remote processor or workstation for remote access to the memory. The global bus grant list hardware 140 is coupled to the bus 120 and bus control signal drivers 142, providing grants to the global bus 24 via the control signal drivers 142. A global bus timing generator 144 provides timing signals to units 140 and 142 in accordance with a 100 MHZ clock 146.

To provide programmer access to the CPU 122 and to associated memories for debugging purposes, a BOSS ID bus 150 and a hacker (debugging) channel 154 are typically provided on the backplane to which the BOSS board is connected. Identification of the boards is provided by identification data conveyed via BOSS board I.D. buffers 148 to the BOSS ID bus 150. A serial USART/BTL interface 152 couples the bus 120 to the hacker channel 154.

The order of bus grants is controlled by the global bus grant list hardware 140 on the BOSS board 12. Referring to Fig. 3, the grant list is generated by software while the bus is idle, and is stored in a high-speed static RAM 212. The grant list identifies the sequence in which slots will be granted access to the global bus 24 so that a processor board residing in a particular slot may read from and write to global memory. Some bus cycles are granted to memory for refresh.

Each pass through the grant list allocates one bus cycle to each of the 128 user processors in turn. In the present embodiment, to give the 128 user processors one 40-nanosecond bus cycle requires 5,120 nanoseconds, or 5.12 microseconds. In the preferred embodiment, each user processor can transfer 4 bytes per bus cycle, permitting an overall gross data transfer rate per processor of about 780,000 bytes per second. The Read Cycles 208 and Write Cycles 216 registers contain the number of iterations through the grant list for the read and write phases of the bus frame, respectively. The grant list is transmitted on the bus 24 for some integral number of iterations while the bus 24 is in read mode, under control of software running on the BOSS board 12 using a repetitions counter 206 and a read cycles register 208. Then, this process is repeated while the bus 24 is in write mode using a write cycles register 216.

The grant list hardware 140 contains a data structure that is held in a RAM 212 that is supported by several counters 202, 206 and control hardware. Each entry in the grant list corresponds to a single bus cycle. (See Appendix for an example of a grant list.) The entry is placed on the grant lines of the global bus 24, and allocates the cycle to a slot containing a particular multiprocessor board, a memory bank for refresh, or to the slot that contains the BOSS board 12.

There are up to forty slots in the global bus and sixty- four possible values in each entry of the grant list. The slots are numbered 0 to 39. The following table shows the grant list values for the various uses:

Allocate To Use grant list entry Multiprocessor slot ID (0 - 39) memory for refresh 62 bus no-op 63

BOSS usage 63

Table 1 The grant list is placed in the RAM 212 so that it ends at the end of the RAM. A Begin Register 204 is set up to contain the index of the first element of the grant list.

The following is a description of grant list operation in steady state. Referring to Fig. 3, a frame begins with the 10-bit index counter 202 loaded from the Begin Register 204, the 16-bit repetitions counter -206 loaded from the Read Cycles register 208 via MUX 210, and the global bus in Read mode. On the first cycle, the entry in the grant list in RAM 212 at the location indexed by the index counter 202 (i.e., the value of the Begin Register 204) is placed on the global bus grant lines 214. Each global bus cycle causes the index counter 202 to be incremented, so subsequent entries of the grant list 212 are placed on the global bus grant lines 214 on subsequent global bus cycles.

This process continues until the index counter 202 reaches the end of the grant list RAM 212. Then, the index counter 202 is reloaded from the Begin Register 204, and the repetitions counter 206 is decremented. In this way, the grant list 212 is repeated continuously on the global bus grant lines 214.

When the repetitions counter 206 reaches zero, the following events occur. The index counter 202 is held (it has just been loaded with data from the Begin Register 204) , and the repetitions counter 206 is loaded with data from the Write Cycles register 216. The Dead Time Generator 218 places global bus no-ops on the grant lines 214 for the next eight global bus cycles to complete the last eight read operations. The global bus is then placed in Write mode, and the index counter 202 is enabled by the Dead Time Generator 218.

The write phase proceeds placing the entries of the grant list stored in the RAM 212 onto the global bus grant lines 214, as in Read mode. When the repetitions counter 206 reaches zero, the index counter 202 is held (it has just been loaded with the contents of the Begin Register 204) , and the repetitions counter 206 is loaded from the Read Cycles resister 208. The Dead Time Generator 218 places global bus refreshes on the grant lines 214 for the next eight global bus cycles to turn around the global bus. The global bus is placed in Read mode, and the index counter 202 is enabled.

The entire process (called a frame) is repeated indefinitely.

In short, a frame is the process of iterating through the grant list for reading and then for writing. During each frame, each user processor is given enough global memory read accesses to get the data it needs, then enough global write accesses to write its results in global memory before the start of the next frame. This process is initiated by software and is then hardware driven. The frame is repeated indefinitely without further intervention by the software.

Referring to Fig. 6, read cycles consist of address and data segments. Write cycles transfer addresses and data on the bus all during one cycle. When the bus switches from write to read mode during a read delay-time phase 300, the data bus is idle for 8 cycles while the pipeline fills. Once the pipeline is filled, data transfers occur on each clock cycle during the steady-state read phase 301. When the bus switches from read mode to write mode, the BOSS must delay grants for 8 cycles while the pipeline empties during a dead- time phase 302. The pipeline again fills during a write delay time phase 303, whereupon a steady-state write phase 304 begins. A software controllable interframe dead-time 305 must elapse before the next read cycle begins. The write pipeline does not need to empty before the read phase starts, and in fact the interframe dead-time could be eliminated to reduce overhead in systems with much shorter frame times. As mentioned above, this dead-time may be synchronized to external devices.

Fig. 6A shows the timing relationship of the grant lines with the other bus signals, i.e., the address & function, data, and response signals. Note that there is a two clock cycle difference between the grant lines and the address and function lines.

The global bus assumes that all transactions are memory update cycles or DMA (direct memory access) transfers between the local memory of a microprocessor and the global memory.

All global bus accesses are 32 bits in width. ALU functions on the global memory boards and I/O boards allow processors to update partial words using AND, OR and Exclusive OR logic. This allows a global memory word to be divided and distributed among a plurality of processors, each processor updating its portion of the word, independently of other portions. This is useful for arrays of bytes as well as string data or video pixel arrays. The ALU can also implement read-with-increment or read-with-decrement for "take-a-number" style resource allocation.

Figs. 4A through 4H are timing diagrams showing reading and writing operations for the global memory. Each processor is given sufficient access to the global bus 24 to write the results of its previous cycle to global memory and to read data from global memory for its next cycle. Because the bus 24 is fast, each processor must set up its data transfer before it has bus access, and to optimize efficiency, a processor which is not ready to put data on the bus 24 at the proper time loses its turn. Fig. 4A shows the global bus cycle where each cycle is about 40 nanoseconds. Data transfer to or from a processor takes three bus cycles. At a first cycle rising edge Al, processor number 0 is granted a bus cycle by the BOSS board 12, and receives a grant during cycle B-0, shown in Fig. 4B. Global bus input signals are strobed into registers on each processor board at the rising edge of the bus clock, without any gating or decoding.

At the rising edge A2 of the second bus cycle, the processor receives the actual grant during period C-0 of Fig.

4C, but it still needs time to compare the grant to its own slot, comparing the allocated bank to the needed bank, comparing the state of write/read register to the allocated transfer direction, so that it can use the cycle. If verification is correct, at the rising edge A3, the processor caries out transactions during time DO, i.e., read or write.

The process is repeated one cycle later for the next processor in the sequence. By overlapping the three phases of the transfer cycle, note that the computer system can start a new bus cycle every 40 nanoseconds, even though the individual processors cannot operate that fast.

A write of data from the local memory of a processor to global memory takes one bus cycle, during which time data is transferred from high speed registers associated with the processor to high speed registers on the appropriate global memory board. The data is then written from the fast registers on the memory board into the appropriate memory location during the subsequent seven bus cycles.

A read of data from global memory is split into address and data sections and thus transfers data to a processor during the opposite halves of two different bus cycles. Since memory cannot be read in a single bus cycle, data is returned eight bus cycles later. As shown in Fig. 4, during the first bus cycle, the address of the global memory word is transferred from high speed registers associated with the processor to fast registers on the memory board during cycle D-0. The memory obtains the data during time period E-0, which is eight cycles long. When the memory finishes cycling and the data becomes available, another bus cycle A4 transfers the data from the fast registers in the memory board to the fast registers associated with the processors, during period H-0.

To assure that all global data is updated coherently, the processors operate synchronously so that no processor starts a new timing frame until all of the other processors have finished their frame. A multi-word write to global memory from a single user processor must appear to be atomic, and look as though the user processor performed the write operation before any other user processor can access the data. Any variable requiring more than a one word transfer must be updated without skipping an available cycle. A multi-word variable must be updated in specified order using every cycle available for the purpose. All data may be transferred using multiple reads during a single bus cycle. Each frame allows every processor to read from a single location in memory and allows every processors to write to a single location in memory.

The bus of the invention is applicable to multiprocessor systems such as that disclosed in U.S. Patent No. 5,136,717,1988, wherein each user processor is divided into a plurality of processor cells. The bus of the invention provides guaranteed access to a global memory by each cell.

MEMORY INTERLEAVING

The main database for the computer system 8 resides in global memory located on the global memory boards 26 and 27. In the present embodiment, the global memory is arranged in units of 32 megabytes per global memory board. Each global memory board is divided into eight memory banks, wherein each memory bank includes multiple memory chips. The bus cycles at 40 nanoseconds, and the memory chips cycle more slowly at 200 nanoseconds, so the memory banks are accessed in sequence, but no more often than once every eight bus cycles, or every 320 nanoseconds. This process, referred to as eight-way memory interleaving, gives the memory chips time to return to equilibrium. Eight-way memory interleaving also permits a memory cycle to start on each bus cycle. The additional 120 nanoseconds provides time to perform error scrubbing and locked updates.

Bank ordering is enforced by the BOSS. A multiprocessor board receives a grant for a particular bank and must verify that the granted bank is the bank required. Otherwise, the multiprocessor board must wait for the correct bank to come up in the interleaving sequence. A processor board which follows the standard sequence thus does not have to wait. Even if the standard sequence is not followed, each multiprocessor board is guaranteed access to the appropriate memory bank at least once every frame.

Memory interleave sequencing is determined before the BOSS allows any reads or writes. Memory interleaving is optimized by arranging data addresses so as to' maximize the utilization of memory.

The memory interleaving of the invention is hardware- enforced to prevent any bus master (processor board) from locking another bus master out of its assigned bank. Thus, when a slot containing a bus master is granted a bus cycle for a particular bank, that bank is guaranteed not to be busy, no matter what the previous bus master(s) have done. This is not the case in systems with random memory interleaving, such as the SEL bus.

The way the bus implements this interleaving scheme will now be discussed, including an alternate embodiment that implements a method by which a reduced system (one which has fewer backplane slots) can use the same bus and get the benefit of a reduced interleaving factor.

The bus accepts a new data transfer every 40 nanoseconds. From the time a data address is valid at the input pins of a typical dynamic random access memory (RAM) chip, 100 nanoseconds are needed to access the data stored therein. Dynamic RAMs also must "precharge" internal circuitry before the data stored therein can be accessed again, requiring another 100 nanoseconds. Thus, about 200 nanoseconds are required to perform a complete (read or write) memory cycle. Moreover, some memory cycles also include an arithmetic or logic function between a read and a write operation. This allows commands such as "read and increment" and "add data to memory". To do this with typical dynamic RAM chips, the memory needs to cycle at 320 nanoseconds, or 8 bus clock periods. Thus, to allow a transfer to take place on every clock period, the memory must be distributed over 8 banks, i.e., be eight-way interleaved.

Since each bank requires 8 clock cycles to complete an operation, the availability of banks on each clock cycle depends on transfers which occurred during the previous seven cycles. In a multiprocessing system such as the one incorporating the bus of the invention, the bus is allocated to a new master on each clock cycle, and the previous seven bus masters determine the bank(s) available to the current master. In a system which allows processors to request any bank at random, called a randomly interleaved system, bank access is not predictable. In fact, there is a non-zero probability that a bank will be unavailable to a given bus master at any given time. This is not acceptable for a systolic, real-time system such as a system that incorporates the bus of the invention, as described in U.S. Patent No. 5,136,717, herein incorporated by reference.

The bus of the invention avoids this problem by preassigning each clock cycle to a particular bank of memory. The grants to each slot are made synchronously to the bank selection circuitry. Thus, each access is guaranteed to coincide with a particular bank, and the bus master at any given slot knows a priori which bank will be available on the next granted access. Referring to Fig. 8, in this embodiment, the address bits 500 which correspond to bank selection (bits A2, A3, and A4, for example) are not transmitted over the backplane bus and provided via the bus connector 502. Instead, a signal called BANKSYNC 504 is used to synchronize bank counters located on each of the boards in the system. The bank counters are implemented as a programmable array logic device (PAL) 506, as in the global memory board of Fig. 8, and in the PAL 606 of the processor board of Fig. 9. These PALs 506 and 606 count through eight banks in one of four sequences as determined by a pair of Bincr<2:l> select lines 508 and 608, respectively. The PALs 506 and 606 are reset every 8th clock cycle (given that there are 8 banks) by the banksync signal 504, and are used to determine the bank which is available for use on any granted cycle, as represented by the address signals 500. These signals 500 are decoded by a decoder 510, which in turn provides a plurality of bank select signals 512 to a plurality of memory banks (not shown) . A base address register 514 receives a plurality of address signals 516, and provides a plurality of base address signals 518 to a comparator 520. The comparator 520 receives the plurality of address signals 516 and the plurality of base address signals 518, and provides an accept signal 522 upon detecting a condition wherein the plurality of address signals 516 matches the plurality of base address signals 518.

Circuitry on each processor board, e.g., as shown in Fig. 9, compares the global memory address A0, A2, A3, A4 requested by its EXIM list 610, and uses a granted cycle only if this address matches the value 600 of the bank counter 606 and the write line 607 as determined by a comparator 612. In the event of a match, and if slot signals 609 match the grant signals 611, the comparator 612 provides an accept cycle signal 614. Each physical slot accepts a multiprocessor board with, for example, four processors on it. The grants, however, only indicate a physical slot number, not the processor on the board residing in that slot which is being granted a particular bus cycle. Therefore, in this case the grant list is arranged such that each physical slot appears in the list four times. Circuitry on each multiprocessor board assigns each of the four allocated cycles in a round-robin fashion to one of the four processors on that board. The processors on each multiprocessor board share the bus interface circuitry and this circuitry is too slow to use multiple adjacent bus cycles. Consequently, in a system with all slots occupied, the grant list is arranged such that only one cycle is granted to each slot before any given slot receives another granted cycle. Thus, a system with 128 processors, for example, would require a grant list with 128 grants allocated as four grants to each of 32 slots. Successive grants to any single slot would be separated by at least 31 bus cycles, during which all of the other slots received grants.

Because the banks are rotated regularly every eight clock cycles, the order in which each slot is assigned banks only depends on the length of the grant list, modulo eight. If the grant list allocated four bus cycles to each of the possible 32 slots, the length of the grant list, modulo eight, would be zero. If this were the case, each processor would always receive a grant of access to the same bank, thereby preventing the processors from accessing data stored in the other seven banks. However, if the grants can be made to precess through the banks, wherein a different bank is available to a processor every iteration of the grant list, then the processor can access all 8 banks. To make the grants precess through the banks, additional cycles are added to the list which are not assigned to any slot, but are rather used to refresh dynamic memory. By adding three refresh cycles after the 128 granted cycles, the dynamic memory is sufficiently refreshed to retain data, and the processors are granted access to all banks in sequence before being granted the same bank again.

Referring to Fig. 10 and Table 1, to reduce the complexity of the software which loads addresses into the EXIM list, each bank counter, e.g., the PAL 506 of Fig. 8, counts by three (0, 3, 6, 1, 4, 7, 2, 5) which grants the banks in ascending order of address to any slot. Fig. 10 shows a grant list for implementing normal 8-way interleaved operation. The upper row 910 of the grant list shows bank ordering, and the lower row 912 of the grant list indicates the corresponding granted slot. For example, the fourth slot 914 receives access to the fourth memory bank 916, and the sixth slot 918 gets access to the second memory bank 920. Note that each time slot zero receives a memory bank allocation, the number of each successive allocation increases by one. So, for example, the first time it receives access, such access is to the zeroth bank; the second time it receives access, such access is to the first bank; the third time it receives access, such access is to the second bank; and so forth. The processors residing at each slot are guaranteed to start a frame on a particular bank because the circuitry on the Boss board always starts the grants for each frame at the beginning of the grant list, and in synchronization with the BANKSYNC signal.

Each processor preallocates small sections of the EXIM list to each of the "cells" it emulates. In the current embodiment, sixteen import words and eight export words are assigned to each cell. Due to the need to guarantee resources to each cell, these EXIM list sections are fixed. Each cell is therefore allocated sixteen import words as two words from each of eight banks, and eight export words as one word from each of eight banks. Thus, in a randomly connected system, access would be guaranteed to only two imports and one export per cell. Thus, it is desirable to reduce the effective interleaving factor whenever possible so as to guarantee more than two imports and more than one export per cell.

EFFECTIVE REDUCTION IN INTERLEAVING FACTOR

In a reduced system where the bus has no more than one half of the maximum complement of slots, as illustrated in Tables 2-4, a reduction in effective interleaving factor is possible by utilizing the spare bandwidth of the memory system. Spare bandwidth results when the number of slots is less than the maximum possible number of slots in a grant list. Using a grant list according to the method of the invention, such a reduced system can use the same bus, and therefore the same grant list hardware and software, and get the benefit of a reduced interleaving factor in the following manner: Each slot is granted more than one successive cycle each time it is granted the bus. In a system with 16 slots out of a possible 32 slots, as shown in Table 2, wherein 16 slots allows a maximum of 64 processors*, two successive cycles granted to each slot results in a reduction factor of 2, thereby providing an effective interleaving factor of 4 instead of 8 in a system using a board with 8-way memory interleaving.

For another example, Fig. 11 and Table 3 show a 128-item grant list with only 8 granted slots allowing no more than 32 processors. In this system, four successive cycles are granted to each slot. If there were no more than 16 processors, 8 successive cycles would be granted to each slot.

However, a processor residing in a slot is incapable of using successive bus cycles for each successive memory access due to limitations in the speed of the EXIM list circuitry. Nevertheless, the additional cycles can be used to reduce the effective interleaving factor. This is accomplished by waiting for the cycle which matches the bank requested by the address in the EXIM list. A bank mismatch is only declared if all successive grant cycles go by without a bank match.

To ease software requirements, the bank counters (implemented as PALs in the present embodiment) change their count sequence in accordance with the interleave reduction factor, and the grant list changes the number of refresh grants to provide the proper bank precession. For example, as shown in Fig. 11, in a system with 8 slots allowing no more than 32 processors, each slot appears four times in the grant list, followed by four refresh grants. The bank counters count in the sequence 0, 2, 4, 6, 1, 3, 5, 7, and thus the slots see a system with effectively two banks (0,2,4,6 and 1,3,5,7) which are selected by only the least significant bit (bit A2 of Fig. 8) of the original bank select address lines. The bus has two lines Bincr<2:l> 508 which encode the desired interleave reduction factor. These lines allow the bank counters 506, 606 on each board to count in the proper sequence, and allow any system to change the effective interleave reduction factor via software. With reference to Fig. 12, advances in the speed and density of static RAMs have made it possible to make memory boards which are truly two-way interleaved at the bus transfer rate. They can perform a read/ALU-operation/write cycle in just two 40 nanosecond bus clock cycles. To exploit these capabilities, the alternate embodiment of a global memory board shown in Fig. 12 includes bank select address lines Address<4:2> 703 that are received by a decoder 710 which in turn provides a plurality of bank select signals 712 to a plurality of respective memory banks (not shown) . Bank select address lines can also be used to provide an embodiment with one-way interleaved (40 nanosecond) memory.

By contrast with the embodiment of Fig. 9 wherein the processor boards check the bank counter PAL 606 to determine bank availability, in the memory board of the alternate embodiment of Fig. 12, the PAL 706 checks the bank select address lines 703 and issues a "bank busy" response on the Cycle Accept line 707 if that bank is not expected on the corresponding cycle. Note that this must be done even if the bank is not really busy, because using the bank at the wrong time could cause it to be left busy when it is expected to be available.

In the alternate embodiment of Figs. 12 and 13, the count sequence that determines bank availability is a function of the memory board's interleave factor (the number of memory banks on that memory board) as well as the system interleave reduction factor, as determined by the Bincr<2:l> lines 708.

Also, the PAL 706 is responsive to a board selection signal 714 provided by a comparator 716 that compares a plurality of addresses 718 with a plurality of base addresses 720 provided by a base address register 722.

With reference to Fig. 13, another embodiment of a processor board includes a first PAL 806 that receives a synchronization signal 804 that resets the PAL 806. The PAL 806 also receives a sequence select signals 808 for selecting one of a plurality of memory bank allocation sequences. The PAL 806 provides a plurality of memory bank address signals 800.

An EXIM list module 810 provides a plurality of EXIM list address signals 811 in accordance with an EXIM list, including an acceptance condition signal 815. A first comparator 812, coupled to the PAL 806 and to the EXIM list module 810, receives the plurality of memory bank address signals 800, the plurality of EXIM list address signals 811, and a write signal 807. By comparing these signals the first comparator 812 providing a bank match accept signal 816 upon detecting a condition wherein the plurality of EXIM list address signals 811, excepting the Al signal 815 match the plurality of memory bank address signals 800 and the write signal 807. A second comparator 813 receives a plurality of grant signals 817 and a plurality of slot signals 818, and provides a slot match signal 819 upon detecting a condition wherein the plurality of grant signals 817 match the plurality of slot signals 818.

A second PAL 820 is coupled to the EXIM list module 810 and to the second comparator 813, and receives the Al signal

815, the slot match signal 819, and the bank match signal

816. The second PAL 820 provides an accept cycle signal 814 upon detecting a condition wherein the slot match signal 819 and the bank match signal 816 are in a first logic state, such as true, and the Al signal 815 is in a second logic state, such as false. The second PAL 820 also provides an accept cycle signal 814 upon detecting a condition wherein the Al signal 815 is in the first logic state, the slot match signal 819 is in the first logic state, and the slot match signal 819 was not in the first logic state on the prior bus cycle.

In the embodiment of Figs. 12 and 13 interleave reduction is accomplished as follows. In a system with 16 granted slots out of a possible 32 slots, and a memory board that implements two-way memory interleaving, for example, the interleaving can be reduced to effectively one-way interleaving (i.e., non-interleaved) by granting each of the 16 slots eight cycles as four pairs of two successive cycles, and in a system with four processors per board, each of 64 possible processors receive two successive cycles. Instead of determining which of the two granted cycles corresponds to the bank a processor board needs, the processor board always uses the first granted cycle. The memory board then makes available both banks on every other cycle which corresponds to the cycles allotted to processor board usage.

Thus, each of the plurality of processors accepts an allocation of a currently available memory bank of the memory resource only upon satisfaction of an acceptance condition. In the embodiment of Figs. 8 and 9, the acceptance condition is satisfied when an address of the currently available memory bank corresponds to an address in an EXIM list of the processor granted the memory resource. In the embodiment of Figs. 12 and 13, the acceptance condition is satisfied when the currently available memory bank is the first memory bank granted to a processor. In a preferred embodiment, the acceptance condition is determined by the state of an acceptance condition bit on each board.

The following tables show various exemplary interleave reduction factors with corresponding bank availability order for three (2-, 4-, and 8-way) memory board interleaving configurations:

TABLE 1

REDUCTION FACTOR 2 GRANTED SLOTS: 16 MAX PROCESSORS: 64 REFRESH GRANTS: 6 BANK ORDERS:

8-WAY (Effectively 4-way) 0,4 3,7 - 2,6 - 1,5 - 4-WAY (Effectively 2-way) 0,2 1,3 2-WAY (Effectively 1-way) 0,1

TABLE 2

REDUCTION FACTOR 4 GRANTED SLOTS: 8 MAX PROCESSORS: 32 REFRESH GRANTS: 4 BANK ORDER:

8-WAY (Effectively 2-way) 0,2,4,6 1,3,5,7 4-WAY (Effectively 1-way) 0,1,2,3 2-WAY (Effectively 1-way) 0,1

TABLE 3

REDUCTION FACTOR 8 GRANTED SLOTS: 4 MAX PROCESSORS: 16 REFRESH GRANTS: 8 BANK ORDER:

8-WAY (Effectively 1-way) 0,1,2,3,4,5,6,7 4-WAY (Effectively 1-way) 0,1,2,3 2-WAY (Effectively 1-way) 0,1

TABLE 4 The ordering for the 8-way interleaved board configuration is the same for both the embodiment of Figs. 8 and 9 and for the embodiment of Figs. 12 and 13. In fact, it is possible for a system to accommodate boards of both embodiments. To accomplish this, a processor board stores a bit Al with each item as part of the global address in the EXIM list that indicates the embodiment of the board being addressed. If the bit indicates a board of the embodiment of Figs. 8 and 9, the scheme of comparing the bank counter to the address is used. If the bit indicates a board of the embodiment of Figs. 12 and 13, the first granted cycle is used. The grant list is the same for either embodiment, thereby allowing systems with boards of each embodiment to run together properly.

INSERTION AND REMOVAL OF BOARDS

The ability to insert and remove boards (both manually and automatically) with the bus powered and operating is mandatory. Problems arise due to surge currents and the unspecified operation of components that are in process of being powered up.

Accordingly, circuitry is included in each board that optionally inhibits any signals from being injected onto the bus. Control of the option is through a separate and independent sub-bus added to the global bus and going to all boards. This sub-bus, sometimes referred to as a back channel, is controlled by the BOSS board and also accesses a dual-ported memory that is used for message-based communication between the BOSS and any other board.

With these facilities (and the provision that a board that is inserted into an operational bus is guaranteed not to interfere with the bus's current operations) the BOSS may interrogate a newly installed board, ascertain its revision level and embodiment type, prepare for its entrance into the system, and inform the user of any of the above before the board is allowed access to the bus. Additionally, the BOSS can query any I/O board as to its internal status, or can command it to perform on-line diagnostics. If any of the responses it receives are inappropriate, the BOSS may remove it from the bus so that system integrity is guaranteed.

Glossary

Arbitration - The resolution of ownership disputes among several data devices requesting access to a bus.

Bank - A group of memory chips and associated logic which cycle in tandem.

BOSS board - Control logic that schedules all accesses to the

Bus. A grant list stored in memory on the BOSS board determines the order in which slots, and any associated resident user processors, are allocated bus cycles during each frame. The BOSS board also includes a parallel interface that shares some of global memory with an attached workstation.

BTL - Backplane Transceiver Logic. A bus-interface logic- family that employs open collector outputs with low voltage swings and precision reference adapted to drive low impedance signal lines without reflections and without crosstalk.

Bus - A large number of conductors that cycle with a period of 40 nanoseconds and transfer 32 bit words of data to and from the user processors and global memory. The term 'bus' may refer only to the conductors which actually carry the data, but more typically refers to the bus sub-system, which in the instant invention includes the conductors, the BOSS board that orchestrates access to the conductors, and associated hardware and software. The bus system is the primary subject of this invention.

Cell - The minimum allocatable unit of hardware resources.

A typical cell has 16 words of memory for data read from global memory, and 8 words of memory for data to be written to global memory, and runs about 350 instructions during every frame.

Cell time - The portion of a frame allotted to a cell for processing. Each frame comprises a fixed number of cell times.

Cycle - A unit of time during which a device completes a repeatable operation. The duration of a cycle depends on the operation.

Data coherence - A transfer of data to a memory is said to be coherent only if partial results cannot be read from memory.

Dead time - Time inserted by the BOSS board between a read phase and a write phase. Dead time is used to let bus lines settle before the direction of data flow reverses, and to synchronize the computer system with external signals.

Frame - A period of time during which every cell in every user processor is given a chance to read data from global memory, run any code in the cell, and write data to global memory. In a preferred embodiment, the frame cycles at 60 Hz, but can be set up to run at different rates.

Global memory - Memory accessible to all user processors in the computer system. During each frame, data are read from global memory, and processed, the new values being written back to global memory.

Grant - Preallocated bus access provided to a slot which may hold a processor board for enabling the processor board to read from or writing to the global memory. Only one bus cycle is allocated per grant, allowing access to one memory bank.

Grant list - Hardware registers on the BOSS board containing the schedule of bus grants to all the slots.

Interleave - A method of synchronizing access to a series of memory banks so that they can be accessed more frequently than the cycle time of the banks' memory chips. If there are N banks, the beginning of each bank's cycle is delayed by the period of the memory cycle divided by N with respect to the previous one.

Thus, one bank is available each 1/Nth of the period of the memory cycle.

Local memory - Banks of memory which are dedicated to local processors. Processors store programs and intermediate data in local memory, transferring only final results to global memory.

Phase - A period during which data on the bus is traveling in a particular direction. Each frame is composed of one read phase and one write phase. A phase is typically very long compared to the period of a bus cycle.

Real-time - Either fast enough with respect to external events so that computer speed is not a problem, or predictable such that the interval between receiving input and generating appropriate output does not vary with the computer load.

Slot - A physical connector on the Bus which may hold a board. (BOSS, memory, multi-processor, or I/O) Turbotransceiver - A high-speed backplane signal transceiver designed for Futurebus. See BTL. Turbotransceiver is a trademark of National Semiconductor Corporation.

User processor - A user processor reads data from global memory, performs computations on the data, and then writes the results to global memory. User processors reside on a processor board.

Other modifications and implementations will occur to those skilled in the art without departing from the spirit and the scope of the invention as claimed. Accordingly, the above description is not intended to limit the invention except as indicated in the following claims.

Claims

What is claimed is:

1. A method for allocating a memory resource that includes M memory banks to a plurality of slots of a backplane of a bus that operates in accordance with a sequence of frames, each frame having a plurality of bus cycles, the method comprising the steps of, for each frame: for each slot, giving the slot access to the memory resource T times successively; and for each time a slot is given access to the memory resource, accessing a memory bank of the memory resource in accordance with a repeating sequence of M memory bank addresses, thereby providing each slot with guaranteed access to each memory bank for each frame.

2. The method of claim 1 wherein within each repetition of the repeating sequence of M memory bank addresses, no memory bank address is repeated.

3. The method of claim 1 wherein T=2, thereby providing an interleave reduction factor of no more than 2.

4. The method of claim 1 wherein T=4, thereby providing an interleave reduction factor of no more than 4.

5. The method of claim 1 wherein T=8, thereby providing an interleave reduction factor of no more than 8.

6. The method of claim 1 wherein M=8, thereby providing an eight-way memory resource type.

7. The method of claim 1 wherein M=4, thereby providing a four-way memory resource type.

8. The method of claim 1 wherein M=2, thereby providing a two-way memory resource type.

9. The method of claim 1 wherein each slot of the backplane of the bus can hold a processor board, the processor board including a plurality of processors.

10. The method of claim 9 wherein the processor board includes four processors.

11. The method of claim 9 wherein a sequence of grants of access to the memory resource are allocated to each of the plurality of processors in accordance with a cyclical allocation scheme.

12. The method of claim 11 wherein each of the plurality of processors accepts an allocation of a currently available memory bank of the memory resource only upon satisfaction of an acceptance condition.

13. The method of claim 12 wherein the acceptance condition is satisfied when an address of the currently available memory bank corresponds to an address in an EXIM list of the processor granted the memory resource.

14. The method of claim 12 wherein the acceptance condition is satisfied when the currently available memory bank is the first memory bank granted to a processor.

15. The method of claim 12 wherein the acceptance condition is satisfied when the currently allocated memory bank is the first of successive granted memory banks, and the currently allocated memory bank is available.

16. The method of claim 12 wherein the acceptance condition is determined by the state of an acceptance condition bit.

17. A processor interface module for use with a memory resource that includes M memory banks that are allocated to a plurality of slots of a backplane of a bus that operates in accordance with a sequence of frames, each frame having a plurality of bus cycles, the processor interface module comprising: a logic device for receiving a synchronization signal for resetting the logic device, and a sequence select signal for selecting one of a plurality of memory bank allocation sequences, the logic device also providing a plurality of memory bank address signals; an address source for providing a plurality of address signals; a comparator, coupled to the logic device and to the address source, for receiving a write signal, the plurality of address signals, a plurality of grant signals, and a plurality of slot signals, and for providing an accept signal upon detecting a condition wherein the plurality of grant signals match the plurality of slot signals, and wherein the plurality of address signals match the plurality of memory bank address signals and the write signal.

18. The processor interface module of claim 17 wherein the address source is an EXIM list module for providing a plurality of EXIM list address signals in accordance with an EXIM list.

19. A memory interface module for use with a memory resource that includes M memory banks that are allocated to a plurality of slots of a backplane of a bus that operates in accordance with a sequence of frames, each frame having a plurality of bus cycles, the memory interface module comprising: a logic device for receiving a synchronization signal for resetting the logic device, and a sequence select signal for selecting one of a plurality of memory bank allocation sequences, the logic device also providing a plurality of memory bank address signals; a decoder, coupled to the logic device, for receiving the plurality of memory bank address signals, and for providing a plurality of bank select signals; and a board selection detector, cooperative with the logic device and the decoder, for detecting that the board has been selected and providing a cycle accept signal.

20. The memory interface module of claim 19 wherein the board selection detector includes: a base address register for receiving a plurality of address signals, and for providing a plurality of base address signals; and a comparator, coupled to the base address register, for receiving the plurality of address signals, and the plurality of base address signals, and for providing an accept signal upon detecting a condition wherein the plurality of address signals match the plurality of base address signals, thereby indicating that the board has been selected.

21. A processor interface module for use with a memory resource that includes M memory banks that are allocated to a plurality of slots of a backplane of a bus that operates in accordance with a sequence of frames, each frame having a plurality of bus cycles, the processor interface module comprising: a first logic device for receiving a synchronization signal for resetting the logic device, and a sequence select signal for selecting one of a plurality of memory bank allocation sequences, the logic device also providing a plurality of memory bank address signals; an address source for providing a plurality of address signals, including an acceptance condition signal; a first comparator, coupled to the logic device and to the address source, for receiving the plurality of memory bank address signals, the plurality of address signals, and a write signal, and for providing a bank match accept signal upon detecting a condition wherein the plurality of address signals, excepting the acceptance condition signal, match the plurality of memory bank address signals and the write signal; a second comparator for receiving a plurality of grant signals and a plurality of slot signals, and for providing a slot match signal upon detecting a condition wherein the plurality of grant signals match the plurality of slot signals; and a second logic device, coupled to the address source and to the second comparator, for receiving the acceptance condition signal, the slot match signal, and the bank match signal, and for providing an accept signal upon detecting a condition wherein the slot match signal and the bank match signal are in a first logic state and the acceptance condition signal is in a second logic state, and upon detecting a condition wherein the acceptance condition signal is in a first logic state, the slot match signal is in the first logic state, and the slot match signal was not in the first logic state on a previous bus cycle.

22. The processor interface module of claim 21 wherein the address source includes an EXIM list module for providing a plurality of EXIM list address signals in accordance with an EXIM list.

23. A memory interface module for use with a memory resource that includes M memory banks that are allocated to a plurality of slots of a backplane of a bus that operates in accordance with a sequence of frames, each frame having a plurality of bus cycles, the memory interface module comprising: a board selection detector, coupled to the logic device, for detecting that the board has been selected and providing a board selection signal; a logic device for receiving a synchronization signal for resetting the logic device, a sequence select signal for selecting one of a plurality of memory bank allocation sequences, a second plurality of address signals, and the board selection signal, the logic device also for providing an accept signal; and a decoder for receiving the second plurality of address signals, and for providing a plurality of bank select signals.

24. The memory interface module of claim 23 wherein the board selection detector includes: a base address register for receiving a first plurality of address signals, and for providing a plurality of base address signals; and a comparator, coupled to the base address register, for receiving the first plurality of address signals, and the plurality of base address signals, and for providing the board selection signal upon detecting a condition wherein the first plurality of address signals match the plurality of base address signals.