WO1987002800A1 - Integrated, multicomputer data processing system - Google Patents


Info

Publication number
WO1987002800A1
WO1987002800A1 (PCT/US1986/002262)
Authority
WO
WIPO (PCT)
Prior art keywords
processor
data
memory
user
section
Prior art date
Application number
PCT/US1986/002262
Other languages
French (fr)
Inventor
Glen J. Culler
Robert B. Pearson
Michael Mccammon
William L. Proctor
John L. Richardson
Original Assignee
Culler Scientific Systems Corporation
Priority date
Filing date
Publication date
Application filed by Culler Scientific Systems Corporation filed Critical Culler Scientific Systems Corporation
Publication of WO1987002800A1

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00: Arrangements for program control, e.g. control units
    • G06F 9/06: Arrangements for program control using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/30: Arrangements for executing machine instructions, e.g. instruction decode
    • G06F 9/38: Concurrent instruction execution, e.g. pipeline, look ahead
    • G06F 9/3877: Concurrent instruction execution using a slave processor, e.g. coprocessor
    • G06F 9/3879: Concurrent instruction execution using a slave processor for non-native instruction execution, e.g. executing a command; for Java instruction set

Definitions

  • This invention relates generally to a data processing system having a plurality of user processors for performing parallel numerical operations assigned to the user processors and, more specifically, to an integrated, multicomputer data processing system having a user processor section wherein the user processor includes an organizational processor and a math processor capable of operating in parallel to perform numerical operations by executing specific portions of an instruction stream in parallel.
  • FIG. 1 discloses a multicomputer system which increases processing speed of a computer by use of a plurality of independent computers operating in parallel.
  • the multicomputer system of FIG. 1 is capable of handling multiple tasks by use of multiple processor/memory combinations which perform parallel task execution.
  • Such a multicomputer system is generally referred to as a "distributed" system because the system operates on the principle that separate tasks, illustrated by rectangles 30 in FIG. 1, are distributed to separate computing systems through memories, illustrated by rectangles 34, which are operatively coupled to a plurality of parallel CPU's designated by rectangles 36.
  • The individual CPU's 36, which typically are individual microcomputers, operate at 1.5 to 2 million instructions per second ("MIPS") of integer performance but have limited floating point performance levels.
  • efficient multicomputer systems are required which maintain close coupling mechanisms.
  • the multicomputer system of FIG. 1 requires that the operating system design be inherently capable of managing multiple task processors efficiently.
  • the multicomputer systems illustrated in FIG. 1 are utilized in low-cost applications having a large volume of low-performance tasks to be performed such that the tasks can be distributed among several microcomputers.
  • such low-performance tasks include business and personal computer applications.
  • Multicomputer systems of the type illustrated in FIG. 1 are limited with respect to high-performance and high math applications such as, for example, bound scientific applications including simulations, signal processing, image processing, and the like.
  • Another known approach to increase processing speed is by use of a coprocessor system, which is illustrated in FIG. 2, captioned "Prior Art."
  • the coprocessor of FIG. 2 is capable of handling single tasks by use of pre/post processors, based upon parallel, multi-CPU instruction execution.
  • Another known approach for increasing the processing speed of a large-scale digital computer is generally referred to as a pipelined computer system, such as that illustrated in FIG. 3, labeled "Prior Art."
  • the computer system relies on certain known parameters such as, for example, that certain types of computer operations are deterministic, that is, predictable to a high degree.
  • The cache is a major contributor to the speed, since it acts as a super high-speed "scratchpad" memory for the computer: it serves the parallel prefetch hardware, which races ahead to fetch data from memory and to store data, and it gives the processor a place to quickly find the data when it is time to start the next instruction.
  • a cache memory may be cycling at 50 nanoseconds, compared to regular memory which may be cycling at 400 nanoseconds, such that considerable time can be saved.
  • When a branch instruction or subroutine is required, the memory cache must be emptied and reloaded with instructions starting at the new branch location.
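The cycle times quoted above can be put in perspective with a short sketch. The 50 ns and 400 ns figures come from the text; the access count and hit rate are hypothetical, chosen only to illustrate the arithmetic:

```python
CACHE_NS = 50      # cache cycle time from the description
MEMORY_NS = 400    # regular memory cycle time from the description

def total_access_time_ns(accesses: int, hit_rate: float) -> float:
    """Total time for a run of memory operations given a cache hit rate."""
    hits = accesses * hit_rate
    misses = accesses - hits
    return hits * CACHE_NS + misses * MEMORY_NS

# With a 90% hit rate, 1000 accesses take 85,000 ns instead of 400,000 ns.
print(total_access_time_ns(1000, 0.9))   # 85000.0
print(total_access_time_ns(1000, 0.0))   # 400000.0
```

The same arithmetic shows why a forced cache reload on a branch is costly: every miss pays the full 400 ns memory cycle.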
  • This invention relates to a new, novel and unique modular multicomputer system which utilizes a computer architecture which is based upon a modular multicomputer design in which a plurality of integrated user processors operate in parallel from instructions which form part of an instruction stream under scheduling provided by an operating system processor and a system controller within the integrated, multicomputer data processing system.
  • the integrated, multicomputer data processing system of the present invention has a control section which includes means for a system input and output and user processor sections having a math processor and an organizational processor which loads and unloads data and instructions enabling the math processor to operate at maximum efficiency.
  • the user processor section includes means responsive to the addressing signals for fetching instructions from a loaded instruction cache in the user processor section and for enabling the math processor to execute programmed arithmetic operations on the loaded data signals in response to command signals and tasks scheduled for the user processor by the operating system processor via the system controller.
  • the computer architecture is based upon the type of operations to be performed by the computer, be they determinative or nondeterminative operations.
  • high-speed, special purpose computers are formed by a combination of scientific computers with special purpose, high-speed array processors to handle the determinative calculations which are generally required for bound applications.
  • The processing problems comprise a blend of integer and array processing requirements. If a large amount of array processing content is contained within the scientific application, supermini computers are used. However, they have slow floating point capability and, as such, are not able to perform the desired array processing at high speeds. Thus, the use of an array processor as a special purpose, high-speed digital computer is required in combination with a scientific computer to obtain the desired speeds. This results in a higher price-to-performance ratio. Usually, computer systems having the lowest price-to-performance ratio are desired for scientific applications.
  • FIG. 1 is a prior art logic diagram which represents a multicomputer system which is capable of performing multiple tasks by means of a multiple processor/memory combination and parallel task execution;
  • FIG. 2 is a prior art logic diagram which represents a coprocessor computer system of the prior art which is capable of performing a single task using pre-post processing parallel multiple central processing units for executing the instructions;
  • FIG. 3 is a prior art logic diagram of a pipelined computer system which is capable of performing a single task with multistage execution utilizing a single processing unit and serial execution;
  • FIG. 4 is a diagrammatic representation of an integrated, multicomputer data processing system which includes one user processor section having a math processor and an organizational processor;
  • FIG. 5 is a diagram representing an integrated, multicomputer data processing system having N user processor sections;
  • FIG. 6 is a block diagram of a typical user processor showing the various components thereof;
  • FIG. 7 is a diagrammatic representation of a multiple task, multiple processor/memory combination which demonstrates the capability of the integrated, multicomputer data processing system of the present invention interfaced with a plurality of work stations;
  • FIG. 8 is a logic diagram showing the functional relationships between the user processor, system memory bus, and staging address bus for transferring instructions and data in parallel into the user processor;
  • FIG. 9 is a logic diagram illustrating the relationship between the operating system processor, which is operatively coupled to an input/output data transfer means, and a system controller;
  • FIG. 10 is a logic diagram illustrating the relationship between the system controller, the system memory bus, and the system addressing bus;
  • FIG. 11 is a logic diagram illustrating the various components of an XY machine which functions as the math processor and an A machine which functions as the organizational processor, each of which is capable of performing arithmetic operations;
  • FIG. 12 is a block logic diagram illustrating the operation of the A machine in performing arithmetic calculations in the user processor;
  • FIG. 13 is a block diagram showing the relationship between the various components of the XY machine, showing specifically the shared arithmetic components between the X machine segment and the Y machine segment thereof;
  • FIG. 14(A) and 14(B) are block diagrams illustrating the arithmetic registers for the X input and Y input of the XY machine, respectively;
  • FIG. 15 is a block logic diagram illustrating the X memory address controller;
  • FIG. 16 is a block logic diagram illustrating the data path for the XY random access memory;
  • FIG. 17 is a block diagram illustrating the operation of the instruction sequencer which controls the decoding of instructions to provide control signals to the XY machine and the A machine;
  • FIG. 18 is a diagrammatic representation of a compiler illustrating a variety of front end language processors applied through a compiler interface to a back end user processor code generator which generates executable application object codes for executing an optimum number of calculations of the mix of operations in the user processor;
  • FIG. 19 is a block diagram illustrating the operating system functions of the integrated, multicomputer data processing system wherein a wide variety of standard input-output devices at the applications layer are interfaced to the system interface layer and the tasks are distributed to a hardware layer and executed in parallel.
  • FIGS. 1, 2 and 3 relate to the prior art computer system architectures and means for executing instructions and have been discussed hereinbefore in the Description of the Prior Art section.
  • FIG. 4 diagrammatically represents the integrated, multicomputer data processing system of the present invention having a global memory section 100, a high-speed input-output section 102 which is operatively connected to a fast disk 104, and a control section shown generally by dashed lines 108.
  • The control section includes an input/output data transfer means 110 for system input and output. The input/output data transfer means 110 is adapted to network or interact with a wide variety of state-of-the-art peripheral devices such as, for example, a system disk 114, a terminal 116, magnetic tape 118, a network of work stations 120 and 122 which are operatively connected through a networking system such as ETHERNET 126 to the input/output data transfer means 110, and data communications 128 which likewise are connected to the input/output data transfer means 110.
  • the control section 108 includes a system memory bus 132 and staging bus 136, each of which is capable of transferring 64 bits of data and 32 bits of address.
  • the control section 108 further includes an operating system processor 140 which is operatively connected to the input/output data transfer means 110 for input-output operations by a connecting means illustrated by arrow 142. Also, an operating system processor random access memory 146 is operatively connected to the operating system processor 140 through a connecting means illustrated by arrow 148. Also, the random access memory 146 is operatively connected to the input/output data transfer means through a connecting means, as illustrated by arrow 150.
  • the control section 108 further includes a system controller 154 which operates under control of the operating system processor 140.
  • the system controller 154 is operatively connected to the input/output data transfer means by a communication means which is represented by arrow 158.
  • the operating system processor 140 controls the operation of the system controller 154 by means of execution signals and interrupt signals which are applied from the operating system processor 140 and connecting means 142 through the input/output data transfer means 110 and communication means 158 to the system controller 154.
  • the system controller 154 is operatively coupled via connecting means illustrated by arrow 162 to the system memory bus 132 and is connected via a communication means illustrated by arrow 164 to the staging bus 136.
  • the integrated, multicomputer data processing system is a modular system and includes at least one user processor section illustrated by the dashed box 200 which is operatively connected between the system memory bus 132 and the staging bus 136.
  • the user processor section 200 includes an organizational processor 202 having a random access local memory 210 and a math processing section 218.
  • the random access local memory 210 is operatively connected to the organizational processor 202 through a connecting means evidenced by arrow 212, and a multistage, first-in, first-out (FIFO) buffer register 214.
  • the FIFO buffer register 214 is connected both to the organizational processor 202 and to the math processor section 218 by a connecting means 216.
  • the math processor section 218 is connected to the local memory 210 through a single-stage output register 220 and connecting means 222.
  • the local memory 210 is operatively connected to the system memory bus 132 by a connecting means as evidenced by arrow 218.
  • the organizational processor 202 is connected by a connecting means 226 to the system controller 154.
  • the user processor 200 includes a program memory 228 which is operatively coupled to the organizational processor 202 by a connecting means shown by arrow 230.
  • the program memory (sometimes referred to as the "P Memory") 228 is also operatively connected to the staging bus 136 via a connecting means represented by arrow 232.
  • the staging bus 136 loads the data caches 246 and 248 via connecting means 232 and 206 respectively.
  • the user processor's math processing section 218 includes two sets of dual data cache memories 246 and 248.
  • the math processing section 240 includes a math processor 252 having a plurality of connecting means, shown generally as 260, which interconnects each of the two sets of dual data caches 246 and 248 to the math processor 252 such that data can be transferred therebetween.
  • the data caches 246 of the math processing section 218 and the P Memory 228 are likewise connected to the staging bus 136 by connecting means illustrated by arrow 232.
  • Data transfer from the local memory 210 to the organizational processor 202 occurs over connecting means 232 and an input buffering means such as the FIFO buffering register 214.
  • Data transfer from the math processor section 218 occurs through the output means 220, which in the preferred embodiment is in the form of a single stage, clocked output register which is capable of transferring numeric results from the math processing section 218 to the output register 220 in one clock cycle.
  • the output register 220 transfers the same to the local memory 210 for further processing.
  • the communication between the local memory 210 and the organizational processor 202 is in the form of an asynchronous transfer of data and, in the preferred embodiment, utilizes a three-stage, first-in-first-out buffer register as an input buffering means 214.
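The three-stage first-in-first-out input buffer described above can be modeled with a minimal sketch. The class and method names are illustrative, not from the patent; only the three-slot, first-in-first-out behavior is taken from the text:

```python
from collections import deque

class ThreeStageFIFO:
    """Minimal model of the three-stage FIFO input buffering means 214."""
    def __init__(self, stages: int = 3):
        self.stages = stages
        self.slots = deque()

    def push(self, word) -> bool:
        """Producer side (local memory): returns False when all stages are full."""
        if len(self.slots) >= self.stages:
            return False
        self.slots.append(word)
        return True

    def pop(self):
        """Consumer side (organizational processor): returns None when empty."""
        return self.slots.popleft() if self.slots else None

fifo = ThreeStageFIFO()
assert all(fifo.push(w) for w in ("a", "b", "c"))
assert not fifo.push("d")        # fourth push must wait: all stages are full
assert fifo.pop() == "a"         # first in, first out
assert fifo.push("d")            # a stage has been freed by the consumer
```

Because producer and consumer touch only opposite ends of the buffer, the two sides can run asynchronously, which is the point of the arrangement described above.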
  • the math processing 218 is operatively connected to the P Memory 228 by connecting means illustrated by arrow 276.
  • the organizational processor 202 has direct communication to the math processor section 218 over communication means 216 which permits data transfer therebetween without going through a register.
  • the organizational processor functions to keep a number of instructions ready for the math processor such that an instruction can be immediately loaded into the math processor together with the applicable data.
  • The organizational processor loads the next instructions and data while the result of the arithmetic operation just completed by the math processor is loaded into the local memory.
  • the operating system processor 140 controls operation of the integrated, multicomputer data processing system by means of the input/output data transfer means 110 which interfaces with the input-output devices 114 through 128, and which applies interrupt and execution signals to the system controller 154 to control overall system operation.
  • the system controller in turn, controls the organizational processor 202 via communication means 226.
  • the system controller 154 under control of the operating system processor 140 functions to load data from the system memory bus 132 into the user processor 200.
  • the staging bus 136 loads the user processor 200 with data and instructions concerning the programmed arithmetic operations.
  • the user processor 200 is able to communicate over the system memory bus 132 with the high-speed input-output processor 102 by a connecting means shown generally by arrow 280.
  • the user processor 200 can communicate with the global memory 100 by means of the system memory bus 132 and a connecting means illustrated by arrow 282.
  • the integrated, multicomputer data processing system is modular in design and can have additional user processor sections added thereon to increase both MIPS and MFLOP operating rates of the data processing system.
  • the operating system processor 140 including its random access memory 146, operates to schedule and assign tasks and command signals through the system controller 154 over the system memory bus 132 and the staging bus 136 to each user processor.
  • Information is available to the operating system processor from any of the peripheral devices 114, 116, 118, 120, 122, 126 or 128 by means of the input/output data transfer means 110.
  • the input/output data transfer means also is capable of loading information directly into or receiving information from the random access memory 146 and from the operating system processor 140.
  • the operating system processor 140 via the connecting means 142, input/output data transfer means 110, and connecting means 158 schedules and assigns tasks to each user processor independently through the system controller 154.
  • the operating system processor 140 generates command signals in the form of interrupt signals and execution signals for controlling system operations and also, through the system controller 154, controls operation of the high-speed input-output processor 102, including the fast disk 104 and the global memory 100.
  • FIG. 5 illustrates an integrated, multicomputer data processing system which includes the same components illustrated in FIG. 4 and which are illustrated by the same numbers utilized in FIG. 4, together with the first user processor section 200, a second user processor section 302, and an N user processor 304.
  • the integrated, multicomputer data processing system is capable of operating with up to four user processors based upon the design set forth herein.
  • the operating system processor, system controller, system memory bus, and staging bus could be designed and programmed with appropriate compilers and the like in order to handle more than four user processors in order to increase the MIPS and MFLOP rates of processing data.
  • the operating system processor 140 controls operation of each of the user processors 200, 302 and 304 as illustrated in FIG. 5.
  • FIG. 4 could be duplicated, for example with four user processors in a group, and then arranged with each group in a cluster configuration having a cluster operating system processor to schedule and assign tasks to each group of four user processors, all of which operate independently.
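A sketch of the operating system processor assigning tasks independently to each user processor follows. The patent does not specify a scheduling policy, so round-robin here is purely a hypothetical placeholder; only the fan-out to up to four user processors comes from the text:

```python
from itertools import cycle

def schedule(tasks, num_user_processors=4):
    """Assign each task to one of the user processors in round-robin order
    (hypothetical policy; the patent leaves scheduling to the operating
    system processor and system controller)."""
    assignment = {p: [] for p in range(num_user_processors)}
    rr = cycle(range(num_user_processors))
    for task in tasks:
        assignment[next(rr)].append(task)
    return assignment

plan = schedule(["t0", "t1", "t2", "t3", "t4"])
print(plan[0])  # ['t0', 't4']
print(plan[1])  # ['t1']
```

A cluster configuration, as suggested above, would simply repeat this fan-out one level up, with a cluster operating system processor distributing work among groups of four.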
  • FIG. 6 illustrates in greater detail the method for loading both data signals and addressing signals into the user processor.
  • FIG. 6 illustrates that the system memory bus 132 also includes a system address bus shown as 132' and that the staging data bus 136 includes a staging address bus 136'.
  • FIG. 6 illustrates the major components and connecting means with the same numeral designations as those of FIG. 4 and further includes a connecting means between the organizational processor 202 and the system address bus 132', illustrated by arrow 204'.
  • the organizational processor 202 is operatively connected to the staging address bus 136' by connecting means shown as arrow 206'.
  • the local memory 210 is operatively connected by a connecting means shown by arrow 218' to the system address bus 132'.
  • the math processing section 240 is likewise connected to the staging address bus 136' by a connecting means shown as arrow 266'.
  • the transfer of data signals and address signals is performed by means of the system memory bus 132, the system address bus 132', the staging data bus 136 and the staging address bus 136'.
  • the system controller 154 is operatively coupled to the operating system processor 140 for receiving command signals which command the system controller 154 to transfer data signals over the staging data bus 136 and to transfer address signals over the staging address bus 136' to the user processor 200 to preload the user processor for the next operation while the user processor is executing a program.
  • the system controller concurrently transfers data signals over the system memory bus 132 and addressing signals over the system address bus 132'.
  • the operating system processor 140 is generally referred to as a Kernel Processor.
  • the operating system processor comprises a Sun 2 68010 processor/memory board with two megabytes of memory and an attached input/output data transfer means for system I/O.
  • the Kernel Processor runs the Berkeley version 4.2bsd UNIX operating system.
  • When the operating system processor 140 desires to initiate a task in another processor, the operating system processor 140 fills out a task control block specifying the task, its size in bytes, and its starting address in memory, together with additional information.
  • The operating system processor 140 then commands the system controller 154 to load the task into a user processor, such as user processor 200, followed by a later order to the user processor 200 to begin executing the task.
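The task control block just described names three concrete fields: the task, its size in bytes, and its starting address in memory. A minimal sketch, with the `priority` field standing in for the unspecified "additional information" (purely hypothetical):

```python
from dataclasses import dataclass

@dataclass
class TaskControlBlock:
    """Fields taken from the description: task identity, size in bytes,
    and starting address. `priority` is a hypothetical stand-in for the
    'additional information' the text leaves unspecified."""
    task_id: int
    size_bytes: int
    start_address: int
    priority: int = 0

tcb = TaskControlBlock(task_id=1, size_bytes=4096, start_address=0x1000)
# The operating system processor would hand this block to the system
# controller, which loads the task and later orders its execution.
print(tcb.size_bytes)  # 4096
```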
  • the system controller 154 is a special block transfer device which loads the user processors 200, 302 and 304 (FIG. 5) with tasks to be executed based upon commands from the operating system processor 140. In essence, the system controller 154 operates directly under control of the operating system processor 140.
  • The system controller 154 performs two specific functions, namely: (1) transfer of data between a user processor and common data memory; and (2) support of input/output transfers between the input/output devices and common data memory.
  • the user processor 200 has a large program cache 228 which has a capacity in the order of 256 KB.
  • The user processor 200 includes duplicated context switching facilities which permit the user processor 200 to execute one task while the system controller 154 is removing the last executed task from the user processor or is loading the next application task into the user processor.
  • the system controller 154 performs the function of saving the information in common data memory and later fetching the same from storage on disk.
  • The system controller 154 functions to keep the user processor 200 operating at maximum efficiency by minimizing the time the user processor 200 spends waiting to receive a new task or transferring data during the various staging processes.
  • the system controller 154 functions to enable the user processor 200, or any other user processors operatively connected to the system, such as user processors 302 and 304 in FIG. 5, to operate independently at maximum efficiency.
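The benefit of the duplicated context switching facilities can be shown with a small timing sketch. The task times are invented; only the overlap idea (the system controller loads the next task while the user processor executes the current one) comes from the text:

```python
def serial_time(tasks):
    """Each task is (load_time, exec_time); no overlap between staging
    and execution."""
    return sum(load + run for load, run in tasks)

def staged_time(tasks):
    """With duplicated context facilities, loading the next task is
    hidden behind execution of the current one."""
    total = tasks[0][0]                     # first load cannot be hidden
    for i, (_, run) in enumerate(tasks):
        next_load = tasks[i + 1][0] if i + 1 < len(tasks) else 0
        total += max(run, next_load)        # overlap load with execution
    return total

jobs = [(3, 10), (3, 10), (3, 10)]          # hypothetical (load, exec) pairs
print(serial_time(jobs))   # 39
print(staged_time(jobs))   # 33
```

Whenever execution time dominates load time, staging makes the load cost of every task after the first disappear, which is exactly the "maximum efficiency" goal stated above.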
  • the user processor section 200 includes a random access memory 210 and a buffering means 214, in the form of a first-in-first-out buffer register, for enabling programmed, asynchronous transfer of numerical results between organizational processor 202 of the user processor section 200 and the random access memory means 210.
  • The local memory applies the information via a connecting means to the first-in-first-out buffer register 214 and receives information from the output register 220.
  • the organizational processor 202 includes means which are responsive to an instruction stream read out from the instruction cache 228 in response to instruction addresses.
  • a certain portion of the instruction stream relates to the math processor 252 and the applicable portion of the instruction stream is decoded to produce control signals which are used to control machine segments having shared elements of the math processor to execute selected mathematic operations on the data signals transferred to the user processor section 200.
  • FIG. 7 illustrates diagrammatically the architectural structure of the integrated, multicomputer data processing system of the present invention for performing multiple tasks utilizing multiple processor/memory combinations.
  • the task source may be initiated in a work station computer with the work station task being identified by rectangles 300 and 302 illustrating "n" work stations.
  • the work station task 300 is applied to a memory 304 while work station tasks 302 are applied to "n" number of memories 306.
  • The memory 304 applies instructions from memory to the work station processor 310 while the other "n" memories 306 likewise apply instructions to "n" processors designated at 312.
  • the outputs of the work station processors 310 and 312 are applied via an input/output data transfer means, such as MULTIBUS or ETHERNET for a local area network, as illustrated in FIG.
  • the operating system processor 140 receives a task such as, for example, a task assigned from the work station processor 310, which is designated by rectangle 320.
  • the task is then stored in the operating system computer memory 146 which, in turn, applies the instructions to the operating system processor 140, which ultimately applies the same to a data memory.
  • the operating system processor 140 then applies appropriate command signals together with the appropriate instructions and data via the system controller 154, to the user processor section.
  • the application task being assigned to the user processor is designated by rectangle 342 for a single task and by rectangles 344 for application tasks being assigned to other user processors.
  • the application task 342 assigned to the user processor is stored in the user processor memory, such as memory 210.
  • The other user processors apply their tasks to the associated "n" memories designated by memories 348.
  • the memories 210 then apply the instructions to the organizational processor 202 and to the instruction cache which is represented as a processing and instruction cache by rectangle 350 in FIG. 7.
  • the instruction cache included in rectangles 350 and 352 can be loaded from the staging bus.
  • the processing and instruction caches for the other "n" user processors are designated by rectangles 352.
  • An instruction cache and processing section, shown generally by rectangle 350, performs the function of receiving an instruction stream and processing the same such that certain portions of the instruction stream relating to the mathematical processor are decoded to produce a control signal which is applied to the appropriate segment of the math processor 252. Similarly, certain portions of the instruction stream are decoded to produce a control signal which is applied to the organizational processor 202, which is capable of performing simple arithmetic calculations.
  • the math processor 252 and the organizational processor 202 execute the programmed operations on the data and transfer the results therefrom to the local memory 210 through an output register such as the single stage output register 272 in FIG. 1.
  • the organizational processor 202 also ensures that sufficient instructions and data are available for loading into the math processor upon completion of a performed arithmetic operation to keep the math processor operating at maximum efficiency.
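The splitting of one instruction stream between the two processors, as described above, can be sketched as a simple dispatch. The tags and opcode names are inventions for illustration; the text says only that certain portions decode into XY-machine controls and others into A-machine controls:

```python
def dispatch(instruction_stream):
    """Route each decoded instruction to the math processor (XY machine)
    or the organizational processor (A machine). Tags and opcode names
    are hypothetical."""
    xy_controls, a_controls = [], []
    for unit, op in instruction_stream:
        if unit == "XY":
            xy_controls.append(op)   # e.g. a floating-point operation
        else:
            a_controls.append(op)    # e.g. address or loop arithmetic
    return xy_controls, a_controls

stream = [("XY", "fmul"), ("A", "inc_index"), ("XY", "fadd"), ("A", "branch")]
xy, a = dispatch(stream)
print(xy)  # ['fmul', 'fadd']
print(a)   # ['inc_index', 'branch']
```

Both lists can then be consumed in parallel, which is how the two machines execute their portions of the stream concurrently.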
  • The user processor, as a separate processor application section, can be viewed as a single task, parallel execution computer system.
  • FIG. 8 is a logical block diagram representing the architecture of a user processor 200 illustrated in FIGS. 4 and 6 and of the additional user processors 302 and 304 as illustrated in FIG. 5.
  • The logic diagram is an illustration for an integrated, multicomputer data processing system having two identical user processor sections shown as 540.
  • Transfer of information into the integrated, multicomputer data processing system from the outside world is accomplished through peripheral devices which are applied to an input/output data transfer means 500 which, in turn, transfers the information between the input-output devices and the operating system processor 502.
  • the operating system processor 502 communicates over the input/output data transfer means 500 with the system controller shown by rectangle 506.
  • the system controller is operatively connected to a system memory bus 510 and a system address bus 512.
  • the system memory bus is capable of communicating 64 bits of data and 32 bits of address within the data processing system.
  • the system controller 506 is operatively connected to the staging data bus 514 and to a staging address bus 516.
  • the staging data bus transfers data having 64 bits and addresses having 32 bits within the data processing system to specific user processors.
  • the system memory bus 510 and the system address bus 512 are operatively connected to a global memory 522 by means of a system memory bus segment 524 and a system address bus segment 526.
  • Local data memories 548 are likewise connected to the system data bus 510 via bus segment 536 and to the system address bus 512 via bus segment 558.
  • the controller and high-speed interface 528 communicate to the high-speed input-output device over a bus 534.
  • the controller and high-speed interface 528 likewise is operatively connected to the system memory bus 510 via a system memory bus segment 530 and to the system address bus 512 by system address bus segment 532.
  • the controller and high-speed interface 528 is connected to the staging data bus 514 by means of staging data bus segment 518 and to the staging address bus 516 by staging address bus segment 520.
  • the controller and high-speed interface 528 is adapted to apply data to and receive data from each of a system memory bus 510, system address bus 512, staging data bus 514, and staging address bus 516.
  • each user processor 540 includes a math processor, generally referred to as the "XY machine" 544, an organizational processor 546, generally referred to as the A machine, and a local memory 548. Also, each user processor includes an instruction cache or program memory 550.
  • the system memory bus 510 is operatively connected to the XY machine 544, to the A machine 546 and to the local memory 548 by means of a local data bus segment 556.
  • System address bus 512 is operatively connected to the organizational processor 546 and to the local memory 548 by means of a local address bus segment 558.
  • the instruction cache 550 is operatively connected to the staging data bus 514 via staging bus segment 560 and to the staging address bus 516 via a staging address bus segment 556.
  • the staging data bus 514 is operatively connected to the XY machine 544 by means of staging data bus segment 562.
  • the XY machine 544 is operatively connected to the staging address bus 516 via staging address bus segment 564.
  • the instruction cache 550 is operatively connected to the XY machine via a bus 570 and to the organizational processor, or the A machine, by bus segments 522.
  • the system controller 506, under control of the operating system processor 502, transfers data into and out of the user processors 540 by means of the system memory bus 510 and the staging data bus 514. The transfer of data can be in parallel, as evidenced by each bus segment over the system data bus to global memory.
  • FIG. 9 is a logic block diagram which illustrates the process utilized by the integrated, multicomputer data processing system for preloading the user processor with data signals and addressing signals to ensure that the user processor is continuously loaded with assigned tasks in the form of programs and data and for transferring executed arithmetic operation results from the user processor P Memory and then to the input/output data transfer means and I/O devices.
  • the operating system processor 502 is operatively connected to the input/output data transfer means 500 to transfer information between the input-output devices and the operating system processor 502.
  • the operating system processor 502 is operatively connected to the system controller 506 and applies program signals, interrupt signals, and execution signals to the system controller 506.
  • When the operating system processor 502 desires to initiate a task in the form of a programmed arithmetic operation in a user processor, the operating system processor 502 completes a task control block specifying the task or arithmetic operation to be performed, the size of the task in bytes, and the starting address in memory, together with additional control information.
  • When the task information has been assembled by the operating system processor 502, the operating system processor 502 generates an interrupt signal which is applied to the system controller 506 via the input/output data transfer means 500.
  • the system controller 506 receives the task control block of information from the operating system processor 502 and loads the information contained within the task control block such that the data and instruction addresses required for the system memories are loaded over the system memory bus 132 to the user processor, while the staging information, in terms of the programmed arithmetic operation and the data signals and addressing signals therefor, is loaded into the user processor over the staging bus 136.
  • the operating system processor 502, together with the system controller 506, is capable of controlling up to four user processors as illustrated in FIG. 5.
  • the system memory bus 132 is capable of transferring data between up to eight banks of memory wherein each memory comprises up to 32 megabytes and the data can be transferred at a rate of 56 megabytes per second.
  • each user processor has a local memory 210 which is an associated local memory for that specific user processor and the system memory bus 132 has system access to all of the memory banks including the local memory banks.
  • FIG. 10 is a logical block diagram which illustrates the logic scheme wherein a math processor 546 can gain access to its local memory 548 without the necessity of utilizing the system memory bus 136.
  • This permits the operating system processor to determine, on a priority scheme, whether the system memory bus 136 has a higher priority data transfer request and, if so, the user processor 546 requiring data can access the same from the local memory 548 over the local data bus segment 556. However, if the user processor 546 requires data located in another memory, such as memory 548, that information can only be accessed over the system memory bus 136. If the user processor 546 requires data signals and addressing signals which are to be loaded therein from the staging bus, the staging bus 132 must directly load the user processor 546 with the required information through the system controller 506 under control of the operating system processor 502.
  • each of the user processors 540 includes an XY machine which functions as the math processor.
  • the XY machine includes memories which are capable of processing data at the rate of 56 MB per second.
  • the X machine and Y machine include staging memories which are able to transfer data at the rate of 56 MB per second.
  • Each user processor includes instruction and data page tables, program memory organized at 128 pages of 2 KB each capable of transferring data at the rate of 112 MB per second, and data memories organized into 8 KB pages of memory wherein the data memories are capable of transferring data at the rate of 56 MB per second.
  • FIG. 11 is a logical block diagram illustrating the elements of the XY machine which functions as the math processor and of the A machine which functions as the organizational processor.
  • the A machine includes a program memory 600 which provides an instruction stream selected by means of a program sequencer 604 which is responsive to addressing signals.
  • the output of the program sequencer 604 is applied to an address translation unit 606 which provides the specific location in the program memory 600.
  • the information contained in the program memory 600 at the location address derived by the address translation unit 606 is read out as an instruction stream into the instruction buffer 608.
  • Instruction buffer 608 applies the specific instruction, in the form of a 96-bit length word, into an instruction extraction means 610.
  • the instruction extraction means 610 produces two instructions as an output, an XY instruction as shown by arrow 612, and an A instruction as shown by arrow 614.
  • the A machine instruction 614 is applied to an A decoder 616 which produces a control signal as an output, as illustrated by arrow 620.
  • the output is applied to the remaining portion of the A machine, or organizational processor, which is capable of responding to the control signal to execute basic arithmetic functions such as shifts, logic operations (AND, OR, NOT, XOR, etc.), addition, subtraction, multiplication, and control functions (e.g., branch).
  • the A machine includes a register file 630 and a set of special registers 632 to provide the data upon which the arithmetic operation is to be performed, as designated by the A control 620.
  • the output of register file 630 is applied to a left shift unit 634, to a multiplication unit 636, or to an ALU (adder) unit 638.
  • the output of the ALU 638 is returned to one of the A machine special registers.
  • the arithmetic results are applied to either of the XY buses, the XB or YB bus, so that synchronous transfer of data can be obtained between the organizational processor A machine and the XY machine.
  • the operation of the A machine is under clocked control such that each arithmetic operation is performed in one clock cycle.
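The A-machine data path described above (a register file feeding a left shift unit, a multiplication unit, or an ALU, with one operation completing per clock) can be sketched in C. The operation encodings, register counts, and function names below are illustrative assumptions, not taken from the patent:

```c
#include <stdint.h>
#include <assert.h>

/* Hypothetical operation codes for the A-machine data path:
   one operation completes per simulated clock cycle. */
typedef enum { A_SHL, A_AND, A_OR, A_XOR, A_ADD, A_SUB, A_MUL } a_op;

typedef struct {
    uint32_t reg[16];      /* A register file 630 (size assumed) */
    uint32_t special[4];   /* A special registers 632 (size assumed) */
} a_machine;

/* Execute one A-machine operation: read two register-file operands,
   route them through the shifter, multiplier, or ALU, and return the
   result to a special register, as in FIG. 11. */
void a_exec(a_machine *m, a_op op, int ra, int rb, int dst)
{
    uint32_t a = m->reg[ra], b = m->reg[rb], r = 0;
    switch (op) {
    case A_SHL: r = a << (b & 31); break;  /* left shift unit 634 */
    case A_AND: r = a & b;         break;  /* ALU 638 logic operations */
    case A_OR:  r = a | b;         break;
    case A_XOR: r = a ^ b;         break;
    case A_ADD: r = a + b;         break;
    case A_SUB: r = a - b;         break;
    case A_MUL: r = a * b;         break;  /* multiplication unit 636 */
    }
    m->special[dst] = r;   /* result returns to one of the special registers */
}
```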
  • the XY instruction 612 is applied to a decode random access memory 650. If the XY instructions appearing on XY instruction input 612 contain a branch or subroutine component, that requirement is determined by a microsequencer 652 which, in turn, enables a microcode random access memory 654 to produce a microcoded instruction signal which is applied to the XY control 656. In any case, the decode random access memory 650 applies the decoded instruction during the first clock cycle to the XY control 656.
  • the XY machine has two major segments with shared elements.
  • the X machine segment includes a memory 660 which is responsive to an address unit 662 wherein data from registers 664 and 666 are loaded by the address unit 662 into the X memory 660.
  • the X machine segment contains two registers 670 and 672 which are adapted to store data read from the X memory 660.
  • the X machine segment includes a simple or basic arithmetic processor shown generally as 680 which is capable of performing IEEE/INTEGER, ALU and conversion operations, the output of which is applied to output registers 674 and 676. The output of the registers 674 and 676 is applied to the output buffer register wherein the results of the arithmetic operations are transferred directly to the organizational processor.
  • the Y machine segment includes a Y memory 684 which is controlled by Y address unit 686 which is capable of loading data from registers 690 and 692 into the Y memory. Information is read from the Y memory and is stored in registers 694 and 696.
  • the Y machine segment includes an arithmetic processor 700 which is capable of performing IEEE/INTEGER operations, multiplication, and elementary functions such as square root, divide, sine, cosine, arctangent, exponential, logarithm, and the like. The results of the arithmetic processor 700 are stored in registers 702 and 704.
  • Registers 670, 672, 694 and 696 are capable of having information from either the X machine segment or the Y machine segment gated therein under controlled gating transfers.
  • information stored in registers 674, 676, 702 and 704 can be gated therein from either the X machine segment or the Y machine segment, all as shown by the letter designations on the registers.
  • a switch 710 is located intermediate the XY machine and functions to control the gating of information into the various registers, as described above, or to control the gating of information into either the processor 680 of the X machine segment or the processor 700 of the Y machine segment.
  • FIG. 12 illustrates the flexibility of the A machine in terms of gating of instructions and data among the various registers, arithmetic units, and buses which are interconnected with the XY machine.
  • the instruction address unit 610, the A register file 630, the A special registers 632, the shift left unit 634, the multiplication unit 636, and the ALU 638 are numbered with the same numeral designations as in FIG. 11.
  • the A machine is interconnected by a plurality of buses such as, for example, AB bus 720, ALU bus 724, AR bus 744, X bus 732, Y bus 734, D data bus 740, and XY CB bus 742.
  • buses comprising the X bus 732 and Y bus 734 are used for bidirectional data, while the D data bus 740 and CB bus 742 are used for data interface.
  • the A special registers 632 are gated onto the AB bus 720 through a gating means 722 and into the A input of the ALU 638.
  • the A register file 630 B output is gated into the shift left unit 634, the output of which is gated into the B side of the ALU 638.
  • the ALU 638 produces an output which is applied to the AR bus 744 or to one of the A special registers 632.
  • FIG. 13 illustrates the X machine segment and the Y machine segment, which share common elements, by means of a block diagram including all the various buses and gating means therein.
  • the components such as, for example, the Y memory 684 and the X memory 660, the X address control 662, the Y address control 686, the various registers 664, 694, 696 and 672, the output registers 674, 676, 702 and 704, and the arithmetic unit 638 and multiplication unit 636, are designated with the same symbols as appear in FIG. 11.
  • the logic diagram of FIG. 13 illustrates that the various buses and gating means permit transfer of information between various system components including from the output registers 674, 676, 702 and 704 which are referred to as V, T, U and W registers, respectively.
  • FIGS. 14(A) and 14(B) are logic diagrams for the arithmetic registers wherein FIG. 14(A) is for the X input and FIG. 14(B) is for the Y input.
  • FIG. 14(A) illustrates that the registers 664 and 696 are gated at the output thereof such that the information contained therein can be applied to other arithmetic registers 676 and 674 to enable the results of the various executed programmed arithmetic operations to be available for subsequent operations.
  • FIG. 14(B) shows that registers 692 and 696 of the Y machine segment can be gated into the registers 702 and 704 of the Y machine segment or, by means of the various buses, can be gated into the registers of the X machine segment.
  • the arithmetic registers of FIGS. 14(A) and 14(B) include transparent or latch registers, shown generally as 693, which can be transparent or latch depending on the program.
  • FIG. 15 illustrates a logical diagram for the X memory address controller showing the various components thereof.
  • FIG. 15 illustrates that the X memory address controller controls storing and reading of data therein.
  • the address controller is capable of expanding the boundaries of stored information within the memory under control of the address controller such that the length of the data word can vary between a maximum and a minimum.
  • the elements which are common with the elements of FIG. 11 are designated by the same numerals.
  • registers 664 and 666 are adapted to act as the registers utilized for input into the X memory address controller.
  • Register 664, under clocked control, directly transfers data into arithmetic unit 670, and the output of the register 664 can likewise be clocked onto other buses, such as buses 670 and 672.
  • the input register 666 has its output applied to the same arithmetic unit 670 and to arithmetic unit 672.
  • the arithmetic calculators 670 and 672 calculate the amount of memory required for the specific data word in terms of bit storage.
  • the output of processor 670 is applied as an input to the arithmetic unit 672.
  • the output of the arithmetic unit 672 is applied to a maximum-minimum calculator comprising arithmetic units 678 and 680.
  • Arithmetic unit 678 has as one input a minimum memory limit which is determined by the x min circuit 682.
  • the input from the x min circuit 682 is applied to the arithmetic unit 678 to determine whether the required storage is under the minimum.
  • the maximum amount of storage required in the memory is determined by an x max circuit 690, the output of which is applied as one input to the arithmetic unit 680 which determines whether the maximum storage required for the data word is in excess of the maximum set by circuit 690.
  • the amount of storage required in the memory for a data word is determined by arithmetic units 678 and 680.
  • the outputs of units 678 and 680 are applied to a transfer circuit 692.
  • the Y memory controller has an identical YM address controller as illustrated in FIG. 15.
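The boundary check performed by arithmetic units 678 and 680, comparing the required storage against the limits set by the x min circuit 682 and the x max circuit 690, can be sketched as a simple range test. The function and parameter names are illustrative, not from the patent:

```c
#include <stdint.h>
#include <assert.h>

/* Sketch of the X memory address controller's boundary check: the
   computed storage requirement for a data word is compared against a
   minimum limit (x min circuit 682) and a maximum limit (x max circuit
   690). Returns 1 when the requirement lies within bounds. */
int xm_in_bounds(uint32_t required, uint32_t x_min, uint32_t x_max)
{
    if (required < x_min)   /* arithmetic unit 678: under the minimum? */
        return 0;
    if (required > x_max)   /* arithmetic unit 680: over the maximum? */
        return 0;
    return 1;               /* outcome passed on via transfer circuit 692 */
}
```

The Y memory controller's check would be identical in form, per the description.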
  • FIG. 16 is a logical block diagram of two fast (45 ns) 16 KB data caches which are used by the compilers as fast program stacks and register files for the XY math processor.
  • the byte-addressable data caches can also be used as fast array memories when required. The X memory utilizes a first data cache 700 and a second data cache 702.
  • the Y memory utilizes two data caches 708 and 710.
  • the data caches can be loaded by the system controller through clocked gating means shown generally as 712 and 714.
  • the inputs to the gating means 712 and 714 can be from a variety of sources such as the X bus, the staging bus, or organizational processor.
  • the Y memory data caches 708 and 710 are gated to receive data signals for use either as fast array memories or as byte-addressable data caches.
  • the gating means 712, 714, 716 and 722 are gated to enable the XY machine to then use the loaded data caches as active caches while the inactive caches are then loaded for subsequent use.
  • each of the dual data caches, 700 and 702 for the X machine and 708 and 710 for the Y machine may be used by the compilers as a fast program stack and as register files. In other modes, as selected by the various applications compilers, the byte-addressable data caches can also be used as fast array memories.
  • Each of the dual, 32-bit wide data caches 700 and 702 for the X machine and 708 and 710 for the Y machine can be loaded by the system controller and are duplicated to permit fast context switching through the switching means 712, 714, 720 and 722.
  • the data caches 700, 702, 708 and 710 can also be loaded by the math processor on a single location load or store basis (random access).
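The duplicated caches (700/702 for the X machine, 708/710 for the Y machine) permit fast context switching because the system controller can fill the inactive cache while the math processor works from the active one. This double-buffering pattern can be sketched as follows; the structure and function names are illustrative assumptions:

```c
#include <stdint.h>
#include <string.h>
#include <assert.h>

#define CACHE_WORDS 4096   /* 16 KB of 32-bit words, per the description */

/* A pair of data caches: one active for the math processor, one being
   loaded by the system controller, swapped on a context switch. */
typedef struct {
    uint32_t bank[2][CACHE_WORDS];
    int active;            /* index of the bank the XY machine reads */
} dual_cache;

/* The system controller loads the *inactive* bank. */
void cache_load(dual_cache *c, const uint32_t *data, int n)
{
    memcpy(c->bank[1 - c->active], data, (size_t)n * sizeof(uint32_t));
}

/* Context switch: the freshly loaded bank becomes the active one. */
void cache_swap(dual_cache *c) { c->active = 1 - c->active; }

/* The math processor reads from the active bank only. */
uint32_t cache_read(const dual_cache *c, int i) { return c->bank[c->active][i]; }
```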
  • the input data transfer mechanism between the data memory and the math processor is a special 64-bit wide, 3-register deep, first-in-first-out buffer to allow for relative data movement flexibility between the math processor and the memory system.
  • Such an arrangement permits the organizational processor to move ahead of the arithmetic operation being executed by the math processor of the user processor and to acquire the next three data items. While the math processor is utilizing these three data items, the organizational processor can be executing sequence control instructions and specific integer operations as required to keep the user processor operating at maximum efficiency.
  • the output data transfer mechanism between the user processor and data memory is by means of a single stage output buffer. Once the results of the programmed arithmetic operation which has been executed by the math processor have been transferred to the output buffer, the user processor continues with further processing while these results are written to main memory storage.
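The 64-bit-wide, 3-register-deep input FIFO described above can be sketched directly: the organizational processor pushes prefetched items at the top while the math processor pops from the bottom. The interface below is an illustrative assumption:

```c
#include <stdint.h>
#include <assert.h>

/* Sketch of the 3-register-deep, 64-bit-wide input FIFO between the
   data memory and the math processor. */
typedef struct {
    uint64_t reg[3];
    int count;
} input_fifo;

/* Organizational processor pushes the next data item at the top;
   returns 0 when all three registers are occupied. */
int fifo_push(input_fifo *f, uint64_t v)
{
    if (f->count == 3) return 0;
    f->reg[f->count++] = v;
    return 1;
}

/* Math processor pops from the bottom of the FIFO. */
uint64_t fifo_pop(input_fifo *f)
{
    uint64_t v = f->reg[0];
    f->reg[0] = f->reg[1];
    f->reg[1] = f->reg[2];
    f->count--;
    return v;
}
```

A three-deep buffer is just enough to let the organizational processor run ahead by the "next three data items" the description mentions.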
  • FIG. 17 illustrates the operation of an instruction sequencer which is utilized by the user processor for producing from its program memory the instruction stream required to control the A and XY processors.
  • the address will be stored in register 900 and, when clocked, the addressing signals are applied to a program translation table 902 which addresses a specific location in the program memory. The same information applied to the program translation table 902 is loaded into an adder 901 for subsequent use if necessary.
  • the program memory 904 contains instruction streams comprising four 32-bit instructions which are stored in a 256 KB program memory accessed 16 bytes at a time.
  • the addressing signals cause a full 128-bit word instruction stream to be read out of the program memory 904, as illustrated by the shaded portion 906 in memory 904.
  • the instruction stream represented by shaded portion 906 is transferred via transmission means 910 into an instruction buffer 912.
  • the instruction buffer is required because each of the transfers of instruction stream is in response to a single clock cycle. Thereafter, and in response to the next clock cycle, the instruction buffer transfers the information it has therein via transmission means 914 into rotating network 916 which preprocesses the instruction stream to place it into a certain format to determine whether any type of arithmetic operation is to be performed and, if so, which arithmetic processor, namely the XY machine or the A machine, is to perform the operation.
  • the first bit of the instruction stream in the rotating network 916 is sensed to determine whether an arithmetic process is required and, if not, that information is passed via connecting means 920 to an adder 923 which stores the same in the shift register 922 to indicate that no arithmetic operation is required. If an arithmetic operation is required, the first bit of the instruction stream contained in rotating network 916 is sampled to determine whether the arithmetic operation is required to be performed by the XY machine or the A machine. If the arithmetic calculation is to be executed by the XY machine, the appropriate instruction appears on output 930.
  • If the instruction designates that an arithmetic operation is to be performed by the A machine, that information is sensed by the X instruction width detecting means 932, which causes the instruction to be transferred via communication means 934 to a left shift register 940 at the same time the information is passed to the XY machine for decoding by the rotating network 916.
  • the XY machine decoding produces an X or Y control signal to cause the X machine segment or the Y machine segment to execute the assigned arithmetic operation.
  • the portion of the instruction which is applicable to the A machine is applied by output 942 to the A machine decoder which decodes that portion of the instruction stream to produce an A control signal which causes the A machine to execute the assigned arithmetic operation.
  • FIG. 18 is a block diagram representing the design of the compiler, which is capable of making all of the parallel expansion decisions at compile time in lieu of complicated preprocessors and postprocessors that divide and distribute the program code among several coprocessors in parallel.
  • This relieves the user processor of the requirement of dividing and distributing the program code for execution and eliminates the need for additional hardware to accomplish the distribution and execution of the program code.
  • the structure of compilers for the user processor utilized in the integrated, multicomputer data processing system described in connection with FIGS. 4, 6 and 8 comprises several front end language processors such as a C compiler front end 730, a FORTRAN 77 compiler front end 732, a VAX FORTRAN compiler front end 734, and other compiler front ends illustrated by rectangles 736 and 738.
  • the outputs of the front end compilers are applied to a global optimizer and user processor code generator 746 which functions as a single back-end common code generator and optimizer.
  • the primary languages preferred for use in the user processor of the present invention are C and FORTRAN 77, both of which are compatible with the UNIX standard.
  • the global optimizer and compute engine code generator produces a compute engine object code, or a user processor object code, represented by arrow 748, which is loaded into a linker/loader system 756.
  • the output of the linker/loader 756 is an executable compute engine object code or an executable user processor object code appearing on lead 760.
  • the output from the linker/loader also appears on a second output 764 which is applied to a screen-oriented symbol debugger 766 for viewing by the programmer.
  • the global optimizer and compute engine code generator 746 uses the following optimizing techniques:
  • the compiler rearranges the normal sequence of program execution in order to make the most efficient use of the user processor.
  • the user processor can simultaneously execute a simple math operation such as an add, a complex math operation such as a multiply, and a fetch/store operation from three memories, two data caches plus main memory. Since the user processor can simultaneously execute a number of programmed arithmetic operations, the optimizer rearranges the code to ideally perform all five operations in each and every cycle, and as many times as possible. Rearranging the code may result in the user processor's calculating parts of a problem out of normal sequence.
  • An example wherein the code is rearranged to calculate parts of a problem out of normal sequence is that instead of calculating an entire mathematical formula on each pass of a 1,000 iteration loop, the compiler may arrange to calculate the first two terms of the expression 1000 times, put the results in a data cache table, go back and calculate the next two items 1000 times, and so on.
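The rearrangement just described, evaluating the first terms of an expression across all iterations before going back for the next terms, amounts to loop fission through an intermediate data-cache table. A sketch in C, using a made-up four-term expression rather than any formula from the patent:

```c
#include <assert.h>

#define N 1000   /* the 1,000-iteration loop from the example */

/* Straightforward form: the entire expression on each pass. */
void whole_loop(const double *a, const double *b,
                const double *c, const double *d, double *out)
{
    for (int i = 0; i < N; i++)
        out[i] = a[i] * b[i] + c[i] * d[i];
}

/* Compiler-rearranged form: calculate the first two terms 1000
   times into a table (standing in for a data cache), then go back
   and calculate the next two terms 1000 times. */
void split_loop(const double *a, const double *b,
                const double *c, const double *d, double *out)
{
    static double table[N];               /* data-cache table */
    for (int i = 0; i < N; i++)
        table[i] = a[i] * b[i];           /* first two terms, 1000 times */
    for (int i = 0; i < N; i++)
        out[i] = table[i] + c[i] * d[i];  /* next two terms, 1000 times */
}
```

Both forms compute the same results; the split form keeps each loop body down to the mix of operations the hardware can overlap in a single cycle.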
  • the end result is that the calculations are performed in a highly efficient manner by the user processor.
  • the compiler functions to arrange the mix of operations to process the programmed arithmetic operations and to perform the maximum number of operations at the same time in a single clock cycle.
  • the compiler optimizes the arrangement such that the mix of operations does not result in the user processor's becoming idle during execution of programmed arithmetic operations or calculations as the result of a poor mix of coded, programmed arithmetic operations.
  • the optimization operations take place in the global optimizer and compute engine code generator 746, which is the second stage of the compile process and enables the programmer to utilize any one of a number of front end languages as illustrated in FIG. 18.
  • the integrated, multicomputer data processing system of the present invention utilizes a modified version of the Berkeley 4.2bsd UNIX software in the preferred embodiment.
  • FIG. 19 illustrates how the modified Berkeley 4.2bsd UNIX software, which was originally designed to operate in a monolithic computer environment, has been converted to a multicomputer, time-sharing system which is capable of handling a plurality of users to perform scientific and engineering compute-intensive virtual applications involving simulation, signal processing, image processing, and other complex math processing.
  • the Berkeley 4.2bsd UNIX software includes support for a C shell Command Interpreter 802, UNIX utilities 804, a wide range of scientific applications as represented by rectangles 806, and is capable of use with the LAN and Datacomm network requests represented by rectangles 814.
  • the C shell Command Interpreter 802, UNIX utilities 804, applications 806, and LAN and Datacomm network requests 814 comprise the application layer of the operating system function.
  • a user, by means of a terminal 808, is able to interface through user interface 810 with either the C shell Command Interpreter 802, the UNIX utilities 804, or the applications 806.
  • the user can also utilize an intelligent work station 816 which, through a user interface 822, can be operatively connected to the LAN and Datacomm network requests 814 in order to perform the desired operating system function.
  • Each of the above-described C shell Command Interpreter 802, UNIX utilities 804, applications 806, and LAN and Datacomm network requests 814 is applied via communication means represented by arrows 830 to the system call interface layer 832, which includes a UNIX system kernel or Kernel Processor 836.
  • the system call interface 832 provides the capability of distributing software tasks among input-output peripherals 846, user processors 848, and additional Multibus & Ethernet systems 852, which represent the hardware level for performing the distributed, parallel processing tasks.
  • the operating system function represented by FIG. 19 is extremely versatile and modular, allowing several user processors and work stations to be attached to the integrated, multicomputer data processing system at the same time and structured such that the users perceive a standard monolithic 4.2bsd UNIX environment for developing and executing their software.
  • the system call interface layer 832 separates the monolithic image of the application layer from the hardware layer where the tasks are distributed and executed in parallel.
  • COMMON/spacel/ n        declare the constant n at location spacel
    COMMON/spacer/ q,r,t    declare the constants q, r, and t, starting at location spacer (4 bytes each)
  • the user processor can process the above using the power of its integrated processor architecture to expand the code out in parallel and execute it through three specialized high speed processors.
  • the FORTRAN compiler for the user processor builds special 96-bit wide machine instructions that simultaneously direct the operations of three processors and several memories, namely, the local, global, instruction cache, data caches, and the like.
  • the machine instructions are formed into two parallel streams: (1) an instruction for the organizational processor; and (2) an instruction for the scientific processor.
  • Each instruction has either an organizational processor part or a math processor part, or both.
  • the FORTRAN example is compiled below as an illustration:

    XY INSTRUCTIONS        A INSTRUCTIONS         EXPLANATION
    Mpyf1 FIFO,MX          PDF spacel+4000[R1]    Multiply (z+10)+k (next item at bottom of FIFO) times MX (r); calculate address & fetch y (goes into top of FIFO)
    Move MR,F2; FIFO,MX    Rdf spacer             Move the results of the previous multiply (MR) ([(z+10)+k]×(r)) to V, move the bottom of the FIFO (t) to MX, and fetch constant q
  • the user processor assembly program sequence has 13 instructions.
  • the number of assembly instructions is not relevant to the processing time in that the real measurement of processing time is in the number of machine cycles used and the cycle times of the machines required to complete the calculation.
  • four control fields are assumed in the instructions identified as: A, B, C, and D. The four control fields are defined as follows:
  • Field A controls the simple and complex math processors and associated registers
  • Field B controls data movements from/to data caches and the FIFO connection between math processors and the organizational processor
  • Field C controls address calculations and registers for the organizational processor, as well as data movements to the FIFO connection;
  • Field D controls reads/writes from/to local or global memory, along with branching operations.
  • the user processor executes the dual instruction stream in parallel using the four control fields as follows:

    CYC   FIELD A   FIELD B   FIELD C   FIELD D   FIFO
  • the user processor assembly program consisted of 13 instructions. However, the user processor required only 13 cycles to execute the 13 instructions because of the overlapping effect of the organizational processor, simple math processor and complex math processor.
  • the basic machine cycle time of the user processor is 143 nanoseconds. In the processor, one pass through the loop takes 1859 nanoseconds, providing an equivalent speed of 538,000 passes per second, or 6.99 MIPS and 2.69 MFLOPS.
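The quoted timing figures are mutually consistent and can be checked with a few lines of arithmetic: 13 instructions at one per 143 ns cycle give 1859 ns per pass, and 13 instructions per pass at roughly 538,000 passes per second give about 6.99 MIPS:

```c
#include <assert.h>

/* Cross-check the cycle arithmetic quoted for the example loop. */
int ns_per_pass(void)        { return 13 * 143; }                /* 1859 ns */
long passes_per_second(void) { return 1000000000L / 1859; }      /* ~538,000 */
long insns_per_second(void)  { return 13 * passes_per_second(); } /* ~6.99 M */
```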
  • The integrated, multicomputer data processing system is well suited for higher mathematics applications such as simulation, signal processing, image processing, and the like.
  • The scientific user or an intelligent compiler must determine how to convert mathematical constructs such as Fast Fourier Transforms (FFTs), convolutions, function evaluations, and numerical integrations into a complex series of add/subtract/shift/multiply/divide instructions, with other operations such as trig functions, square roots, and the like processed by elaborate floating point subroutines.
  • Each pass would require four multiply operations and six additions (comprising adding the two numbers in each of two sets of parentheses, plus the cumulative summing of the x' and y' results at the end of each pass).
  • The user processor has tables of single precision floating point trigonometric values in ROM, each 4K entries long. A request for a sine or cosine value is thus a simple fetch from fast cache which can be performed without any wait cycles.
  • The user processor performs simultaneous adds and multiplies via the simple math processor and the complex math processor. Thus, the six adds and four multiplies take only six cycles, for a total of approximately 14 cycles per pass.
  • The integrated, multicomputer data processing system architecture, operating system, and compiler operations are designed for a mixture of vector and nonvector operations.
  • The user processor thus operates as a high-speed processor because the compiler can rearrange the processing task to capitalize on the parallel opportunities in the function being calculated, without any degradation in performance due to the mix of operations.
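The overlap and rate figures listed above can be checked with a short calculation. The following is a minimal sketch, not the patented hardware: it assumes one add per cycle on the simple math processor and one multiply per cycle on the complex math processor, issued in parallel, so a pass needs only the larger of the two operation counts; the helper names are invented for illustration.

```python
# Minimal sketch of the overlapped add/multiply timing described above.
# Assumption (not from the patent text): one add per cycle on the simple
# math processor and one multiply per cycle on the complex math processor,
# issued in parallel, so a pass needs max(adds, multiplies) cycles.

CYCLE_NS = 143          # basic machine cycle time of the user processor

def pass_cycles(adds, multiplies):
    """Cycles for one pass when adds and multiplies issue in parallel."""
    return max(adds, multiplies)

# Rotation pass from the text: four multiplies and six additions.
rot = pass_cycles(adds=6, multiplies=4)               # 6 cycles

# Inner-loop example from the text: 13 instructions, one per cycle.
loop_ns = 13 * CYCLE_NS                               # 1859 ns per pass
passes_per_sec = round(1e9 / loop_ns / 1000) * 1000   # ~538,000
mips = 13 * passes_per_sec / 1e6                      # ~6.99 MIPS
mflops = 5 * passes_per_sec / 1e6                     # ~2.69 MFLOPS (5 flops/pass)

print(rot, loop_ns, passes_per_sec, round(mips, 2), round(mflops, 2))
# 6 1859 538000 6.99 2.69
```

The arithmetic reproduces the figures quoted above: 13 cycles at 143 nanoseconds is 1859 nanoseconds per pass, and the rotation pass completes in six overlapped cycles rather than ten serial ones.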

Abstract

An integrated, multicomputer data processing system having a control section (108) including system input and output (102) and at least one user processor section (200) having a math processor (252) is shown. The control section includes an operating system processor (140) for scheduling and assigning processing tasks and for controlling the transfer of data and instruction addresses to each user processor section (200). A system controller (154), operatively connected to the operating system processor (140), receives the command signals and assigned tasks and applies the same to a user processor section (200). The data processing system further includes a system memory bus (132) for data transfer and a staging bus (136) for instruction and bidirectional data transfer, which are connected to the system controller (154) and to each user processor section (200).

Description

INTEGRATED, MULTICOMPUTER DATA PROCESSING SYSTEM
BACKGROUND OF THE INVENTION
1. Field of the Invention
This invention relates generally to a data processing system having a plurality of user processors for performing parallel numerical operations assigned to the user processors and more specifically to an integrated, multicomputer data processing system having a user processor section wherein the user processor includes an organizational processor and math processor capable of operating in parallel for performing numerical operations by executing specific portions of an instruction stream in parallel.
2. Description of the Prior Art
The use of an array processor for performing calculations utilizing data from two distinct two-dimensional arrays, wherein the array processor incorporates multiplication, addition, subtraction, transfer and storage means whereby the arithmetic and storage operations and the two-dimensional array axis operations can be connected in parallel, is disclosed in U.S. Patent No. 4,287,566, issued to one of the inventors herein, Glen J. Culler. Other known array processors operate in a manner similar to that described in U.S. Patent No. 4,287,566.
It is also known in the art to attach an array processor to a so-called "super mini" computer for use in complex scientific and engineering calculations. Typical of such a combination is a DEC Model VAX 780 computer as the host computer having a Floating Point Systems Inc. Model AP 164 array processor.
In the stored program digital computer systems known in the art, the basic process of executing instructions has not changed significantly.
Certain of these approaches are described in FIGS. 1, 2 and 3, which are identified as "prior art" in the drawings. FIG. 1 discloses a multicomputer system which increases the processing speed of a computer by use of a plurality of independent computers operating in parallel. The multicomputer system of FIG. 1 is capable of handling multiple tasks by use of multiple processor/memory combinations which perform parallel task execution. Such a multicomputer system is generally referred to as a "distributed" system because the system operates on the principle that separate tasks, illustrated by rectangles 30 in FIG. 1, are distributed to separate computing systems through memories, illustrated by rectangles 34, which are operatively coupled to a plurality of parallel CPU's designated by rectangles 36. In such a multicomputer system, each of the individual CPU's 36, which typically are individual microcomputers, operates in the range of 1.5 million instructions per second ("MIPS") to 2 MIPS of integer performance, but has limited floating point performance levels. In order to render the multicomputer system of FIG. 1 operative, efficient multicomputer systems are required which maintain close coupling mechanisms. The multicomputer system of FIG. 1 requires that the operating system design be inherently capable of managing multiple task processors efficiently. Typically, the multicomputer systems illustrated in FIG. 1 are utilized in low-cost applications having a large volume of low-performance tasks to be performed such that the tasks can be distributed among several microcomputers. Typically, such low-performance tasks include business and personal computer applications. Multicomputer systems of the type illustrated in FIG. 1 are limited with respect to high-performance and high math applications such as, for example, bound scientific applications including simulations, signal processing, image processing, and the like.
Another known approach to increase processing speed is the use of a coprocessor system, which is illustrated in FIG. 2, captioned "Prior Art." The coprocessor of FIG. 2 is capable of handling single tasks by use of pre/post processors, based upon parallel, multi-CPU instruction execution. Another known approach for increasing the processing speed of large-scale digital computers is generally referred to as a pipelined computer system, such as that illustrated in FIG. 3, labeled "Prior Art." In the pipelined computer system, the computer system relies on certain known parameters such as, for example, that certain types of computer operations are deterministic, that is, predictable to a high degree.
It is known in the art for supermini computers to use a combination of distributed multicomputers, portions of the cache-supported pipelined approach, and sophisticated high-speed logic circuitry to attain high operating speeds. In such supermini computers, the central processor interacts with an interleaved memory to obtain the maximum of overlap in hardware operations to reduce overall operating time. The interleaved memory, cache-supported, overlapped fetch mechanisms can add substantially to serial execution speeds, particularly in nondeterministic applications. In this architecture, the cache is a major contributor to the speed, since it acts as a super high-speed "scratchpad" memory for the computer, that is, as a place for the parallel prefetch hardware, which races ahead to fetch data from memory, to store data, as well as a place for the processor to quickly find the data when it is time to start the next instruction. A cache memory may be cycling at 50 nanoseconds, compared to regular memory which may be cycling at 400 nanoseconds, such that considerable time can be saved. However, when a branch instruction or subroutine is required, the memory cache must be emptied and reloaded with instructions starting at the new branch location.
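The contribution of the cache described above can be made concrete using the cycle times quoted in the text (50 nanoseconds for cache, 400 nanoseconds for regular memory). The hit ratio in the sketch below is an illustrative assumption, not a figure from the text.

```python
# Effective memory access time with a cache, using the cycle times quoted
# above: 50 ns for the cache and 400 ns for regular memory. The 90% hit
# ratio is an illustrative assumption, not a figure from the text.

CACHE_NS = 50
MEMORY_NS = 400

def effective_access_ns(hit_ratio):
    """Average access time when hits are served by the cache."""
    return hit_ratio * CACHE_NS + (1.0 - hit_ratio) * MEMORY_NS

print(round(effective_access_ns(0.9), 1))   # 85.0 -- most of the 8x gap recovered
print(round(effective_access_ns(0.0), 1))   # 400.0 -- e.g. just after a branch
                                            # empties and reloads the cache
```

This also illustrates the branch penalty noted above: until the cache refills after a branch, the hit ratio collapses and accesses run at full memory speed.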
One well known, high-speed scientific digital computer which utilizes parallel processing in combination with a preprocessing stage and instruction reservation system is described in U.S. Patent 3,346,851. U.S. Patent No. 3,895,357 discloses the use of a buffer memory arrangement for a digital television display system.
SUMMARY OF THE PRESENT INVENTION
This invention relates to a new, novel and unique modular multicomputer system which utilizes a computer architecture which is based upon a modular multicomputer design in which a plurality of integrated user processors operate in parallel from instructions which form part of an instruction stream under scheduling provided by an operating system processor and a system controller within the integrated, multicomputer data processing system.
The integrated, multicomputer data processing system of the present invention has a control section which includes means for a system input and output and user processor sections having a math processor and an organizational processor which loads and unloads data and instructions enabling the math processor to operate at maximum efficiency.
The user processor section includes means responsive to the addressing signals for fetching instructions from a loaded instruction cache in the user processor section and for enabling the math processor to execute programmed arithmetic operations on the loaded data signals in response to command signals and tasks scheduled for the user processor by the operating system processor via the system controller. In the prior art computer systems described above, the computer architecture is based upon the type of operations to be performed by the computer, be they determinative or nondeterminative operations. In the prior art computer systems, high-speed, special purpose computers are formed by a combination of scientific computers with special purpose, high-speed array processors to handle the determinative calculations which are generally required for bound applications. However, the disadvantage of such systems is that if the instruction mix is such that nondeterminative or nonvector processing is required, the average speed of the processors is reduced because the special purpose, high-speed digital computers cannot operate efficiently. The known prior art systems utilize inherently serial machines which utilize the von Neumann serial instruction sequence. Improvements in operating speeds are obtained by various means, including use of cache memories, pipelined computer systems, and methods for increasing computer speed by performing a plurality of functions in parallel.
In typical high-performance scientific applications, the processing problems comprise a blend of integer and array processing requirements. If a large amount of array processing content is contained within the scientific application, supermini computers are used. However, they have slow floating point capability and, as such, are not able to perform the desired array processing at high speeds. Thus, the use of an array processor as a special purpose, high-speed digital computer is required in combination with a scientific computer to obtain the desired speeds. This results in a higher price-to-performance ratio. Usually, computer systems having the lowest price-to-performance ratio are desired for scientific applications.
In the known prior art computer systems which utilize the pipelined computer system concept, due to the mix of vector and nonvector processing requirements, programming a supermini computer to perform the various vector/nonvector calculations is a time-consuming task which reduces the desirability of utilizing such a system. A further problem in utilizing a supermini computer using the pipelined computer concept is that higher level languages such as, for example, FORTRAN 77 or C, cannot be used and it is necessary to write the computer program in a language which is compatible with the intricacies of machine language.
The general overall solution to increasing computer speed, where there is a mix of determinative and nondeterminative calculations, is to utilize special purpose computers which have a high-speed array processing capability together with scientific processing and to combine those features with the use of instruction cache memories, pipeline design, or the like. Such a combination yields improved performance at the expense of additional, high-priced hardware, which has an average operating speed well under the high-speed operation capabilities of the special purpose digital computers.
BRIEF DESCRIPTION OF THE DRAWINGS
These and other advantages of the present invention, together with its various features, can be more easily understood based upon a more detailed description of the preferred embodiment taken in conjunction with the following drawings:
FIG. 1 is a prior art logic diagram which represents a multicomputer system which is capable of performing multiple tasks by means of a multiple processor/memory combination and parallel task execution;
FIG. 2 is a prior art logic diagram which represents a coprocessor computer system of the prior art which is capable of performing a single task using pre-post processing parallel multiple central processing units for executing the instructions;
Fig. 3 is a prior art logic diagram of a prior art pipelined computer system which is capable of performing a single task with multistage execution utilizing a single processing unit and serial execution;
Fig. 4 is a diagrammatic representation of an integrated, multicomputer data processing system which includes one user processor section having a math processor and an organizational processor;
FIG. 5 is a diagram representing an integrated, multicomputer data processing system having N user processor sections;
FIG. 6 is a block diagram of a typical user processor showing the various components thereof;
FIG. 7 is a diagrammatic representation of a multiple task, multiple processor/memory combination which demonstrates the capability of the integrated, multicomputer data processing system of the present invention interfaced with a plurality of work stations;
FIG. 8 is a logic diagram showing the functional relationships between the user processor, system memory bus, and staging address bus for transferring instructions and data in parallel into the user processor;
FIG. 9 is a logic diagram illustrating the relationship between the operating system processor, which is operatively coupled to an input/output data transfer means, and a system controller;
FIG. 10 is a logic diagram illustrating the relationship between the system controller, the system memory bus, and the system addressing bus;
FIG. 11 is a logic diagram illustrating the various components of an XY machine which functions as the math processor and an A machine which functions as the organizational processor, each of which is capable of performing arithmetic operations;
FIG. 12 is a block logic diagram illustrating the operation of the A machine in performing arithmetic calculations in the user processor;
FIG. 13 is a block diagram showing the relationship between the various components of the XY machine, showing specifically the shared arithmetic components between the X machine segment and the Y machine segment thereof;
FIGS. 14(A) and 14(B) are block diagrams illustrating the arithmetic registers for the X input and Y input of the XY machine, respectively;
FIG. 15 is a block logic diagram illustrating the X memory address controller;
FIG. 16 is a block logic diagram illustrating the data path for the XY random access memory;
FIG. 17 is a block diagram illustrating the operation of the instruction sequencer which controls the decoding of instructions to provide control signals to the XY machine and the A machine;
FIG. 18 is a diagrammatic representation of a compiler illustrating a variety of front end language processors applied through a compiler interface to a back end user processor code generator which generates executable application object codes for executing an optimum number of calculations of the mix of operations in the user processor; and
FIG. 19 is a block diagram illustrating the operating system functions of the integrated, multicomputer data processing system wherein a wide variety of standard input-output devices at the applications layer are interfaced to the system interface layer and the tasks are distributed to a hardware layer and executed in parallel.
DESCRIPTION OF THE PREFERRED EMBODIMENT
FIGS. 1, 2 and 3 relate to the prior art computer system architectures and means for executing instructions and have been discussed hereinbefore in the Description of the Prior Art section.
FIG. 4 diagrammatically represents the integrated, multicomputer data processing system of the present invention having a global memory section 100, a high-speed input-output section 102 which is operatively connected to a fast disk 104, and a control section shown generally by dashed lines 108. The control section includes an input/output data transfer means 110 for system input and output. The input/output data transfer means 110 is adapted to network or interact with a wide variety of state-of-the-art peripheral devices such as, for example, a system disk 114, a terminal 116, magnetic tape 118, a network of work stations 120 and 122 which are operatively connected through a networking system such as ETHERNET 126 to the input/output data transfer means 110, and data communications 128 which likewise are connected to the input/output data transfer means 110.
The control section 108 includes a system memory bus 132 and staging bus 136, each of which is capable of transferring 64 bits of data and 32 bits of address.
The control section 108 further includes an operating system processor 140 which is operatively connected to the input/output data transfer means 110 for input-output operations by a connecting means illustrated by arrow 142. Also, an operating system processor random access memory 146 is operatively connected to the operating system processor 140 through a connecting means illustrated by arrow 148. Also, the random access memory 146 is operatively connected to the input/output data transfer means through a connecting means, as illustrated by arrow 150.
The control section 108 further includes a system controller 154 which operates under control of the operating system processor 140. The system controller 154 is operatively connected to the input/output data transfer means by a communication means which is represented by arrow 158. The operating system processor 140 controls the operation of the system controller 154 by means of execution signals and interrupt signals which are applied from the operating system processor 140 and connecting means 142 through the input/output data transfer means 110 and communication means 158 to the system controller 154. The system controller 154 is operatively coupled via connecting means illustrated by arrow 162 to the system memory bus 132 and is connected via a communication means illustrated by arrow 164 to the staging bus 136.
The integrated, multicomputer data processing system is a modular system and includes at least one user processor section illustrated by the dashed box 200 which is operatively connected between the system memory bus 132 and the staging bus 136. The user processor section 200 includes an organizational processor 202 having a random access local memory 210 and a math processing section 218. The random access local memory 210 is operatively connected to the organizational processor 202 through a connecting means evidenced by arrow 212, and a multistage, first-in, first-out (FIFO) buffer register 214. The FIFO buffer register 214 is connected both to the organizational processor 202 and to the math processor section 218 by a connecting means 216. In addition, the math processor section 218 is connected to the local memory 210 through a single-stage output register 220 and connecting means 222. In addition, the local memory 210 is operatively connected to the system memory bus 132 by a connecting means as evidenced by arrow 218. The organizational processor 202 is connected by a connecting means 226 to the system controller 154. The user processor 200 includes a program memory 228 which is operatively coupled to the organizational processor 202 by a connecting means shown by arrow 230. The program memory (sometimes referred to as the "P Memory") 228 is also operatively connected to the staging bus 136 via a connecting means represented by arrow 232. The staging bus 136 loads the data caches 246 and 248 via connecting means 232 and 206 respectively.
The user processor's math processing section 218 includes two sets of dual data cache memories 246 and 248. In addition, the math processing section 240 includes a math processor 252 having a plurality of connecting means, shown generally as 260, which interconnects each of the two sets of dual data caches 246 and 248 to the math processor 252 such that data can be transferred therebetween. The data caches 246 of the math processing section 218 and the P Memory 228 are likewise connected to the staging bus 136 by connecting means illustrated by arrow 232.
Data transfer from the local memory 210 to the organizational processor 202 occurs over connecting means 232 and an input buffering means such as the FIFO buffering register 214. Data transfer from the math processor section 218 occurs through the output means 220, which in the preferred embodiment is in the form of a single stage, clocked output register which is capable of transferring numeric results from the math processing section 218 to the output register 220 in one clock cycle. The output register 220 transfers the same to the local memory 210 for further processing. The communication between the local memory 210 and the organizational processor 202 is in the form of an asynchronous transfer of data and, in the preferred embodiment, utilizes a three-stage, first-in-first-out buffer register as an input buffering means 214. In addition, the math processing section 218 is operatively connected to the P Memory 228 by connecting means illustrated by arrow 276. The organizational processor 202 has direct communication to the math processor section 218 over communication means 216 which permits data transfer therebetween without going through a register. In operation, the organizational processor functions to keep a number of instructions ready for the math processor such that an instruction can be immediately loaded into the math processor together with the applicable data. The organizational processor loads the instructions and data as the result of the arithmetic operation just completed by the math processor is loaded into the local memory.
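The asynchronous, three-stage FIFO coupling described above can be sketched as a bounded queue between the two processors. This is a behavioral sketch in software only, not the register-level hardware design; the stage count of three is taken from the text, while the class and method names are invented for illustration.

```python
from collections import deque

# Behavioral sketch of the three-stage, first-in-first-out buffer register
# (214 in the text) that decouples the organizational processor (which
# pushes operands) from the math processor (which pops them). A bounded
# deque stands in for the hardware stages.

class StageFifo:
    def __init__(self, stages=3):
        self.stages = stages
        self.q = deque()

    def push(self, item):
        """Organizational-processor side: refuses (returns False) when full."""
        if len(self.q) == self.stages:
            return False            # producer must wait a cycle
        self.q.append(item)
        return True

    def pop(self):
        """Math-processor side: returns None when the FIFO is empty."""
        return self.q.popleft() if self.q else None

fifo = StageFifo()
accepted = [fifo.push(x) for x in (1.0, 2.0, 3.0, 4.0)]
print(accepted)                     # [True, True, True, False]
print(fifo.pop(), fifo.pop())       # operands leave in arrival order: 1.0 2.0
```

The bounded depth is the point of the design: the organizational processor can run ahead of the math processor by up to three operands, so neither side waits on the other unless the gap between them exceeds the buffer.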
The operating system processor 140 controls operation of the integrated, multicomputer data processing system by means of the input/output data transfer means 110 which interfaces with the input-output devices 114 through 128, and which applies interrupt and execution signals to the system controller 154 to control overall system operation. The system controller in turn, controls the organizational processor 202 via communication means 226. The system controller 154 under control of the operating system processor 140 functions to load data from the system memory bus 132 into the user processor 200. The staging bus 136 loads the user processor 200 with data and instructions concerning the programmed arithmetic operations. The user processor 200 is able to communicate over the system memory bus 132 with the high-speed input-output processor 102 by a connecting means shown generally by arrow 280. Similarly, the user processor 200 can communicate with the global memory 100 by means of the system memory bus 132 and a connecting means illustrated by arrow 282.
As is evidenced from the pictorial representation in FIG. 4 , the integrated, multicomputer data processing system is modular in design and can have additional user processor sections added thereon to increase both MIPS and MFLOP operating rates of the data processing system. In FIG. 4, the operating system processor 140, including its random access memory 146, operates to schedule and assign tasks and command signals through the system controller 154 over the system memory bus 132 and the staging bus 136 to each user processor. Information is available to the operating system processor from any of the peripheral devices 114, 116, 118, 120, 122, 126 or 128 by means of the input/output data transfer means 110. The input/output data transfer means also is capable of loading information directly into or receiving information from the random access memory 146 and from the operating system processor 140. The operating system processor 140 via the connecting means 142, input/output data transfer means 110, and connecting means 158 schedules and assigns tasks to each user processor independently through the system controller 154. The operating system processor 140 generates command signals in the form of interrupt signals and execution signals for controlling system operations and also, through the system controller 154, controls operation of the high-speed input-output processor 102, including the fast disk 104 and the global memory 100.
FIG. 5 illustrates an integrated, multicomputer data processing system which includes the same components illustrated in FIG. 4 and which are illustrated by the same numbers utilized in FIG. 4, together with the first user processor section 200, a second user processor section 302, and an N user processor 304. In the preferred embodiment, the integrated, multicomputer data processing system is capable of operating with up to four user processors based upon the design set forth herein. However, it is envisioned that the operating system processor, system controller, system memory bus, and staging bus could be designed and programmed with appropriate compilers and the like in order to handle more than four user processors in order to increase the MIPS and MFLOP rates of processing data. The operating system processor 140 controls operation of each of the user processors 200, 302 and 304 as illustrated in FIG. 5.
In the alternative, the configuration of FIG. 4 could be duplicated, for example with four user processors in a group, and then arranged with each group in a cluster configuration having a cluster operating system processor to schedule and assign tasks to each group of four user processors, all of which operate independently.
FIG. 6 illustrates in greater detail the method for loading both data signals and addressing signals into the user processor. FIG. 6 illustrates that the system memory bus 132 also includes a system address bus shown as 132' and that the staging data bus 136 includes a staging address bus 136'. FIG. 6 illustrates the major components and connecting means with the same numeral designations as those of FIG. 4 and further includes connecting means between the organizational processor 202 and the system address bus 132' by means of connecting means illustrated by arrow 204'. Also, the organizational processor 202 is operatively connected to the staging address bus 136' by connecting means shown as arrow 206'. Likewise, the local memory 210 is operatively connected by a connecting means shown by arrow 218' to the system address bus 132'. The math processing section 240 is likewise connected to the staging address bus 136' by a connecting means shown as arrow 266'.
As illustrated in FIG. 6, the transfer of data signals and address signals is performed by means of the system memory bus 132, the system address bus 132', the staging data bus 136 and the staging address bus 136'.
The system controller 154 is operatively coupled to the operating system processor 140 for receiving command signals which command the system controller 154 to transfer data signals over the staging data bus 136 and to transfer address signals over the staging address bus 136' to the user processor 200 to preload the user processor for the next operation while the user processor is executing a program. The system controller concurrently transfers data signals over the system memory bus 132 and addressing signals over the system address bus 132'.
In the preferred embodiment, the operating system processor 140 is generally referred to as a Kernel Processor. In one application, the operating system processor comprises a Sun 2 68010 processor/memory board with two megabytes of memory and an attached input/output data transfer means for system I/O. The Kernel Processor runs the Berkeley version 4.2bsd UNIX operating system. When the operating system processor 140 desires to initiate a task in another processor, the operating system processor 140 fills out a task control block specifying the task, its size in bytes, its starting address in memory, together with additional information. The operating system processor 140 then commands the system controller 154 to load the task into the user processor, such as user processor 200, followed by a later order to the user processor 200 to begin executing the task.
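The task control block described above can be pictured as a small record followed by a two-step dispatch. The field names and the dispatch helpers below are inferred from the sentence naming the task, its size in bytes, and its starting address; they are illustrative assumptions, not the actual data layout used by the Kernel Processor.

```python
from dataclasses import dataclass

# Illustrative task control block with fields inferred from the text
# (the task, its size in bytes, its starting address in memory). Field
# names and the load-then-execute dispatch sequence are assumptions.

@dataclass
class TaskControlBlock:
    task_id: int
    size_bytes: int
    start_address: int

def dispatch(tcb, load, execute):
    """Kernel-Processor side: order a load, then a later order to execute."""
    load(tcb)                  # system controller stages the task
    execute(tcb.task_id)       # user processor begins executing it

log = []
tcb = TaskControlBlock(task_id=7, size_bytes=4096, start_address=0x1000)
dispatch(tcb,
         load=lambda t: log.append(("load", t.task_id)),
         execute=lambda tid: log.append(("run", tid)))
print(log)   # [('load', 7), ('run', 7)]
```

The two-phase ordering mirrors the text: the load command and the later order to begin execution are separate events, which is what allows loading of one task to overlap execution of another.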
The system controller 154 is a special block transfer device which loads the user processors 200, 302 and 304 (FIG. 5) with tasks to be executed based upon commands from the operating system processor 140. In essence, the system controller 154 operates directly under control of the operating system processor 140. The system controller 154 performs two specific functions, namely: (1) transfer of data between a user processor and common data memory; and (2) support of input/output transfers between the input/output devices of the input/output data transfer means and common data memory. In the embodiment illustrated in FIGS. 4 and 6, the user processor 200 has a large program cache 228 which has a capacity in the order of 256 KB. Also, the user processor 200 includes duplicated context switching facilities which permit the user processor 200 to execute one task while the system controller 154 is removing the last executed task from the user processor or is loading the next application task into the user processor. Thus, the system controller 154 performs the function of saving the information in common data memory and later fetching the same from storage on disk. The system controller 154 functions to keep the user processor 200 operating at maximum efficiency by minimizing the waiting time experienced by the user processor 200 in receiving a new task or transferring data during the various staging processes. Also, the system controller 154 functions to enable the user processor 200, or any other user processors operatively connected to the system, such as user processors 302 and 304 in FIG. 5, to operate independently at maximum efficiency.
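The duplicated context-switching facilities described above amount to double buffering of tasks: while the user processor executes from one context, the system controller stages the next task into the other, so staging time hides under execution time. The timing model below is a minimal sketch with invented, illustrative task counts and durations, not measured figures.

```python
# Minimal sketch of the benefit of duplicated context-switching facilities:
# staging the next task overlaps with executing the current one, so total
# time is driven by the longer of the two phases, not their sum. The task
# count and per-task times are illustrative assumptions.

def total_time(n_tasks, exec_time, stage_time, overlapped):
    """Total time units to stage and execute n_tasks."""
    if not overlapped:
        # single context: every task waits for its own staging
        return n_tasks * (stage_time + exec_time)
    # dual contexts: only the first staging is exposed; afterwards each
    # staging proceeds in the spare context while the previous task runs
    return stage_time + n_tasks * max(exec_time, stage_time)

print(total_time(10, exec_time=8, stage_time=5, overlapped=False))  # 130
print(total_time(10, exec_time=8, stage_time=5, overlapped=True))   # 85
```

With staging fully hidden, the user processor approaches the maximum efficiency the text describes: it is idle only for the very first load.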
As illustrated in FIGS. 4 and 6, the user processor section 200 includes a random access memory 210 and a buffering means 214, in the form of a first-in-first-out buffer register, for enabling programmed, asynchronous transfer of numerical results between the organizational processor 202 of the user processor section 200 and the random access memory means 210. The local memory applies information to the first-in-first-out buffer register 214 and receives information from the output register 220. The organizational processor 202 includes means which are responsive to an instruction stream read out from the instruction cache 228 in response to instruction addresses. A certain portion of the instruction stream relates to the math processor 252, and the applicable portion of the instruction stream is decoded to produce control signals which are used to control machine segments having shared elements of the math processor to execute selected mathematic operations on the data signals transferred to the user processor section 200.
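The decode step described above, in which portions of one wide instruction are turned into control signals for different machine segments, can be sketched with simple bit slicing. The text names only the four control fields A through D; the field widths, packing order, and helper names below are invented for illustration and do not reflect the actual instruction encoding.

```python
# Illustrative decode of one wide instruction word into the four control
# fields named earlier (A: math processors, B: cache/FIFO data moves,
# C: address calculation, D: memory access and branching). The 8-bit
# field widths and their packing order are invented assumptions.

FIELD_BITS = 8
FIELDS = ("A", "B", "C", "D")

def decode(word):
    """Slice a packed instruction word into its four control fields."""
    out = {}
    for i, name in enumerate(reversed(FIELDS)):   # field D in the low bits
        out[name] = (word >> (i * FIELD_BITS)) & ((1 << FIELD_BITS) - 1)
    return out

word = (0x12 << 24) | (0x34 << 16) | (0x56 << 8) | 0x78
print({k: hex(v) for k, v in decode(word).items()})
# {'D': '0x78', 'C': '0x56', 'B': '0x34', 'A': '0x12'}
```

Each decoded field would then steer its own unit in the same cycle, which is how one instruction can carry both an organizational processor part and a math processor part, as stated at the start of this section.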
FIG. 7 illustrates diagrammatically the architectural structure of the integrated, multicomputer data processing system of the present invention for performing multiple tasks utilizing multiple processor/memory combinations. The task source may be initiated in a work station computer, with the work station task being identified by rectangles 300 and 302 illustrating "n" work stations. The work station task 300 is applied to a memory 304 while work station tasks 302 are applied to "n" number of memories 306. The memory 304 applies instructions to the work station processor 310 while the other "n" memories 306 likewise apply instructions to "n" processors designated at 312. The outputs of the work station processors 310 and 312 are applied via an input/output data transfer means, such as MULTIBUS or ETHERNET for a local area network, as illustrated in FIG. 4, to the operating system processor 140. The operating system processor 140 receives a task such as, for example, a task assigned from the work station processor 310, which is designated by rectangle 320. The task is then stored in the operating system computer memory 146 which, in turn, applies the instructions to the operating system processor 140, which ultimately applies the same to a data memory. The operating system processor 140 then applies appropriate command signals, together with the appropriate instructions and data, via the system controller 154 to the user processor section. The application task being assigned to the user processor is designated by rectangle 342 for a single task and by rectangles 344 for application tasks being assigned to other user processors. The application task 342 assigned to the user processor is stored in the user processor memory, such as memory 210. Likewise, the other user processors apply their tasks to their associated "n" memories designated by memories 348.
The memories 210 then apply the instructions to the organizational processor 202 and to the instruction cache which is represented as a processing and instruction cache by rectangle 350 in FIG. 7. In the alternative, the instruction cache included in rectangles 350 and 352 can be loaded from the staging bus. The processing and instruction caches for the other "n" user processors are designated by rectangles 352.
An instruction cache and processing section, shown generally by rectangle 350, performs the function of receiving an instruction stream and processing the same such that certain portions of the instruction stream relating to the mathematical processor are decoded to produce a control signal which is applied to the appropriate segment of the math processor 252. Similarly, certain portions of the instruction stream are decoded to produce a control signal which is applied to the organizational processor 202, which is capable of performing simple arithmetic calculations. The math processor 252 and the organizational processor 202 execute the programmed operations on the data and transfer the results therefrom to the local memory 210 through an output register such as the single stage output register 272 in FIG. 1. The organizational processor 202 also ensures that sufficient instructions and data are available for loading into the math processor upon completion of a performed arithmetic operation to keep the math processor operating at maximum efficiency. Thus, the user processor, as a separate processor application section, can be viewed as a single-task, parallel-execution computer system.
FIG. 8 is a logical block diagram representing the architecture of the user processor 200 illustrated in FIGS. 4 and 6 and of the additional user processors 302 and 304 as illustrated in FIG. 5. In FIG. 8, the logic diagram illustrates an integrated, multicomputer data processing system having two identical user processor sections shown as 540.
Transfer of information into the integrated, multicomputer data processing system from the outside world is accomplished through peripheral devices which are applied to an input/output data transfer means 500 which, in turn, transfers the information between the input-output devices and the operating system processor 502. The operating system processor 502 communicates over the input/output data transfer means 500 with the system controller shown by rectangle 506. The system controller is operatively connected to a system memory bus 510 and a system address bus 512. The system memory bus is capable of communicating 64 bits of data and 32 bits of address within the data processing system. The system controller 506 is operatively connected to the staging data bus 514 and to a staging address bus 516. The staging data bus transfers 64-bit data and 32-bit addresses within the data processing system to specific user processors.
The system memory bus 510 and the system address bus 512 are operatively connected to a global memory 522 by means of a system memory bus segment 524 and a system address bus segment 526. Local data memories 548 are likewise connected to the system data bus 510 via bus segment 536 and to the system address bus 512 via bus segment 558. In a similar manner, the controller and high-speed interface 528 communicates with the high-speed input-output device over a bus 534. The controller and high-speed interface 528 likewise is operatively connected to the system memory bus 510 via a system memory bus segment 530 and to the system address bus 512 by system address bus segment 532. The controller and high-speed interface 528 is connected to the staging data bus 514 by means of staging data bus segment 518 and to the staging address bus 516 by staging address bus segment 520. The controller and high-speed interface 528 is adapted to apply data to and receive data from each of the system memory bus 510, system address bus 512, staging data bus 514, and staging address bus 516.
The logic diagram of FIG. 8, having two identical user processors 540, illustrates that each user processor 540 includes a math processor 544, generally referred to as the "XY machine", an organizational processor 546, generally referred to as the "A machine", and a local memory 548. Also, each user processor includes an instruction cache or program memory 550. The system memory bus 510 is operatively connected to the XY machine 544, to the A machine 546 and to the local memory 548 by means of a local data bus segment 556. System address bus 512 is operatively connected to the organizational processor 546 and to the local memory 548 by means of a local address bus segment 558.
The instruction cache 550 is operatively connected to the staging data bus 514 via staging bus segment 560 and to the staging address bus 516 via a staging address bus segment 556. The staging data bus 514 is operatively connected to the XY machine 544 by means of staging data bus segment 562. In a similar manner, the XY machine 544 is operatively connected to the staging address bus 516 via staging address bus segment 564. The instruction cache 550 is operatively connected to the XY machine via a bus 570 and to the organizational processor, or the A machine, by bus segments 522. The logic block diagram of FIG. 8 clearly illustrates the flexibility and capability of the operating system processor 502 to control the system controller 506, enabling the system controller 506 to load and/or remove data from each of the user processors while the user processor is executing an arithmetic operation. The system controller 506, under control of the operating system processor 502, transfers data into and out of the user processors 540 by means of the system memory bus 510 and the staging data bus 514. The transfer of data can be in parallel, as evidenced by each bus segment over the system data bus to global memory.
FIG. 9 is a logic block diagram which illustrates the process utilized by the integrated, multicomputer data processing system for preloading the user processor with data signals and addressing signals to ensure that the user processor is continuously loaded with assigned tasks in the form of programs and data, and for transferring executed arithmetic operation results from the user processor to memory and then to the input/output data transfer means and the I/O devices.
As illustrated in FIG. 9, the operating system processor 502 is operatively connected to the input/output data transfer means 500 to transfer information between the input-output devices and the operating system processor 502. The operating system processor 502 is operatively connected to the system controller 506 and applies program signals, interrupt signals, and execution signals to the system controller 506. When the operating system processor 502 desires to initiate a task in the form of a programmed arithmetic operation in a user processor, the operating system processor 502 completes a task control block specifying the task or arithmetic operation to be performed, the size of the task in bytes, and the starting address in memory, together with additional control information. When the task information has been assembled by the operating system processor 502, the operating system processor 502 generates an interrupt signal which is applied to the system controller 506 by the input/output data transfer means 500. The system controller 506 receives the task control block of information from the operating system processor 502 and loads the information contained within the task control block such that the data and instruction addresses required by the system memories are loaded over the system memory bus 132 to the user processor, and the staging information, in terms of the programmed arithmetic operation and the data signals and addressing signals therefor, is loaded into the user processor over the staging bus 136.
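The task control block handoff described above can be sketched as a small data structure plus a dispatch step. The field and function names are illustrative assumptions; the patent specifies only that the block carries the task, its size in bytes, its starting address, and additional control information.

```python
# A minimal sketch of the task control block assembled by the operating
# system processor, as described above. Field names are assumptions for
# illustration; the patent does not give a concrete layout.

from dataclasses import dataclass, field

@dataclass
class TaskControlBlock:
    task_id: int          # the task or arithmetic operation to perform
    size_bytes: int       # size of the task in bytes
    start_address: int    # starting address in memory
    control: dict = field(default_factory=dict)  # additional control info

def dispatch(tcb, interrupt_queue):
    """OS processor raises an interrupt; the system controller reads the TCB."""
    interrupt_queue.append(tcb)      # interrupt delivered over the I/O bus
    staged = interrupt_queue.pop(0)  # system controller services the interrupt
    # Data/instruction addresses go out over the system memory bus; the
    # program and its data are staged to the user processor over the
    # staging bus. Here we just return the memory span to be staged.
    return (staged.start_address, staged.start_address + staged.size_bytes)

queue = []
span = dispatch(TaskControlBlock(task_id=7, size_bytes=4096,
                                 start_address=0x1000), queue)
print(span)
```

The returned span stands in for the address range the system controller would drive onto the buses when loading the user processor.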
The operating system processor 502, together with the system controller 506, is capable of controlling up to four user processors as illustrated in FIG. 5. The system memory bus 132 is capable of transferring data between up to eight banks of memory wherein each memory comprises up to 32 megabytes, and the data can be transferred at a rate of 56 megabytes per second. As noted in connection with the description of FIG. 5, each user processor has a local memory 210 which is an associated local memory for that specific user processor, and the system memory bus 132 has system access to all of the memory banks including the local memory banks.
FIG. 10 is a logical block diagram which illustrates the logic scheme wherein a math processor 546 can gain access to its local memory 548 without the necessity of utilizing the system memory bus 132. This permits the operating system processor to determine, on a priority scheme, whether the system memory bus 132 has a higher priority data transfer request and, if so, the user processor 546 requiring data can access the same from the local memory 548 over the local data bus segment 556. However, if the user processor 546 requires data located in another memory, such as memory 548, that information can only be accessed over the system memory bus 132. If the user processor 546 requires data signals and addressing signals which are to be loaded therein from the staging bus, the staging bus 136 must directly load the user processor 546 with the required information through the system controller 506 under control of the operating system processor 502.
As discussed in connection with FIG. 8, each of the user processors 540 includes an XY machine which functions as the math processor. The XY machine includes memories which are capable of processing data at the rate of 56 MB per second. Likewise, the X machine and Y machine include staging memories which are able to transfer data at the rate of 56 MB per second. Each user processor includes instruction and data page tables, program memory organized as 128 pages of 2 KB each capable of transferring data at the rate of 112 MB per second, and data memories organized into 8 KB pages of memory wherein the data memories are capable of transferring data at the rate of 56 MB per second.
FIG. 11 is a logical block diagram illustrating the elements of the XY machine which functions as the math processor and of the A machine which functions as the organizational processor.
The A machine includes a program memory 600 which provides an instruction stream selected by means of a program sequencer 604 which is responsive to addressing signals. The output of the program sequencer 604 is applied to an address translation unit 606 which provides the specific location in the program memory 600. The information contained in the program memory 600 at the address derived by the address translation unit 606 is read out as an instruction stream into the instruction buffer 608. Instruction buffer 608 applies the specific instruction, in the form of a 96-bit word, into an instruction extraction means 610. The instruction extraction means 610 produces two instructions as an output, an XY instruction as shown by arrow 612, and an A instruction as shown by arrow 614. Continuing with the A machine instruction, the A machine instruction 614 is applied to an A decoder 616 which produces a control signal as an output, as illustrated by arrow 620. The output is applied to the remaining portion of the A machine, or organizational processor, which is capable of responding to the control signal to execute basic arithmetic functions such as shifts, logic operations (AND, OR, NOT, XOR, etc.), addition, subtraction, multiplication, and control functions (e.g., branch).
The A machine includes a register file 630 and a set of special registers 632 to provide the data upon which the arithmetic operation is to be performed, as designated by the A control 620. The output of register file 630 is applied to a left shift unit 634, to a multiplication unit 636, or to an ALU (adder) unit 638. The output of the ALU 638 is returned to one of the A machine special registers. In the alternative, if the results of the arithmetic operation performed by the organizational processor, or A machine, are to be utilized by the XY machine, the arithmetic results are applied to either of the XY buses, the XB or YB bus, so that synchronous transfer of data can be obtained between the organizational processor A machine and the XY machine. The operation of the A machine is under clocked control such that each arithmetic operation is performed in one clock cycle.
Referring now to the XY machine portion of FIG. 11, the XY instruction 612 is applied to a decode random access memory 650. If the XY instructions appearing on XY instruction input 612 contain a branch or subroutine component, that requirement is determined by a microsequencer 652 which, in turn, enables a microcode random access memory 654 to produce a microcoded instruction signal which is applied to the XY control 656. In any case, the decode random access memory 650 applies the decoded instruction during the first clock cycle to the XY control 656.
The XY machine has two major segments with shared elements. The X machine segment includes a memory 660 which is responsive to an address unit 662 wherein data from registers 664 and 666 are loaded by the address unit 662 into the X memory 660. Also, the X machine segment contains two registers 670 and 672 which are adapted to store data read from the X memory 660. The X machine segment includes a simple or basic arithmetic processor shown generally as 680 which is capable of performing IEEE/INTEGER, ALU and conversion operations, the output of which is applied to output registers 674 and 676. The output of the registers 674 and 676 is applied to the output buffer register wherein the results of the arithmetic operations are transferred directly to the organizational processor.
The Y machine segment includes a Y memory 684 which is controlled by Y address unit 686 which is capable of loading data from registers 690 and 692 into the Y memory. Information is read from the Y memory and is stored in registers 694 and 696. In a similar manner, the Y machine segment includes an arithmetic processor 700 which is capable of performing IEEE/INTEGER operations, multiplication, and elementary functions such as square root, division, sine, cosine, arctangent, exponential, logarithm, and the like. The results of the arithmetic processor 700 are stored in registers 702 and 704. Registers 670, 672, 694 and 696 are capable of having information from either the X machine segment or the Y machine segment gated therein under controlled gating transfers. In a similar manner, information stored in registers 674, 676, 702 and 704 can be gated therein from either the X machine segment or the Y machine segment, all as shown by the letter designations on the registers.
A switch 710 is located intermediate the X machine segment and the Y machine segment and functions to control the gating of information into the various registers, as described above, or to control the gating of information into either the processor 680 of the X machine segment or the processor 700 of the Y machine segment.
FIG. 12 illustrates the flexibility of the A machine in terms of gating of instructions and data among the various registers, arithmetic units, and buses which are interconnected with the XY machine. The instruction address unit 610, the A register file 630, the A special registers 632, the shift left unit 634, the multiplication unit 636, and the ALU 638 are numbered with the same numeral designations as in FIG. 11.
The A machine is interconnected by a plurality of buses such as, for example, AB bus 720, ALU bus 724, AR bus 744, X bus 732, Y bus 734, D data bus 740, and XY CB bus 742. External buses comprising the X bus 732 and Y bus 734 are used for bidirectional data, while the D data bus 740 and CB bus 742 are used for data interface.
The use of the various components of the A machine for transferring data and/or instructions over the various buses can be understood from the following example. Attention is directed to the A special registers 632 which are gated onto the AB bus 720 through a gating means 722 and into the ALU 638 A input. The A register file 630 B output is gated into the shift left unit 634, the output of which is gated into the ALU 638, B side. The ALU 638 produces an output which is applied to the AR bus 744 or to one of the A special registers 632.
FIG. 13 illustrates the X machine segment and the Y machine segment, which share common elements, by means of a block diagram including all the various buses and gating means therein. The components such as, for example, the Y memory 684 and the X memory 660, the X address control 662, the Y address control 686, the various registers 664, 694, 696 and 672, the output registers 674, 676, 702 and 704, and the arithmetic unit 638 and multiplication unit 636, are designated with the same symbols as appear in FIG. 11. The logic diagram of FIG. 13 illustrates that the various buses and gating means permit transfer of information between various system components including from the output registers 674, 676, 702 and 704 which are referred to as V, T, U and W registers, respectively.
FIGS. 14(A) and 14(B) are logic diagrams for the arithmetic registers wherein FIG. 14(A) is for the X input and FIG. 14(B) is for the Y input. FIG. 14(A) illustrates that the registers 664 and 696 are gated at the output thereof such that the information contained therein can be applied to other arithmetic registers 676 and 674 to enable the results of the various executed programmed arithmetic operations to be available for subsequent operations. In a similar manner, FIG. 14(B) shows that registers 692 and 696 of the Y machine segment can be gated into the registers 702 and 704 of the Y machine segment or, by means of the various buses, can be gated into the registers of the X machine segment. The arithmetic registers of FIGS. 14(A) and 14(B) include transparent or latch registers shown generally as 693 which can be transparent or can latch depending on the program.
FIG. 15 illustrates a logical diagram for the X memory address controller showing the various components thereof. FIG. 15 illustrates that the X memory address controller controls storing and reading of data therein. The address controller is capable of expanding the boundaries of stored information within the memory under control of the address controller such that the length of the data word can vary between a maximum and a minimum. In FIG. 15, the elements which are common with the elements of FIG. 11 are designated by the same numerals. Thus, registers 664 and 666 are adapted to act as the registers utilized for input into the X memory address controller. Register 664, under clocked control, directly transfers data into arithmetic unit 670, and the output of the register 664 can likewise be clocked onto other buses, such as buses 670 and 672. The input register 666 has its output applied to the same arithmetic unit 670 and to arithmetic unit 672. The arithmetic calculators 670 and 672 calculate the amount of memory required for the specific data word in terms of bit storage. The output of processor 670 is applied as an input to the arithmetic unit 672. The output of the arithmetic unit 672 is applied to a maximum-minimum calculator comprising arithmetic units 678 and 680. Arithmetic unit 678 has as one input a minimum memory limit which is determined by the x min circuit 682. The input from the x min circuit 682 is applied to the arithmetic unit 678 to determine whether the required storage is under the minimum. In a similar manner, the maximum amount of storage required in the memory is determined by an x max circuit 690, the output of which is applied as one input to the arithmetic unit 680 which determines whether the maximum storage required for the data word is in excess of the maximum set by circuit 690. The amount of storage required in the memory for a data word is determined by arithmetic units 678 and 680.
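The maximum-minimum calculation described above reduces to a pair of comparisons against the x min and x max limits. The following is a hedged sketch under stated assumptions; the function name and the pass/fail framing are illustrative, since the patent describes only the roles of the two arithmetic units.

```python
# Illustrative model of the maximum-minimum calculator: the required storage
# for a data word is checked against the limits set by the x min circuit 682
# and the x max circuit 690.

def check_bounds(required_words, x_min, x_max):
    """Return whether the required storage falls within [x_min, x_max]."""
    under = required_words < x_min   # role of arithmetic unit 678
    over = required_words > x_max    # role of arithmetic unit 680
    return not under and not over    # result handed to the transfer circuit

print(check_bounds(16, x_min=8, x_max=64))   # within limits
print(check_bounds(128, x_min=8, x_max=64))  # exceeds the maximum
```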
The outputs of units 678 and 680 are applied to a transfer circuit 692.
The Y memory has a YM address controller identical to the XM address controller illustrated in FIG. 15.
FIG. 16 is a logical block diagram of two fast [45 ns] 16 KB data caches which are used by the compilers as fast program stacks and register files for the XY math processor. In addition, the byte-addressable data caches can also be used as fast array memories when required. The X memory utilizes a first data cache 700 and a second data cache 702, and the Y memory utilizes two data caches 708 and 710. The data caches can be loaded by the system controller through clocked gating means shown generally as 712 and 714. The inputs to the gating means 712 and 714 can be from a variety of sources such as the X bus, the staging bus, or the organizational processor.
In a similar manner, the Y memory data caches 708 and 710 are gated to receive data signals for use either as fast array memories or as byte-addressable data caches. When one set of data caches is being used by the XY machine, the other set of data caches is being loaded. When required, the gating means 712, 714, 716 and 722 are gated to enable the XY machine to use the newly loaded data caches as active caches while the now inactive caches are loaded for subsequent use.
In the embodiment illustrated in FIG. 16, each of the dual data caches, 700 and 702 for the X machine and 708 and 710 for the Y machine, may be used by the compilers as a fast program stack and as register files. In other modes, as selected by the various applications compilers, the byte-addressable data caches can also be used as fast array memories. Each of the dual, 32-bit wide data caches 700 and 702 for the X machine and 708 and 710 for the Y machine can be loaded by the system controller and are duplicated to permit fast context switching through the switching means 712, 714, 720 and 722. The data caches 700, 702, 708 and 710 can also be loaded by the math processor on a single location load or store basis (random access).
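The duplicated-cache arrangement above amounts to a ping-pong scheme: one cache set is active for the XY machine while the other is loaded, and the gating swap effects the fast context switch. The following sketch is an illustrative model of that behavior, not of the hardware gating itself; all names are assumptions.

```python
# A sketch of the dual-cache ("ping-pong") arrangement: while the XY machine
# uses one pair of data caches, the system controller loads the other pair;
# swapping the gating then makes the freshly loaded pair active.

class DualCaches:
    def __init__(self):
        self.banks = [["current-task-data"], []]  # two cache sets
        self.active = 0                           # set used by the XY machine

    def load_inactive(self, data):
        self.banks[1 - self.active] = list(data)  # system controller loads

    def context_switch(self):
        self.active = 1 - self.active             # gate the loaded set active

    def read_active(self):
        return self.banks[self.active]

caches = DualCaches()
caches.load_inactive(["next-task-data"])  # loaded during current execution
caches.context_switch()                   # fast switch: no reload stall
print(caches.read_active())
```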
As noted hereinabove, the input data transfer mechanism between the math processor and the data memory, such as the local memory 548 in FIG. 8, is a special 64-bit wide, 3-register deep, first-in-first-out buffer which allows for relative data movement flexibility between the math processor and the memory system. Such an arrangement permits the organizational processor to move ahead of the arithmetic operation being executed by the user processor and its math processor to acquire the next three data items; while the math processor is utilizing these three data items, the organizational processor can be executing sequence control instructions and specific integer operations as required to keep the user processor operating at maximum efficiency. The output data transfer mechanism between the user processor and data memory is a single stage output buffer. Once the results of the programmed arithmetic operation executed by the math processor have been transferred to the output buffer, the user processor continues with further processing while these results are written to main memory storage.
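The 3-register-deep input buffer can be modeled as a bounded FIFO: the organizational processor stages up to three items ahead, and the math processor drains them in arrival order. The depth-3 limit is from the text above; everything else in this sketch is an illustrative assumption.

```python
# A minimal model of the 3-register-deep, first-in-first-out input buffer:
# the organizational processor prefetches up to three data items ahead of
# the math processor, decoupling memory access from arithmetic execution.

from collections import deque

class InputFIFO:
    DEPTH = 3  # 64-bit wide, 3 registers deep

    def __init__(self):
        self.regs = deque()

    def push(self, word):
        """Organizational processor stages the next data item, if room."""
        if len(self.regs) >= self.DEPTH:
            return False   # full: math processor has not consumed yet
        self.regs.append(word)
        return True

    def pop(self):
        """Math processor consumes the oldest item, first-in-first-out."""
        return self.regs.popleft() if self.regs else None

fifo = InputFIFO()
accepted = [fifo.push(w) for w in ("d0", "d1", "d2", "d3")]
print(accepted)          # fourth push refused: buffer is only 3 deep
print(fifo.pop())        # math processor drains in arrival order
```

While the three staged items are consumed, the staging side is free to do other work, which is the efficiency argument made in the text.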
FIG. 17 illustrates the operation of an instruction sequencer which is utilized by the user processor for producing from its program memory the instruction stream required to control the A and XY processors. Typically, the address will be stored in register 900 and, when clocked, the addressing signals are applied to a program translation table 902 which addresses a specific location in the program memory. The same information applied to the program translation table 902 is loaded into an adder 901 for subsequent use if necessary. The program memory 904 contains instruction streams comprising four 32-bit instructions which are stored in a 256 KB program memory accessed 16 bytes at a time. The addressing signals cause a full 128-bit word instruction stream to be read out of the program memory 904, as illustrated by the shaded portion 906 in memory 904. The instruction stream represented by shaded portion 906 is transferred via transmission means 910 into an instruction buffer 912. The instruction buffer is required because each of the transfers of the instruction stream is in response to a single clock cycle. Thereafter, and in response to the next clock cycle, the instruction buffer transfers the information it has therein via transmission means 914 into rotating network 916 which preprocesses the instruction stream to place it into a certain format to determine whether any type of arithmetic operation is to be performed and, if so, which arithmetic processor, namely the XY machine or the A machine, is to perform the operation. The first bit of the instruction stream in the rotating network 916 is sensed to determine whether an arithmetic process is required and, if not, that information is passed via connecting means 920 to an adder 923 which stores the same in the shift register 922 to indicate that no arithmetic operation is required.
If an arithmetic operation is required, the first bit of the instruction stream contained in rotating network 916 is sampled to determine whether the arithmetic operation is required to be performed by the XY machine or the A machine. If the arithmetic calculation is to be executed by the XY machine, the appropriate instruction appears on output 930.
If the instruction designates that an arithmetic operation is to be performed by the A machine, that information is sensed by the X instruction width detecting means 932 which causes the instruction to be transferred via communication means 934 to a left shift register 940 at the same time the information is passed to the XY machine for decoding by the rotating network 916. The XY machine decoding produces an X or Y control signal to cause the X machine segment or the Y machine segment to execute the assigned arithmetic operation. Thereafter, the portion of the instruction which is applicable to the A machine is applied by output 942 to the A machine decoder which decodes that portion of the instruction stream to produce an A control signal which causes the A machine to execute the assigned arithmetic operation.
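The dispatch step described in the preceding paragraphs can be sketched as sensing a leading bit and routing the instruction to the appropriate decoder. The 1-bit tag, the 32-bit word size, and the field layout are illustrative assumptions; the actual instruction format is not specified at this level of the description.

```python
# A hedged sketch of instruction dispatch: the leading bit of a preprocessed
# instruction word selects whether the XY machine or the A machine decodes
# it. The encoding here is an assumption for illustration only.

def dispatch_instruction(word):
    """Route a 32-bit instruction word to the XY or A decoder by its first bit."""
    tag = (word >> 31) & 1          # first bit of the instruction
    body = word & 0x7FFFFFFF        # remaining instruction fields
    if tag == 1:
        return ("XY", body)         # decoded toward the X/Y machine segments
    return ("A", body)              # decoded by the A machine decoder

print(dispatch_instruction(0x80000004))  # tag bit set: XY machine
print(dispatch_instruction(0x00000004))  # tag bit clear: A machine
```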
FIG. 18 is a block diagram representing the design of the compiler, which is capable of making all of the parallel expansion decisions at compile time in lieu of complicated preprocessors and postprocessors dividing and distributing the program code among several coprocessors in parallel. As a result, the user processor is relieved of the requirement of dividing and distributing the program code for execution, and the need for additional hardware to accomplish the distribution and execution of the program code is eliminated.
In FIG. 18, the structure of compilers for the user processor utilized in the integrated, multicomputer data processing system described in connection with FIGS. 4, 6 and 8 comprises several front end language processors such as a C compiler front end 730, a FORTRAN 77 compiler front end 732, a VAX FORTRAN compiler front end 734, and other compiler front ends illustrated by rectangles 736 and 738. The outputs of the front end compilers are applied to a global optimizer and user processor code generator 746 which functions as a single back-end common code generator and optimizer. As noted hereinbefore, the primary languages preferred for use in the user processor of the present invention are C and FORTRAN 77, both of which are compatible with the UNIX standard.
The global optimizer and compute engine code generator produces a compute engine object code, or user processor object code, represented by arrow 748, which is loaded into a linker/loader system 756. Other object modules and run time routines from a plurality of sources, as represented by rectangle 752, can likewise be applied to the linker/loader 756 as represented by arrow 758. The output of the linker/loader 756 is an executable compute engine object code, or an executable user processor object code, appearing on lead 760. The output from the linker/loader also appears on a second output 764 which is applied to a screen-oriented symbol debugger 766 for viewing by the programmer. The global optimizer and compute engine code generator 746 uses the following optimizing techniques:
(1) Flow analysis;
(2) Low level common subexpression elimination;
(3) Folding and propagation of constants;
(4) Dead store elimination;
(5) Register allocation by lifetime analysis;
(6) Loop optimization; and
(7) User processor optimization.
The compiler rearranges the normal sequence of program execution in order to make the most efficient use of the user processor. For example, the user processor can simultaneously execute a simple math operation such as an add, a complex math operation such as a multiply, and a fetch/store operation from three memories, two data caches plus main memory. Since the user processor can simultaneously execute a number of programmed arithmetic operations, the optimizer rearranges the code to ideally perform all five operations in each and every cycle, as many times as possible. Rearranging the code may result in the user processor's calculating parts of a problem out of normal sequence. An example wherein the code is rearranged to calculate parts of a problem out of normal sequence is that, instead of calculating an entire mathematical formula on each pass of a 1,000 iteration loop, the compiler may arrange to calculate the first two terms of the expression 1000 times, put the results in a data cache table, go back and calculate the next two terms 1000 times, and so on. The end result is that the calculations are performed in a highly efficient manner by the user processor. The compiler functions to arrange the mix of operations to process the programmed arithmetic operations and to perform the maximum number of operations at the same time in a single clock cycle. However, the compiler optimizes the arrangement such that the mix of operations does not result in the user processor's becoming idle during execution of programmed arithmetic operations or calculations as the result of a poor mix of coded, programmed arithmetic operations. As noted hereinabove, the optimization operations take place in the global optimizer and compute engine code generator 746, which is a second stage of the compile process, which enables the programmer to utilize any one of a number of front end languages as illustrated in FIG. 18.
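The loop-splitting transformation described above can be rendered concretely. The formula below is a made-up example; the point is only that computing the first terms across all iterations into a cache table, then finishing the expression over that table, yields the same results as the normal per-iteration ordering.

```python
# Illustrative rendering of the compiler transformation described above:
# instead of evaluating the whole expression on each pass of a 1,000-
# iteration loop, the first terms are computed 1,000 times into a table,
# then the remaining term is applied over the cached partial results.

N = 1000
xs = list(range(N))

# Normal sequence: full expression per iteration.
normal = [2 * x + 3 * x * x + 7 for x in xs]

# Rearranged sequence: first two terms 1000 times, results into a table...
table = [2 * x + 3 * x * x for x in xs]
# ...then the remaining term 1000 times over the cached partial results.
split = [t + 7 for t in table]

print(split == normal)
```

Either ordering is mathematically equivalent; the split form lets each pass present the hardware with a uniform mix of operations, which is the scheduling freedom the optimizer exploits.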
The integrated, multicomputer data processing system of the present invention utilizes a modified version of the Berkeley 4.2bsd UNIX software in the preferred embodiment. FIG. 19 illustrates how the modified Berkeley 4.2bsd UNIX software, which was originally designed to operate in a monolithic computer environment, has been converted to a multicomputer, time-sharing system which is capable of handling a plurality of users to perform scientific and engineering compute-intensive applications involving simulation, signal processing, image processing, and other complex math processing. As illustrated in FIG. 19, the Berkeley 4.2bsd UNIX software includes support for a C shell Command Interpreter 802, UNIX utilities 804, and a wide range of scientific applications as represented by rectangles 806, and is capable of use with the Lan and Datacomm network requests represented by rectangles 814. The C shell Command Interpreter 802, UNIX utilities 804, applications 806 and Lan and Datacomm network requests 814 comprise the application layer of the operating system function. A user, by means of a terminal 808, is able to interface through user interface 810 with either the C shell Command Interpreter 802, the UNIX utilities 804, or applications 806. In the alternative, the user can utilize an intelligent work station 816 which, through a user interface 822, can be operatively connected to the Lan and Datacomm network requests 814 in order to perform the desired operating system function.
Each of the above described C shell Command Interpreter 802, UNIX utilities 804, applications 806 and Lan and Datacomm network requests 814 is applied via communication means represented by arrows 830 into the system call interface layer 832, which includes a UNIX system kernel or Kernel Processor 836. The system call interface 832 provides the capability of distributing software tasks among input-output peripherals 846, user processors 848, and additional multibus & ETHERNET systems 852, which represent the hardware level for performing the distributed, parallel processing tasks.
The operating system function represented by FIG. 19 is extremely versatile and modular, allowing several user processors and work stations to be attached to the integrated, multicomputer data processing system at the same time and structured such that the users perceive a standard monolithic 4.2bsd UNIX environment for developing and executing their software. In use, the system call interface layer 832 separates the monolithic image of the application layer from the hardware layer where the tasks are distributed and executed in parallel.
To illustrate the way in which the integrated, multicomputer data processing system user processor works to produce an integrated combination of mathematical operations in parallel, a representative FORTRAN example set forth below (comprising a program segment taken from an actual hydrodynamics problem) will be examined for the user processor architecture:

CODE                                         EXPLANATION
SUBROUTINE ALPHA                             name of the subroutine
COMMON/spaces/ n                             declare the constant n at location spaces
COMMON/spacer/ q,r,t                         declare constants q, r & t, starting at location spacer (4 bytes each)
COMMON/spacel/ x(1001), y(1001), z(1001)     declare (matrix) variables x, y & z, each 1001 data points long, with x starting at location spacel, followed by y & z (4 bytes each item)
DO 1 k=1,n                                   loop n times, calculating the formula below and building a matrix of values
x(k) = q + y(k)*(r*z(k+10) + t*z(k+11))      formula for variable x
END                                          go on to next task
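For readers tracing the arithmetic, the kernel above can be transliterated into C (a sketch for illustration only; array offsets are converted from FORTRAN 1-based to C 0-based indexing):

```c
/* C transliteration (an illustrative sketch, not the patented code) of
 * the FORTRAN kernel above:
 *   x(k) = q + y(k)*(r*z(k+10) + t*z(k+11))  for k = 1..n
 * Arrays are 0-indexed here, so x[k-1] corresponds to FORTRAN x(k),
 * and z[k+9] corresponds to z(k+10). */
static void alpha(int n, float q, float r, float t,
                  const float *y, const float *z, float *x)
{
    for (int k = 1; k <= n; k++)
        x[k - 1] = q + y[k - 1] * (r * z[k + 9] + t * z[k + 10]);
}
```

Each pass thus performs two multiplies, a multiply-accumulate, and a final add, which is the operation mix the compiled instruction stream below schedules across the processors.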
The user processor can process the above using the power of its integrated processor architecture to expand the code out in parallel and execute it through three specialized high speed processors. The FORTRAN compiler for the user processor builds special 96-bit wide machine instructions that simultaneously direct the operations of the three processors and several memories, namely, the local, global, instruction cache, data caches, and the like. The machine instructions are formed into two parallel streams: (1) instructions for the organizational processor; and (2) instructions for the scientific processor. Each instruction has either an organizational processor part or a math processor part, or both. The FORTRAN example is compiled below as an illustration:

XY INSTRUCTIONS        A INSTRUCTIONS        EXPLANATION
START                  Rdf spacer+4          Fetch constant r (goes into top of FIFO)
                       Rdf spacel+8044[R1]   Calculate address & fetch (z+10)+k (goes into top of FIFO)
                       Rdf spacer+8          Fetch constant t (goes into top of FIFO)
Move FIFO,MX           Rdf spacel+8048       Move the bottom item in the FIFO (r) to MX; and calculate address & fetch (z+11)+k (goes into top of FIFO)
Mpyf1 FIFO,MX          Rdf spacel+4000[R1]   Multiply (z+10)+k (next item at bottom of FIFO) times MX (r); calculate address & fetch y (goes into top of FIFO)
Move MR,F2; FIFO,MX    Rdf spacer            Move the results of the previous multiply (MR) ([(z+10)+k]x[r]) to F2, move the bottom of the FIFO (t) to MX; and fetch constant q
Mpyf1 FIFO,MX                                Multiply (z+11)+k (next item at bottom of FIFO) times MX (t)
Move MR,F3                                   Move the results of the previous multiply (MR) ([(z+11)+k]x[t]) to F3
MpyAd F2,F3,FIFO                             Multiply F2 times F3 and add the bottom entry in the FIFO (y) to the result
Move AR,F2; FIFO,F1                          Move the previous multiply & add results (AR) to F2, and move the bottom of the FIFO (q) to F1
Addf F1,F2,F2          DecrBrnz R2,START     Add F1 to F2 ([q]+[y(k)*(r*z(k+10) + t*z(k+11))]) and place the results in F2; decrement the loop count in R2, test for zero, and branch if non-zero to START (branching takes 3 cycles, but the following two instructions will be executed on the way to the new branch location (START); if the branch is not taken, jump the following 2 instructions and begin executing)
Move F2,OUT            Writef spacel-4[R1]   Move F2 (completed calculation of x) to OUT (output buffer); calculate the address for x (x[R1]) and store the value in its matrix in memory
                       Incr 4,R1             Increment R1 by 4
In the above compilation, the user processor assembly program has 13 instructions. The number of assembly instructions is not, however, the real measure of processing time; the real measurement is the number of machine cycles used and the cycle time of the machine required to complete the calculation. For purposes of example, in examining how many cycles the user processor uses, four control fields are assumed in the instructions, identified as A, B, C, and D. The four control fields are defined as follows:
(1) Field A controls the simple and complex math processors and associated registers;
(2) Field B controls data movements from/to data caches and the FIFO connection between math processors and the organizational processor;
(3) Field C controls address calculations and registers for the organizational processor, as well as data movements to the FIFO connection; and
(4) Field D controls reads/writes from/to local or global memory, along with branching operations.
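The four-field instruction word can be sketched as a C structure. The field widths shown are illustrative assumptions only, since the text specifies a 96-bit instruction but not the sizes of the individual fields:

```c
/* Hypothetical sketch of a 96-bit wide instruction carrying the four
 * control fields described above.  The field widths (32/32/16/16 bits)
 * are illustrative assumptions, not taken from the patent. */
#include <stdint.h>

typedef struct {
    uint32_t field_a;  /* simple/complex math processors & registers   */
    uint32_t field_b;  /* data cache and FIFO data movements           */
    uint16_t field_c;  /* address calculations, organizational regs    */
    uint16_t field_d;  /* local/global memory reads/writes, branching  */
} wide_instr;          /* 4 + 4 + 2 + 2 bytes = 96 bits total          */
```

Because the fields are independent, one such instruction can drive a math operation, a cache/FIFO movement, an address calculation, and a memory access in the same clock cycle, which is what the cycle-by-cycle table below illustrates.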
The user processor executes the dual instruction stream in parallel using the four control fields as follows:

CYC   FIELD A            FIELD B    FIELD C         FIELD D    FIFO
 1                                                  Read r
 2                                  Addr (z+10)+k   Read z+10
 3                                                  Read t     r
 4                       FIFO->F1   Addr (z+11)+k   Read z+11  z+10
 5    Mpy FIFO,F1        FIFO->MI   Addr y+k        Read y     t
 6    MR->F2             FIFO->F1                   Read q     z+11
 7    Mpy FIFO,F1        FIFO->MI                              y
 8    MR->F3                                                   q/y
 9    MpyAd F2,F3,FIFO   FIFO->AI                              q
10    AR->F2             FIFO->F1                              —
11    Add F1,F2                     Decr R2         BrNZ       —
12    F2->OUT                       Addr x+k        Write x    —
13                                  Incr R1
CYCLE  EXPLANATION
1      Memory read at spacer+4 (r)
2      Memory read at spacel+8044 (z+10)
3      Memory read at spacer+8 (t)
4      Move r to F1; memory read at spacel+8048 (z+11)
5      Multiply z+10 times r (FIFO moves to Multiplier Input (MI) first); memory read at spacel+4000 (y)
6      Move multiply results (MR) to F2, move FIFO (t) to F1; memory read at spacer (q)
7      Multiply z+11 times t (FIFO moves to Multiplier Input first)
8      Move multiply results (MR) to F3
9      Multiply F2 times F3 & add FIFO (y) (FIFO moves to Adder Input (AI) and is added to Multiplier Output)
10     Move Adder Results (AR) to F2, move FIFO (q) to F1
11     Add F1 to F2, decrement loop count, branch if non-zero result in R2, executing the following 2 instructions on the way; otherwise jump 2 instructions if zero result
12     Move F2 to Output buffer (OUT), store results at spacel-4 (x)
13     Increment matrix count
As noted above, the user processor assembly program consisted of 13 instructions. Because of the overlapping operation of the organizational processor, simple math processor and complex math processor, the user processor requires only 13 cycles to execute those 13 instructions. In the preferred embodiment, the basic machine cycle time of the user processor is 143 nanoseconds. One pass through the loop therefore takes 13 x 143 = 1859 nanoseconds, providing an equivalent speed of approximately 538,000 passes per second, or 6.99 MIPS and 2.69 MFLOPS.
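The quoted performance figures follow directly from the cycle count. A minimal check, assuming the stated 143 ns cycle time, 13 instructions per pass, and 5 floating point operations per pass (two multiplies, one multiply-add counted as two operations, and one add):

```c
/* Check of the timing figures quoted above: 13 cycles at 143 ns,
 * with 13 instructions and 5 floating point operations per loop pass. */
#define CYCLE_NS    143.0
#define CYCLES_PASS 13
#define INSTR_PASS  13
#define FLOPS_PASS  5   /* 2 multiplies + multiply-add (2 ops) + 1 add */

static double pass_ns(void)        { return CYCLES_PASS * CYCLE_NS; }   /* 1859 ns    */
static double passes_per_sec(void) { return 1e9 / pass_ns(); }          /* ~538,000   */
static double mips(void)   { return passes_per_sec() * INSTR_PASS / 1e6; } /* ~6.99  */
static double mflops(void) { return passes_per_sec() * FLOPS_PASS / 1e6; } /* ~2.69  */
```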
The integrated, multicomputer data processing system is well designed for higher mathematics applications like simulation, signal processing, image processing, and the like. The scientific user or an intelligent compiler must determine how to convert mathematical constructs like Fast Fourier Transforms (FFT), convolutions, function evaluations, and numerical integrations into a complex series of add/subtract/shift/multiply/divide instructions, with other operations like trig functions, square roots, and the like, processed by elaborate floating point subroutines.
The integrated, multicomputer data processing system described herein uses the following processing methods for higher mathematics applications:
USER PROCESSOR
ITEM  MATH FUNCTION            METHOD                  CAPABILITY
1     Linear Equations         Dot Product             Simultaneous add & multiply (floating point or integer)
2     Convolutions             Dot Product             Single cycle execution of integer and single precision floating point operations
                               SAXPY/DAXPY             Three independently addressable memories (2 data caches and main memory)
3     Function Evaluation      Polynomial Evaluation   Perform an add and initiate a multiply with the sum
                                                       Microcoded Newton method with ROM loaded start point
                               Iterative Methods       Special exponent processing hardware
4     Numerical Integration    Function Evaluation     Microcoded FORTRAN intrinsics
                               Dot Product             See above
5     Fast Fourier Transform   Butterfly               Single precision trig tables in ROM
                                                       Bit reverse addressing of array memories
                                                       Simultaneous add & multiply

An example of how the user processor executes a Fast Fourier Transform is as follows. In the mathematical treatment of Fourier Series, it is common to run across the following calculations:

x' = Σ(x + Δy), and
y' = Σ(x - Δy),
where Δy = (cosθ + isinθ)(y1 + iy2)
         = [(y1cosθ - y2sinθ) + i(y1sinθ + y2cosθ)]
The above calculation is to be repeated and the results thereof summed for "n" iterations. In this example, the more iterations made by the user processor, the greater the accuracy of the results of the calculation.
In this example, each pass would require four multiply operations and six additions (comprising adding the two numbers in each of two sets of parentheses, plus the cumulative summing of x' and y' results at the end of each pass).
The user processor has tables of single precision floating point trigonometric values in ROM, each 4K entries long. A request for a sine or cosine value is thus a simple fetch from fast cache which can be performed without any waiting cycles. The user processor performs simultaneous adds and multiplies via the simple math processor and complex math processor. Thus, the six adds and four multiplies take only six cycles, for a total of approximately 14 cycles per pass.
Thus, the user processor operates at 4.99 MFLOPS, calculated as follows: 14 cycles x 143 ns/cycle = 2002 ns/pass, or 499,000 passes/sec; with 10 floating point operations per pass, this equals 4.99 MFLOPS.
The integrated, multicomputer data processing system architecture, operating system and compiler operations are designed such that there is a mixture of vector and nonvector operations to be processed. The user processor actually operates as a high-speed processor due to the compiler's ability to rearrange the processing task to capitalize on more parallel opportunities in the function being calculated without any degradation in performance due to the mix of operations.

Claims

WHAT IS CLAIMED IS
1. An integrated, multicomputer data processing system having a control section including means for system input and output and at least one user processor section having a math processor, said control section comprising an operating system processor for controlling transfer of data and instruction addresses to the user processor section, said operating system processor including means for generating command signals which control loading of programs and data signals, and addressing signals into the user processor section while the user processor section is executing a programmed arithmetic operation and for enabling the user processor section to execute the programmed arithmetic operation on the loaded data using instructions fetched from program memory within the user processor section in response to the instruction addresses; a system controller operatively coupled to said operating system processor for receiving said command signals, said system controller being responsive to the command signals to transfer the data and instruction addresses to the user processor section and being responsive to the command signal received from the operating system processor to enable the user processor section to execute the programmed arithmetic operations on the loaded data in response to instructions fetched from said program memory, said system controller being responsive to the operating system processor for concurrently loading instruction addresses and data to be executed by the user processor section and for transferring therefrom the results of an executed programmed arithmetic operation onto the means for system input and output under control of the operating system processor; and means operatively coupled between said system controller and user processor section for transferring data and instruction addresses therebetween under control of the operating system processor and for loading data and instruction addresses signals into the user processor section under control 
of the operating system processor, said data and instruction addresses transferring means further including means for loading instructions and data into the user processor section including the program memory and for transferring the results of executed programmed arithmetic operations from the user processor section to the means for system input and output under control of the operating system processor; said user processor section including means responsive to instruction addresses for fetching instructions from the loaded program memory in the user processor section and for enabling the math processor to execute programmed arithmetic operations on the loaded data in response to the signals.
2. The integrated, multicomputer data processing system of claim 1 wherein said data and instruction addresses transferring means include a system memory bus operatively connected between the system controller and user processor for loading data and instruction addresses into the user processor section; and a staging data bus operatively connected between the system controller and user processor for loading programmed arithmetic operations, data and instruction addresses into said user processor section including said math processor and said program memory.
3. The integrated, multicomputer data processing system of claim 2 wherein said integrated, multicomputer data processing system further includes a high-speed input-output section and a global memory and wherein said control section further includes an input/output data transfer means for system input and output under control of said operating system processor.
4. The integrated, multicomputer data processing system of claim 3 wherein said user processor section includes a random access memory operatively connected to said system memory bus; and wherein said operating system processor includes a high-speed random access memory operatively coupled to said input/output data transfer means and said operating system processor, said high-speed memory being capable of storing digital data representing a task control block including instruction addresses for providing programmed arithmetic operations to be performed by the user processor, the length of the programmed arithmetic operation, and its starting address in the user processor section random access memory.
5. The integrated, multicomputer data processing system of claim 2 wherein the user processor includes an organizational processor operatively connected to said system memory bus, said staging data bus, said user processor section random access memory, said program memory, and the math processor, said organizational processor including a single stage output buffer for enabling the organizational processor to transfer the results of an executed programmed arithmetic operation into the random access memory and a multistage, first-in-first-out input buffer for enabling asynchronous transfers of data from the random access memory to the organizational processor.
6. An integrated, multicomputer data processing system having a global memory section, a high-speed input-output section, a control section including an input/output data transfer means for system input and output and at least one user processor section having at least two arithmetic processors and a local memory, said control section comprising an operating system processor including memory means for controlling transfer of data and instruction addresses to the user processor section, said operating system processor being operatively coupled to the input/output data transfer means for system input and output and being adapted to program arithmetic operations to be executed by the user processor section, said operating system processor including means for generating command signals which control loading of instruction addresses, which are used as a program for arithmetic operations, and data into the user processor section while the user processor section is executing a programmed arithmetic operation and for generating command signals for enabling the user processor section to execute the programmed arithmetic operation on the loaded data using instructions fetched from a program memory within the user processor section in response to the instruction addresses; a system controller operatively coupled to said operating system processor for receiving said command signals, said system controller being responsive to the command signals to transfer the instruction addresses and data to the user processor section and being responsive to the command signals received from the operating system processor to enable the user processor section to execute the program of arithmetic operations on the loaded data in response to instructions fetched from said program memory, said system controller being responsive to the operating system processor for concurrently loading instruction addresses and data to be executed by the user processor section and for transferring therefrom the 
results of an executed programmed arithmetic operation onto the input/output data transfer means under control of the operating system processor; a system memory bus operatively coupled between said system controller, the data global memory section, the input-output section and user processor section for transferring data and instruction addresses therebetween under control of the operating system processor and for loading data and instruction addresses into the user processor section under control of the operating system processor; and a staging data bus operatively connected between said system controller and said user processor section for loading instruction addresses and data into the user processor section including the program memory and for transferring the results of executed programmed arithmetic operations from the user processor section to the input/output data transfer means under control of the operating system processor; said user processor section including a buffering means for enabling programmed, asynchronous transfer of numerical results between the user processor and the memory means and means responsive to instruction addresses for fetching instructions from the loaded program memory in the user processor section and for enabling the arithmetic processors to execute programmed arithmetic operations on the loaded data in response to the command signals enabling the user processor section to continuously execute programmed arithmetic operations and the executed arithmetic operations being continuously processed through the user processor section to the memory means.
7. The data processing system of claim 6 wherein the user processor section further includes an instruction sequencer which is responsive to instructions fetched from the random access memory means in response to instruction addresses; an organizational processor including means for directing instruction addresses into said random access memory means and for transferring data between the user processor and memory means by means of an output register and input buffering means; wherein said arithmetic processors comprise at least one math processor which is adapted to perform basic arithmetic functions, including addition, multiplication, and division in a single clock cycle and integer calculations which are performed in a predetermined number of clock cycles.
8. The data processing system of claim 7 wherein said instruction sequencer is responsive to an instruction stream fetched from the random access memory in response to the instruction addresses and for decoding said instruction stream into control signals and applying the control signals to the math processor enabling the same to execute the programmed arithmetic operations and to transfer numerical operations results therefrom via the output register means to the memory means in a single clock cycle.
9. The data processing system of claim 8 wherein said local memory includes a means for transferring data to the organizational processor through an input buffering means.
10. The data processing system of claim 6 further comprising a second user processor section having an organizational processor, a math processor and a memory means, said second user processor section being operatively coupled to said system memory bus and said staging data bus and being responsive to said operating system processor to program the math processor of said second user processor section to perform programmed arithmetic operations producing numerical results which are capable of being applied by the math processor of said second user processor section to the organizational processor and from the organizational processor through the output register means to the memory means which is capable of transferring the same over the system memory bus to another user processor section; said operating system processor including means for controlling transfer of numerical operations results between each of said user processor sections over said system memory bus.
11. The data processing system of claim 10 wherein said second user processor section includes an input buffering means for asynchronously transferring data from its memory means into its organizational processor.
12. The data processing system of claim 6 wherein said input-output section comprises a high-speed input output processor including means for transferring data to the system memory bus under control of said operating system processor; and a multiple port data global memory which is operatively connected to a high-speed input-output processor.
13. The data processing system of claim 7 wherein said instruction sequencer includes an instruction buffer register which is adapted to receive and store an instruction stream fetched from the local memory; a rotating network operatively coupled to said instruction buffer register to receive the instruction stream from the instruction buffer register, said rotating network including means for determining if the instruction is for a selected one of the math processor and organizational processor and to decode a section of the instruction stream to produce a control signal which is applied to the selected one of the math processor and organizational processor; and at least one shifting means operatively coupled to said rotating network for receiving the same instruction stream and including means for determining if the instruction is for the other of said math processor and organizational processor wherein said at least one shifting means upon making such a determination shifts the instruction stream in a predetermined sequence until the section of the instruction stream relating to said other of the math processor and organizational processor is in position to be decoded and for decoding said shifted section of the instruction stream to produce a control signal which is applied to the other of the math processor and organizational processor.
14. The data processing system of claim 13 wherein said user processor section includes an output register which stores numerical results of the math processor operations and which is capable of transferring said numerical results to the local memory in a clocked cycle; and a first-in-first-out buffer register operatively coupled between said organizational processor and said local memory for storing data to be applied as inputs to the math processor and organizational processor and for selectively transferring data therefrom in clocked cycles to the applicable math processor and organizational processor for further processing.
15. The data processing system of claim 14 wherein said math processor further includes a microprogramming means which is responsive to an instruction to produce microcoded instructions which are used for controlling the numerical operation of the math processor of the user processor section requiring more than one clocked cycle to complete.
16. The data processing system of claim 6 further comprising a clocking means operatively coupled to each of said operating system processor, said system controller, and the user processor section to synchronize and clock the operations thereof.
17. A user processor adapted for use in a data processing system having a system memory bus, a system staging bus and an operating system processor for controlling data and instruction addresses flow into the user processor, said user processor comprising a random access memory including a first data bus adapted to transfer data between the random access memory and the system memory bus and a second data bus for internal data transfer between the random access memory and the user processor independently of the first data bus; an organizational processor operatively coupled to said random access memory through said second data bus through an input buffer means for fetching data from the random access memory, said organizational processor including means for interactive transfer of data from the system memory bus to said organizational processor, said organizational processor including means responsive to the system staging bus for receiving instruction addresses and data which are loaded into the user processor while the user processor is executing a programmed arithmetic operation; a high-speed program memory including means operatively coupled to the system staging bus wherein instructions are loaded therein from the system staging bus under control of the operating system processor, said high-speed program memory including means responsive to instruction addresses for transferring instructions stored therein as a portion of an instruction stream; instruction sequencing means operatively coupled to the high-speed program memory and including means responsive to the instruction stream from the program memory for decoding the same into control signals; and a math processor and said organizational processor operatively coupled to said instruction sequencer for receiving the control signals, said math processor being operatively connected to said organizational processor and to said random access memory through a single stage output register for enabling direct 
transfer of results from arithmetic operations from the math processor into the random access memory, said random access memory being operatively connected to the math processor and organizational processor for applying input data thereto through a multistage input buffered register for enabling relative data movement between the organizational processor and random access memory, said math processor including data cache memories adapted to be operatively connected to the system staging bus enabling direct loading of the math processor data cache memories with blocks of data as a fast array memory under control of the operating system processor, said math processor including integer processing means for executing numerical calculations and means for transferring results of the numerical calculations to the random access memory through said single stage output register enabling the random access memory to transfer the results of numerical calculations within the data processing system.
18. The user processor of claim 17 wherein the scientific processor integrated math processing means includes a numerical processor section for executing mathematical operations including shifts, logical operations, additions, subtractions, data type conversions, and bit and byte manipulations; and a pair of dual random access, high-speed memory data caches capable of operating independently as fast program stacks and register files, said pair of dual random access, high-speed memory data caches being operatively coupled to said math processor for use of one pair thereof as fast program stacks while the other pair thereof is being loaded for a subsequent operation.
19. The user processor of claim 18 wherein said pair of dual random access, high-speed memory data caches are operatively connected to the system staging bus enabling the system staging bus to load the same with data under control of an operating system processor.
20. The user processor of claim 19 wherein the math processor means is capable of executing integer arithmetic functions including integer and floating point multiplications and division, microcoded intrinsic math functions including sine, cosine, square root, tangent, fast fourier transforms, dot products and fast trigonometric table look-ups.
21. A method for high-speed processing of data requiring vector processing and nonvector processing comprising the steps of loading instruction addresses and data over a system staging bus through a system controller of an integrated, multicomputer data processing system into a user processor having an organizational processor, a random access memory, an instruction cache, and a math processor; transferring data over a system memory bus under control of said operating system processor into the random access memory of the user processor; generating from an operating system processor command signals which are applied to said system controller commanding the system controller to continuously load the user processor with instruction addresses and data enabling the user processor to operate continuously; generating from an operating system processor command signals which are applied to the system controller to command the user processor to execute, using the organizational processor and math processor, arithmetic operations in response to control signals decoded from an instruction stream; and transferring the results of the arithmetic operations from the math processor to the random access memory for further processing.
22. The method of claim 21 further comprising the steps of processing with a front end language processor calculations to be performed, wherein each calculation comprises a mix of mathematical operations, and applying the front end language processed calculations to an input/output data transfer means for input to the operating system processor.
23. The method of claim 21 further comprising the step of loading data and addresses onto the system memory bus from a global memory.
24. The method of claim 23 further comprising the step of loading data and addresses onto the system memory bus from a high-speed input-output processor.
25. The method of claim 22 comprising the step of rearranging the mix of mathematical operations on the front end language processed calculations using a compiler which applies the same to a front end common code generator and optimizer for producing an executable user processor object code enabling the user processor to execute in parallel the optimum mix of mathematical operations.
26. A user processor adapted for use in a data processing system comprising a random access memory for storing data upon which arithmetic operations are to be performed and for storing results of arithmetic operations performed on said data; an organizational processor operatively coupled to said random access memory to store data thereon and to receive data therefrom through a multistage, input buffering means which is capable of storing data fetched from said random access memory by said organizational processor and upon which arithmetic operations are to be performed, said multistage, input buffering means being capable of storing data upon which an arithmetic operation is to be performed and transferring therefrom in response to said organizational processor stored data from a selected stage of said multistage input buffering means; a math processor operatively coupled to said organizational processor for transferring data therebetween and for receiving data stored in a selected stage of said multistage, input buffering means, said math processor being operatively coupled to said random access memory through a single-stage output register and for transferring results of arithmetic operations performed by the math processor through said output register to the random access memory; a program memory having instructions stored at selected locations therein, said program memory including means which are responsive to instruction addresses applied thereto for reading out instructions stored at a plurality of selected locations therein as an instruction stream; and an instruction extracting means operatively coupled to the program memory for receiving said instruction stream, said instruction extracting means further including means responsive to the instruction stream for extracting therefrom instructions for at least one of the organizational processor and math processor and for decoding the same into a control signal and applying the control signal to at least one 
of said organizational processor and said math processor to commence an arithmetic operation; said organizational processor being operative to schedule loading of data from the said multistage, input buffering means into said math processor concurrently with the applying of a control signal by said instruction extracting means into the math processor, said organizational processor being capable of loading data from said selected stage of the multistage, input buffering means into said math processor concurrently with said instruction extracting means applying a control signal to said math processor representing the arithmetic operation to be performed on the loaded data and with the results of a performed arithmetic operation being transferred from the math processor into said single-stage output register enabling the math processor to immediately commence the next arithmetic operation, said organizational processor being responsive to a control signal applied thereto by said instruction extracting means to perform a separate arithmetic operation in parallel including storing the results thereof in said random access memory as said math processor performs an arithmetic operation on the data loaded therein under control of the organizational processor.
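Claims 18, 19, and 26 together describe overlapping computation with data movement: one of a pair of data caches feeds the math processor while the other is being refilled over the staging bus, and each result passes through a single-stage output register so the math processor can begin the next operation immediately. The following is a minimal, hypothetical sketch of that double-buffering discipline, not an implementation from the patent; the function name `pipelined_execute`, the batch structure, and the `op` callback are all illustrative assumptions.

```python
from collections import deque

def pipelined_execute(operand_batches, op):
    """Apply op to operands using a pair of caches and a 1-stage output register.

    Illustrative only: models the cache pair of claim 18 and the
    single-stage output register of claim 26.
    """
    caches = [deque(), deque()]  # pair of data caches (claim 18)
    active = 0                   # cache currently feeding the math processor
    output_register = None       # single-stage output register (claim 26)
    results = []

    # Preload the first cache, as the staging bus would under claim 19.
    batches = iter(operand_batches)
    first = next(batches, None)
    if first is not None:
        caches[active].extend(first)

    for next_batch in list(batches) + [None]:
        # Refill the idle cache while the active one is consumed.
        if next_batch is not None:
            caches[1 - active].extend(next_batch)
        # The math processor drains the active cache.
        while caches[active]:
            operand = caches[active].popleft()
            if output_register is not None:  # drain the register before reuse
                results.append(output_register)
            output_register = op(operand)    # compute into the register
        active = 1 - active                  # swap the roles of the cache pair
    if output_register is not None:
        results.append(output_register)
    return results

# Example: square each operand, fed in two batches.
print(pipelined_execute([[1, 2], [3, 4]], lambda x: x * x))  # [1, 4, 9, 16]
```

In a software model the "overlap" is only structural, since Python executes the refill and the drain sequentially; in the claimed hardware the staging-bus load and the math operation proceed concurrently.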
PCT/US1986/002262 1985-10-24 1986-10-24 Integrated, multicomputer data processing system WO1987002800A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US79087385A 1985-10-24 1985-10-24
US790,873 1985-10-24

Publications (1)

Publication Number Publication Date
WO1987002800A1 true WO1987002800A1 (en) 1987-05-07

Family

ID=25151986

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US1986/002262 WO1987002800A1 (en) 1985-10-24 1986-10-24 Integrated, multicomputer data processing system

Country Status (3)

Country Link
EP (1) EP0244480A1 (en)
JP (1) JPS63501904A (en)
WO (1) WO1987002800A1 (en)


Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US4101960A (en) * 1977-03-29 1978-07-18 Burroughs Corporation Scientific processor
US4123794A (en) * 1974-02-15 1978-10-31 Tokyo Shibaura Electric Co., Limited Multi-computer system
US4181936A (en) * 1976-09-16 1980-01-01 Siemens Aktiengesellschaft Data exchange processor for distributed computing system
US4219873A (en) * 1975-10-15 1980-08-26 Siemens Aktiengesellschaft Process for controlling operation of and data exchange between a plurality of individual computers with a control computer
US4302818A (en) * 1979-07-10 1981-11-24 Texas Instruments Incorporated Micro-vector processor

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPS5437633B2 (en) * 1973-02-12 1979-11-16
JPS5144551B2 (en) * 1973-03-28 1976-11-29
JPS5162838A (en) * 1974-11-29 1976-05-31 Toray Industries NANNENSEIJUSHISOSEIBUTSUNO SEIZOHOHO
JPS55110153A (en) * 1979-02-16 1980-08-25 Mitsubishi Chem Ind Ltd Resin composition


Cited By (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
GB2215878A (en) * 1988-03-23 1989-09-27 Benchmark Technologies Chip-independant numeric subsystem
GB2215877A (en) * 1988-03-23 1989-09-27 Benchmark Technologies Data processing system
GB2217060A (en) * 1988-03-23 1989-10-18 Benchmark Technologies Multi-processor system
GB2221069A (en) * 1988-07-08 1990-01-24 Hitachi Ltd Data management method and system
GB2244828B (en) * 1989-04-24 1993-09-01 Yokogawa Electric Corp Programmable controller
GB2244828A (en) * 1989-04-24 1991-12-11 Yokogawa Electric Corp Programmable controller
EP0455345A3 (en) * 1990-03-30 1993-06-09 Matsushita Electric Works, Ltd. Programmable controller
EP0455345A2 (en) * 1990-03-30 1991-11-06 Matsushita Electric Works, Ltd. Programmable controller
US5371860A (en) * 1990-03-30 1994-12-06 Matsushita Electric Works, Ltd. Programmable controller
EP0475028A2 (en) * 1990-08-29 1992-03-18 Siemens Aktiengesellschaft Procedure for operating a coprocessor in a distributed computer system
EP0475028A3 (en) * 1990-08-29 1993-11-10 Siemens Ag Procedure for operating a coprocessor in a distributed computer system
WO1993003432A1 (en) * 1991-08-08 1993-02-18 William Glen Skimming Data handling apparatus comprising a plurality of stackable layers
GB2259379A (en) * 1991-09-09 1993-03-10 Samsung Electronics Co Ltd Digital data storage system
GB2259379B (en) * 1991-09-09 1996-01-17 Samsung Electronics Co Ltd Digital data storage system
US7617330B2 (en) 2001-04-26 2009-11-10 The Boeing Company System and method for preloading a bus controller with command schedule

Also Published As

Publication number Publication date
EP0244480A1 (en) 1987-11-11
JPS63501904A (en) 1988-07-28

Similar Documents

Publication Publication Date Title
Goodman et al. PIPE: a VLSI decoupled architecture
Annaratone et al. Warp architecture and implementation
CA1119731A (en) Multibus processor for increasing execution speed using a pipeline effect
EP0260409B1 (en) Data processing system with two execution units
US5175863A (en) Signal data processing system having independently, simultaneously operable alu and macu
Pang A simulation study of decoupled architecture computers
US5123095A (en) Integrated scalar and vector processors with vector addressing by the scalar processor
US5261113A (en) Apparatus and method for single operand register array for vector and scalar data processing operations
Benitez et al. Code generation for streaming: An access/execute mechanism
US4837678A (en) Instruction sequencer for parallel operation of functional units
US5121502A (en) System for selectively communicating instructions from memory locations simultaneously or from the same memory locations sequentially to plurality of processing
US5083267A (en) Horizontal computer having register multiconnect for execution of an instruction loop with recurrance
US5276819A (en) Horizontal computer having register multiconnect for operand address generation during execution of iterations of a loop of program code
US5036454A (en) Horizontal computer having register multiconnect for execution of a loop with overlapped code
Gimarc et al. A survey of RISC processors and computers of the Mid-1980s
US5226128A (en) Horizontal computer having register multiconnect for execution of a loop with a branch
WO1987002800A1 (en) Integrated, multicomputer data processing system
Hagiwara et al. A dynamically microprogrammable computer with low-level parallelism
Awaga et al. The mu VP 64-bit vector coprocessor: a new implementation of high-performance numerical computation
US5390306A (en) Pipeline processing system and microprocessor using the system
US5418975A (en) Wide instruction word architecture central processor
KR100267092B1 (en) Single instruction multiple data processing of multimedia signal processor
EP0230383A2 (en) Sequentially performing arithmetic operations on data sets
Smith et al. The astronautics ZS-1 processor
Bernhard Giants in small packages [array processors]

Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A1

Designated state(s): JP

AL Designated countries for regional patents

Kind code of ref document: A1

Designated state(s): AT BE CH DE FR GB IT LU NL SE

WWE Wipo information: entry into national phase

Ref document number: 1987900365

Country of ref document: EP

WWP Wipo information: published in national office

Ref document number: 1987900365

Country of ref document: EP

WWW Wipo information: withdrawn in national office

Ref document number: 1987900365

Country of ref document: EP