DISTRIBUTIVE, DIGITAL MAXIMIZATION FUNCTION ARCHITECTURE AND METHOD
Technical Field The instant invention relates to a computer processor architecture, and specifically to an architecture which provides a maximization structure and method of determining which of several processor nodes in an array contains a maximum data value.
Background Art
Neural networks are a form of architecture which enables a computer to closely approximate human thought processes. One form of neural network architecture enables single instruction stream, multiple data stream
(SIMD) operations which allow a single command to direct a number of processors, and hence data sets, simultaneously.
There are several important practical problems that cannot be solved using existing, conventional algorithms executed by traditional, conventional computers. These problems are often incompletely specified and are characterized by many weak constraints requiring large search spaces.
The processing of primary cognitive information by computers, such as computer speech recognition, computer vision, and robotic control, falls into this category. Traditional computational models bog down to the point of failure under the computational load if they are tasked to solve these types of problems. Yet animals perform these tasks using neurons that are millions of times slower than transistors. Feldman's 100-step rule observes that a human cognitive process taking 500 msec must be accomplished in roughly 100 steps, given a neuron switching time of approximately 5 msec. This implies that there are two vastly different computational models at work. It also suggests that in order to build computers that will do what nervous systems do, the computers should be structured more like nervous systems.
A nervous system, and a neurocomputational computer, is characterized by a continuous, non-symbolic, and massively parallel structure that is tolerant of input noise and hardware failure. Representations, i.e., the inputs, are distributed among groups of computing elements, which independently reach a result or conclusion, and which then generalize and interpolate information to reach a final output conclusion.
Put another way, connectionist/neural networks search for "good" solutions using massively parallel computations of many small computing elements. The model is one of parallel hypothesis generation and relaxation to the dominant, or "most-likely," hypothesis. The search speed is more or less independent of the size of the search space. Learning is a process of incrementally changing the connection (synaptic) strengths, as opposed to allocating data structures. "Programming" in such a neural network is by example.
One particularly useful operation which is performed by neural networks is character recognition, which may be used to input printed or handwritten material into an electronic storage device. It is necessary for the system to recognize a particular character from among hundreds of possible characters. A matrix may be provided for each possible character, which matrix is stored in a processor node, for comparison to a similar matrix which is generated by analyzing the character to be stored. The input matrix is compared, by all processor nodes, to the values contained therein, and the best match, i.e., the maximum correlation between the stored data and the input data, determines how the input matrix will be interpreted and stored electronically.
The problem of how to distribute the comparison function and determination of best match across a processor array with potentially thousands of processor nodes may be solved in a variety of ways. The most efficient way to determine a best match would be to use an analog system, where the magnitude of a current is proportional to the number being maximized. There are, however, a number of electrical problems with this approach. The analog approach has limited precision and will not work well across multiple integrated circuits, i.e., it is usually limited to a single chip and is therefore limited to the number of PNs which may be placed on a single chip.
Disclosure of the Invention An object of the invention is to provide a maximization architecture which determines, in an array of processor nodes, which processor node has the maximum data figure contained therein.
Another object of the invention is to provide a maximization architecture which allows selectable, arbitrarily large precision in determining the maximum.
A further object of the invention is to provide a maximization architecture which analyzes a data value on a bit-by-bit basis to produce a winner-take-all result without necessarily analyzing every bit of every data figure contained in the array of processor nodes.
Another object of the invention is to provide a method of determining which of an array of processor nodes contains a maximum data value. Still another object of the invention is to provide a method of determining, without necessarily examining every bit of each data value, which is the maximum data value contained in an array of processor nodes.
Yet another object of the invention is to provide a method of determining, without necessarily examining every bit of each data value, which is the maximum data value contained in an array of processor nodes which extend over multiple integrated circuits.
The maximization architecture of the invention includes an array of processor nodes wherein each node has a manipulation unit contained therein. Each node is connected to an input bus and to an output bus. A data register is located in each processor node and contains a data figure, which consists of a plurality of segments, or bits, wherein each segment or bit has a value. A maximization mechanism is located in each processor node and includes an arbitration bus, which extends between adjacent processor nodes, and an arbitration mechanism, which is connected to the arbitration bus, for comparing the value of a bit to a signal which is transmitted on the arbitration bus and for subsequently transmitting a comparison indicator. Each processor node includes first and second indicator retainers, which are connected to the arbitration mechanism for retaining the comparison indicator. A flag mechanism is provided and flags the processor node which contains the maximum data figure.
The method of the invention includes providing the previously identified structure and initially setting the flag mechanism of each processor node with a positive flag. The value of a subject bit is examined in the data register to determine whether it has a value of one or zero. If, and only if, the value is zero, and the value of the subject bit of at least one other processor node is one, the flag register of a designated processor node is set with a negative flag. The value of the flag register is placed into the arbitration mechanism and transmitted to the indicator retainers. The steps are reiterated until only one processor node has a positive flag. The remaining positive flag indicates that the particular processor node contains the maximum data figure.
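The method steps above can be sketched in software. The following is a minimal simulation, assuming 8-bit data figures and four processor nodes; the names (maximize, mxflg, NUM_PNS) follow the description above but are hypothetical, and this is an illustrative sketch, not the hardware implementation disclosed later.

```c
#define NUM_PNS  4
#define NUM_BITS 8

/* Return the index of the processor node holding the maximum data
 * figure, examining one bit per iteration from the MSB down. */
static int maximize(const unsigned char data[NUM_PNS])
{
    int mxflg[NUM_PNS];
    for (int pn = 0; pn < NUM_PNS; pn++)
        mxflg[pn] = 1;                          /* set all flags positive */

    for (int bit = NUM_BITS - 1; bit >= 0; bit--) {
        /* Emulate the arbitration bus: wired-OR of the subject bit
         * of every PN still holding a positive flag. */
        int bus = 0;
        for (int pn = 0; pn < NUM_PNS; pn++)
            if (mxflg[pn] && ((data[pn] >> bit) & 1))
                bus = 1;

        /* A PN whose subject bit is zero drops out if, and only if,
         * some other PN drove a one onto the bus. */
        for (int pn = 0; pn < NUM_PNS; pn++)
            if (mxflg[pn] && bus && !((data[pn] >> bit) & 1))
                mxflg[pn] = 0;
    }

    for (int pn = 0; pn < NUM_PNS; pn++)        /* first surviving flag wins; */
        if (mxflg[pn])                          /* ties fall to a tie-breaker */
            return pn;
    return -1;
}
```

Note that the loop runs over every bit position, but a data figure is no longer examined once its flag is cleared, so not every bit of every figure participates in the arbitration.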
These and other objects and advantages of the invention will become more fully apparent as the disclosure which follows is read in conjunction with the drawings.
Brief Description of the Drawings Fig. 1 is a schematic diagram of a broadcast communication pattern of communication nodes contained within processor nodes of the invention.
Fig. 2 is a block diagram of a portion of an array of processor nodes of the invention containing the maximization architecture of the invention.
Fig. 3 is a block diagram of a single processor node of the invention.
Fig. 4 is a block diagram of the maximization function of the invention.
Best Mode for Carrying Out the Invention Referring initially to Fig. 1, broadcast patterns in an array 10 of processor nodes (PNs) which contain connection nodes (CNs) 0-7 (12, 14, 16, 18, 20, 22, 24 and 26, respectively) are depicted. A CN is a state associated with an emulated node in a neural network located in a PN. Each PN may have several CNs located therein. The CNs are often arranged in "layers," with CN0 - CN3 comprising one layer, while CN4 - CN7 comprise a second layer. The array depicted would generally include four PNs, such as PN0, PN1, PN2 and PN3 (28, 30, 32 and 34, respectively), depicted in Fig. 2, with CN0 and CN4 being located in PN0, CN2 and CN6 being located in PN2, etc. There may be more than two layers of connection nodes in any one processor node or in any array of processor nodes. A typical array of processor nodes may include hundreds or thousands of individual PNs.
The connection nodes operate in what is referred to as a broadcast hierarchy, wherein each of connection nodes 0-3 broadcasts to each of connection nodes 4-7. An illustrative technique for arranging such a broadcast hierarchy is disclosed in U.S. Patent No. 4,796,199, NEURAL MODEL INFORMATION-HANDLING ARCHITECTURE AND METHOD, to Hammerstrom et al., January 3, 1989, which is incorporated herein by reference. Conceptually, the available processor nodes may be thought of as a "layer" of processors, each executing its function (multiply, accumulate, and increment weight index) for each input, on each clock, wherein one processor node broadcasts its output to all other processor nodes. By using the output processor node arrangement described herein, it is possible to provide n² connections in n clocks using only a two-layer arrangement. Known, conventional SIMD structures may accomplish n² connections in n clocks, but require a three-layer configuration, or 50% more structure. The boundaries of the individual chips do not interrupt broadcast through processor node arrays, as the arrays may span as many chips as are provided in the architecture.
All of the PNs in array 10 are connected to an input bus 36 and an output bus 38. Each processor node includes a manipulation unit 40, which is depicted in greater detail in Fig. 3 and will be described later herein.
The maximization architecture of each PN is depicted generally at 42. The maximization architecture includes a data register 44 which contains a data figure. Subscripts are used in connection with a reference number to designate the particular PN in which a structure is located. For instance, each PN includes a data register 44; reference numeral 28 refers to PN0, which includes data register 44₀ therein. The data figure consists of a plurality of segments or bits, each segment or bit having a value. The technique which is used to arrive at the value in data register 44 will be described subsequently herein. Each PN also includes a left flip-flop 46 and a right flip-flop 48. The left flip-flops are also referred to herein as a first indicator retainer means, while the right flip-flops are referred to herein as a second indicator retainer means. As depicted in Fig. 2, each left flip-flop is connected to the data register in the leftmost, immediately-adjacent processor node by a connection 50. Likewise, the right flip-flops are connected to the data register of the rightmost, immediately-adjacent processor node by a connection 52. Connections 50 and 52 comprise what is referred to herein as an arbitration bus 53. It should be appreciated that while the connections are shown extending directly between the appropriate flip-flops and the data register, the arbitration bus may be formed as part of the input/output bus structure.
Each processor node includes an OR gate 54, which is also referred to herein as arbitration means. The inputs to OR gates 54 are designated by reference numerals 56, 58 and come from connections 50, 52 respectively. The outputs 60 of OR gates 54 are connected to the left and right flip-flops.
A max flag register 62, also referred to herein as flag means or maximum value indicator, receives and holds a value based on the comparison and arbitration which takes place amongst the maximization architectures of the various processor nodes.
Turning now to Fig. 3, the components of manipulation unit 40 will be described in greater detail. Manipulation unit 40 includes an input unit 62 which is connected to input bus 36 and output bus 38. A processor node controller 64 is provided to establish operational parameters for each processor node. An addition unit 66 provides for addition operations and receives input from input unit 62. A multiplier unit 68 is provided and is connected to both the input and output buses and to addition unit 66.
A register unit 70 contains an array of registers, which may include data register 44 as well as flag register 62. In the preferred embodiment, each processor node includes an array of 32 16-bit registers. A number of other arrangements may be utilized.
A weight address generation unit 72 is provided and computes the next address for a weight memory unit 74. In the preferred embodiment, the address may be set in one of two ways: (1) by a direct write to a weight address register or (2) by asserting a command which causes the contents of a weight offset register to be added to the current contents of the memory address register, thereby producing a new address.
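The two addressing modes can be sketched as follows; the names (wagu_t, wt_addr, wt_offset) and the 16-bit register width are assumptions for illustration, not details taken from the actual implementation.

```c
typedef struct {
    unsigned short wt_addr;    /* weight memory address register */
    unsigned short wt_offset;  /* weight offset register         */
} wagu_t;                      /* weight address generation unit */

/* Way (1): a direct write to the weight address register. */
static void wagu_write(wagu_t *u, unsigned short a)
{
    u->wt_addr = a;
}

/* Way (2): add the offset register to the current contents of the
 * address register, producing the next weight memory address. */
static void wagu_step(wagu_t *u)
{
    u->wt_addr += u->wt_offset;
}
```

Repeated assertions of the step command walk the address through weight memory in strides of the offset, which suits sequential weight fetches.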
An output unit 76 is provided to store data prior to the data being transmitted on output bus 38. Output unit 76 may include an output buffer, which receives data from the remainder of the output unit prior to the data being transmitted on the output bus. The data is transmitted to output bus 38 by means of one or more connection nodes, such as CN0 or CN4, for instance, which are part of the output unit of PN0. While only single input and output buses are depicted in the drawings for the sake of simplicity, it should be appreciated that multiple input and/or output buses may be provided and, if so, the various components of the manipulation unit will have connections to each input and output bus. Additionally, the input/output bus may be arranged to connect only to the input and output units, and a separate internal PN bus may be provided to handle communications among the various components of the PN.
Returning now to Fig. 2, the operation of processor node array 10 as it determines which processor node has a maximum data figure contained therein will be described. This description begins with the assumption that each PN in the array has a data figure therein, which is the result of the manipulation of data by the PN, and that the maximization function of the invention will determine which PN in the array has the maximum data figure contained in its data register.
Turning now to Fig. 4, the maximization function of the invention is depicted generally at 78. The first step in the maximization function is to set max flag (mxflg) 62 to "1" in each PN, block 80. Next, flip-flops 46 and 48, also referred to as flag registers, are cleared, i.e., set to zero, block 82.
At this point, the most significant bit (MSB), i.e., the leftmost bit, in data registers 44 is examined, block 84. Block 86 asks whether the most significant bit is equal to one. If the answer to block 86 is "yes," the value of mxflg is loaded into OR gate 54, which results in a "1" being propagated through the right and left flip-flops to all of the processor nodes in the array, block 88. This step enables any processor node that answered "no" to block 86 to inquire whether any other most significant bit was equal to one, block 90. If block 90 is answered in the affirmative, the mxflg is set to zero for that designated processor node, block 92.
If the answer to block 90 is "no", the branches of the block diagram rejoin, and the array determines if a counter is greater than the number of bits in the data register, block 94. If the answer is "no", the data register is shifted left, block 96, the counter is incremented, block 98, and the routine returns to block 82 where the flag registers are cleared.
The function is iterated until such time as only one mxflg is equal to "1", or the counter is greater than the number of bits originally located in data registers 44. A "yes" answer to block 94 indicates that there are at least two PNs which have equal values in them, all of the bits in data registers 44 having been compared and two or more of the registers containing equal values. In this situation, a tie-breaker routine may be executed, block 100. Tie-breaker routine 100 may be determined by individual programmers who decide, according to the criteria being measured, which PN will be determined to hold the value of interest and, consequently, the data of interest.
As previously noted, electrical delays preclude the use of a single wire or bus to connect all of the PNs. The use of a single wire or bus is simply not sufficient to overcome the electrical problems. Neither can a single wire or bus be used to cross multiple chips.
It is possible to connect all of the processor nodes together on a single line or bus, precharge the line or bus in a single phase with the input matrix, examine all of the processor nodes for the best match, and then discharge the line or bus on the next phase. This solution may be operable with a few processor nodes, but with hundreds or thousands of processor nodes, the physical organization required becomes unwieldy. The first difficulty is that it will take more than one phase, or clock, for a single processor node to discharge the line. The second difficulty is that, as other functions are carried out by the processor nodes, internal counters and indicators may change values. Finally, the required circuitry becomes too large to quickly conduct the signals synchronously.
The motivation for providing a right and a left flip-flop in each PN is to overcome the electrical problems associated with a single connection. Each flip-flop acts similarly to an axon in an animal neuron: it is set by the first signal, and the signal "remains" in the flip-flop until the flip-flop is instructed to change the signal. Therefore, when one PN propagates a "1", the "1" is propagated indefinitely to all other PNs in the network until the mechanism is cleared.
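The domino-like propagation can be sketched in software. The following is a hypothetical simulation (the names settle, leftff, rghtff are illustrative, and one neighbor-to-neighbor hop per pass stands in for the actual dynamic-MOS signaling); it shows that a single driving PN eventually sets both flip-flops in every PN of the array.

```c
#define NUM_PNS 4

/* Propagate latched 1's left and right until the array settles: each
 * PN ORs in whatever its immediate neighbors hold, latches it, and
 * passes it on, emulating one large distributed OR gate. */
static void settle(int leftff[NUM_PNS], int rghtff[NUM_PNS])
{
    for (int step = 0; step < NUM_PNS; step++) {   /* enough passes to cross the array */
        for (int pn = 0; pn < NUM_PNS; pn++) {
            if (pn > 0)
                rghtff[pn] |= rghtff[pn - 1] | leftff[pn - 1];
            if (pn < NUM_PNS - 1)
                leftff[pn] |= leftff[pn + 1] | rghtff[pn + 1];
            /* within a PN the two flip-flops reinforce each other */
            leftff[pn] |= rghtff[pn];
            rghtff[pn] |= leftff[pn];
        }
    }
}
```

Because each flip-flop latches the first 1 it sees, every connection is a short neighbor-to-neighbor link, yet the result is visible to the entire array.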
To briefly follow the maximization function through a determination of which of PNs 0-3 has the largest data figure: in the first iteration, mxflg₃ is set to zero, a negative flag, because the MSB in data register 44₃ is zero, while the MSBs of the data registers in PNs 0-2 are "1". OR gates 54₀, 54₁ and 54₂ will all receive a "1" from their respective flag registers, and a signal which is representative of "1" will be propagated to all of the left and right flip-flops in all of the PNs in array 10. Because the counter is still low, block 94, i.e., the routine has been iterated fewer times than there are bits in the data register, the segments of the data registers will be shifted, block 96, the counter incremented, block 98, and the max function iterated again beginning with block 82, clearing the flip-flops.
In the second iteration, all of the remaining MSBs are 0, so no mxflgs are changed. It should be remembered that while PN3 is no longer participating in the examination step, block 84, it is still propagating right and left through the flip-flops. At the end of the second iteration, the data registers will shift left again and the third iteration will begin.
In the third iteration, PN0 will set its mxflg to zero. PN1 will set its mxflg to zero in the fourth iteration, which will result in the end of the max function. The function will be iterated until block 94 is answered "yes". If there is no tie, as in the case being described, the tie-breaker routine, block 100, if present, will ignore the data, and the maximization function will be ended, block 102. Thus, PN2 will be determined to have the largest data figure by the maximization function and architecture of the invention. This means that PN2 contains the best match to the input data and represents the particular matrix value being sought.
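The actual register contents are not given above, so the following sketch assumes hypothetical 4-bit data figures chosen so that PN3 drops out first and PN2 ultimately holds the maximum; dropped[pn] records the iteration in which each PN cleared its mxflg (0 meaning the flag survived).

```c
/* Hypothetical 4-bit data figures: 1001, 1010, 1011, 0111. */
static const unsigned vals[4] = {0x9, 0xA, 0xB, 0x7};

/* Record, for each PN, the iteration (1-4) in which its mxflg was
 * cleared; 0 means the flag survived and the PN holds the maximum. */
static void trace(int dropped[4])
{
    int mxflg[4] = {1, 1, 1, 1};
    for (int pn = 0; pn < 4; pn++)
        dropped[pn] = 0;

    for (int iter = 1; iter <= 4; iter++) {
        int bit = 4 - iter;                 /* MSB first, shifting left */
        int bus = 0;                        /* wired-OR over flagged PNs */
        for (int pn = 0; pn < 4; pn++)
            if (mxflg[pn] && ((vals[pn] >> bit) & 1))
                bus = 1;
        for (int pn = 0; pn < 4; pn++)
            if (mxflg[pn] && bus && !((vals[pn] >> bit) & 1)) {
                mxflg[pn] = 0;
                dropped[pn] = iter;
            }
    }
}
```

With these assumed values, PN3 drops in the first iteration, no PN drops in the second (all remaining MSBs are 0), PN0 drops in the third, PN1 drops in the fourth, and PN2's flag survives.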
The following code is a simplification of the code that describes the actual CMOS implementation of the PN structure in a neurocomputer chip.
The code shown below is in the C programming language, embellished by certain predefined macros. The code is used as a register transfer level description language in the actual implementation of the circuitry described here. Bolded text indicates a signal, a hardware or firmware component, or a phase or clock cycle.
The ph1 and ph2 variables simulate the two phases in the two-phase, non-overlapping clock used to implement dynamic MOS devices.
The post-fix "_D" on each signal name means a delayed version of the signal, "_B" means a bus (more than one signal line), and "_1" means a dynamic signal that is only valid during ph1. These post-fixes can be combined arbitrarily. The maximization arbitration logic, the maximization function test signal (rmxrd), and the value in the mxflg register 62 must be set explicitly by the programmer before beginning the maximization function.

    maxin = leftff OR rghtff;

leftff and rghtff are the flip-flops which are set by signals, frmleft and frmrght, arriving from the PNs directly to the left and to the right of the specific PN. maxin is then sampled and the values on the arbitration bus (50, 52) tested:

    if ( (ph2) ANDb (acval) ANDb (rmxrd_2) ) {
        if (mxflg) {
            if ( (logneg==0) ANDb (maxin==1) ) mxflg=0;
        }
    }

acval indicates that the max function test signal, rmxrd_2, is valid. logneg is the most significant bit of the logic/shifter output. The shifter is used to shift the value being checked left by one bit each iteration. mxflg indicates whether this PN is still in the running for the maximum value. If maxin is 1 and logneg is zero, then the PN knows that there is at least one other PN that has a larger number (a 1 in the same position in which it has a 0).

    if ( ( (ph1) ORb (ph2) ) ANDb (acval) ANDb (clrdn) ) {
        rghtff=0; leftff=0;
        toleft_B=0; torght_B=0;
        frmleft_B=0; frmrght_B=0;
    }

clrdn clears flip-flops 46, 48, which indicate the state of other PNs in the array to the PN in which flip-flops 46, 48 are located. Signals are provided to the PNs to the left and right of the specific PN. The rghtff and leftff signals are used to emulate a large OR gate and cooperate with OR gates 54. That is, they are cleared, and then whenever any PN generates a signal, which is done whenever data register 44 has a 1 as the most significant bit of the data figure being maximized (and shifted), the flip-flops are set by the PN with the 1, which propagates a signal along the arbitration bus. The architecture also propagates signals left and right, like dominos. The result is that if any PN drives its flip-flops, all flip-flops in all PNs are set. Each PN can read its own flip-flops to determine if anybody drove the "OR" function, which indicates that some PN has a 1 as the most significant bit in its data register. The following is a description of the structure and signals which are used to propagate values from the OR gates in both directions along the arbitration bus.

    if ( (mxen) ANDb (glbarb==0) ANDb (seqarb==0) ) {
        rghtff OR= BIT(frmrght_B,0);
        leftff OR= (BIT(frmleft_B,0) AND INV(mxsw));
        leftff OR= rghtff;
        rghtff OR= leftff;
        if ( (ph1) ANDb (acval) ANDb (rmxdrv_1) ANDb (logneg) ANDb (mxflg) ) {
            rghtff=1; leftff=1;
        }
        torght_B OR= rghtff;
        toleft_B OR= (leftff AND INV(mxsw));
    }
If mxflg 62 is set, and the rmxdrv_1 signal is asserted and valid (acval), and logneg == 1, indicating that the most significant bit of the logic/shifter unit is 1, then the structure propagates 1's to the left and right PNs. torght_B and toleft_B are signals which propagate 1's through to the left or right along the arbitration bus. mxsw is used to disconnect one region from another so that local maximization functions may be performed.
The provision of the maximization architecture of the processor node array provides a structure in which all PNs in the array have general information about the state of other PNs in the array. As previously noted, one way to provide this information would be to charge a bus or line which connects to all PNs and let all of the PNs compare the value on the line or bus with the value contained in the PN data register. However, this requires an extraordinary amount of energy and, in a fault-tolerant system, such as a neural network, electrical delays and other electrical considerations may affect the integrity of the signal on such a widespread bus. By providing the flag registers, or flip-flops, in each PN, data is shared amongst all of the PNs in the array through the use of very short connections, the right and left flip-flops acting as latches, where the output goes to a known state at a given time, thereby providing synchronous data to all of the processor nodes in the array. In the event that there are more numbers to be maximized than there are processor nodes, a maximization can be performed upon several data figures contained in a single PN to determine which of the data figures has the largest value, and the maximization function amongst the PNs can then be run to determine which PN contains the maximum data figure.
The maximization architecture and function may also be used to determine a minimum data figure by providing an additional step of subtracting each segment value from 1, i.e., complementing each bit of the data figure, and then performing the previously described maximization function. Although a preferred embodiment of the architecture and a preferred method of performing the maximization function have been described, it should be appreciated that variations and modifications may be made thereto without departing from the scope of the invention as defined in the appended claims.
Industrial Application
Processors constructed according to the invention are useful in neural network systems which may be used to simulate human brain functions in analysis and decision making applications, such as character recognition and robotic control.