US20040254965A1

US20040254965A1 - Apparatus for variable word length computing in an array processor

Info

Publication number: US20040254965A1
Application number: US10/469,518
Authority: US
Inventors: Eric Giernalczyk; Malcolm Stewart
Original assignee: ATSANA SEMICONDUCTOR CORP
Current assignee: MtekVision Co Ltd
Priority date: 2001-03-02
Filing date: 2002-03-04
Publication date: 2004-12-16
Also published as: EP1384158A2; US7272691B2; EP1381957A2; ATE404923T1; WO2002071246A3; CA2478570A1; AU2002252863A1; WO2002071239A2; US20070118721A1; CA2478573A1; EP1384158B1; AU2002240742A1; CA2478573C; CA2478571A1; WO2002071246A2; DE60228223D1; WO2002071240A3; EP1384160A2; AU2002238325A1; WO2002071239A3

Abstract

A computational unit comprises a processor having a plurality of processing elements, each having an arithmetic logic unit, and a controller for controlling the processor elements. The processor can provide a respective bit of a multiple bit word to each of the processor elements and enables signals to be transmitted between the arithmetic logic units to enable the units to perform a parallel operation on the bits of the multiple bit word. Extension circuitry is provided for selectively coupling one or more computational units together to combine their parallel processing capability.

Description

FIELD OF THE INVENTION

The invention generally relates to a processing apparatus and more particularly relates to a processing apparatus that contains a number of processing units capable of operating in parallel.

BACKGROUND

Designing a modern microprocessor is a complex task that demands a careful balance between cycle time, instruction set architecture, instruction latency (otherwise known as cycles-per-instruction), and die area cost. Many traditional microprocessors are designed to execute a single instruction at a time. The processor executes instructions in a serial fashion. This paradigm generally implies a single processing core. The performance of such microprocessors has been improved by two basic approaches. The first of these is the data path width. Over time the data path width has increased from the conceptual 1-bit Turing Machine to some of the latest 128-bit processors. Second, the performance of the processor has also been improved by increasing the rate at which instructions are executed, i.e. the clock frequency has been increased. This increase has taken the logic from 33 MHz to 1.4 GHz in approximately ten years. While the above developments have provided considerable increases in the performance of “serial” microprocessors, there are tasks to which they are not well suited.

One task to which serial processors are not well suited is the manipulation of multi-media data. Multi-media data is an example of so-called parallel data. Parallel data is data where the individual data are independent of one another. Such data can be processed in parallel as the manipulation of one datum does not require results from the manipulation of other data. It is also the case that multi-media data generally only requires simple manipulation. This implies that the complexity of the processor can be reduced with the absolute number of processors increasing to process the data in parallel. This has resulted in an evolution towards word-parallel computing, which offers a better balance between cycle time and instruction latency.

One of the major shortcomings of such an approach is that for certain types of processing, namely multimedia applications, very wide data paths are very often underutilized. To design around this, SIMD extensions were introduced, which divide the existing data path into a number of narrower data paths, such that an instruction can be executed on a number of data samples concurrently. One widely known example of this is the MMX processing unit in Intel's Pentium processor, which is applied to a 64-bit data path.
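The division of a wide data path into narrower ones can be pictured with a minimal sketch that adds eight independent 8-bit samples packed into one 64-bit word. The sketch is written in Python purely for exposition. The name `packed_add_u8` and the wrap-around (non-saturating) behaviour are assumptions of this illustration and are not taken from any instruction set reference.

```python
LANES = 8                          # number of sub-words in the 64-bit data path
LANE_BITS = 8                      # width of each sub-word
LANE_MASK = (1 << LANE_BITS) - 1


def packed_add_u8(a: int, b: int) -> int:
    """Add the eight unsigned 8-bit lanes of a and b independently (wrap-around)."""
    result = 0
    for lane in range(LANES):
        shift = lane * LANE_BITS
        x = (a >> shift) & LANE_MASK
        y = (b >> shift) & LANE_MASK
        # The carry out of a lane is discarded so it never reaches the next lane.
        result |= ((x + y) & LANE_MASK) << shift
    return result
```

In this model the lane width is fixed when the code is written, which is the inflexibility that a variable data path width is intended to avoid.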

Single Instruction Multiple Data (SIMD) processing is a processing paradigm that is suited to the processing of parallel multi-media data. The concept of SIMD processing architectures has been known for some time. However, historically these processing architectures have encountered problems including high power consumption. This high power consumption is mainly a result of the line resistance associated with the large number of interconnects associated with an array of SIMD processors. The resistance associated with this power consumption also reduces the speed of communications.

Interconnect resistance can also be incurred through the arrangement of SIMD processors and memory. As SIMD processing is generally performed with an array of processors, there can be differences in the interconnect length between the processors and the memory. One method of mitigating the above issues has been the tight integration of processors and memory as outlined in U.S. Pat. No. 5,956,274 issued on 21st Sep. 1999 to Duncan Elliott, et al. ('274 patent).

The '274 patent generally teaches the placement of processors directly adjacent to a memory array and more particularly teaches a configuration that reduces the memory column to processor ratio. The arrangement taught in the above patent greatly reduces resistance and delay issues. Further the arrangement taught in the '274 patent reduces timing problems that are a result of uneven interconnect line length.

SIMD processing architectures are often implemented with 1-bit processors, which can result in bit serial processing. This is the case in the '274 patent. However, bit serial processing introduces its own processing problems including: realignment of the data for bit-serial processing (commonly referred to as corner-turning), non-uniform cycle execution time, and increased instruction latency. Another approach for converging on an optimal architecture is limiting the scope of applications, which reduces the flexibility of such a processor.

A desired architecture for SIMD processing includes a balance between flexibility and efficiency. Processing units with a variable, dynamically re-configurable data-path width would allow for greatly improved flexibility at minimal impact on efficiency. It would be advantageous to be able to adjust the width of the data path of such a processing element to the width of the data word required by a given application, while maintaining word-parallel instruction execution.

It is also often the case in SIMD processing that a word whose length is greater than the bit width of the processor must be aligned such that processing occurs in a serial manner with the word now being processed serially through a processing element. This requirement once again forces the processing to be limited by the throughput of the processing element.

Therefore there is a need for a means for creating SIMD based processing units whose data path width can be varied to match the word length of the data word to be processed.

SUMMARY OF INVENTION

According to one aspect of the present invention, there is provided a circuit comprising a processor having a plurality of processor elements, each having an arithmetic logic unit, and a controller for controlling said processor elements, means for providing a respective bit of a multiple bit word to each of the processor elements, and transmission means for enabling signals to be transmitted between said arithmetic logic units, to enable the units to perform a parallel operation on the bits of the multiple bit word.

Advantageously, the processor provides an arrangement of processor elements which can operate together in parallel to process multiple bit data.

In one embodiment, each arithmetic logic unit has an output and an input, and said transmission means includes means for coupling the output of each ALU directly to the input of its adjacent ALU which processes a higher order bit.

In another embodiment, the transmission means includes means for coupling the output of the ALU for processing the most significant bit of the multiple bit word directly to the input of the ALU for processing the least significant bit of the multiple bit word.

In another embodiment, the processor element for processing the MSB of the multiple bit word has an input, and the transmission means includes coupling means for coupling the output of the ALU for processing the LSB directly to the input of the MSB processing element.

Another aspect of the invention provides a circuit and architecture of processing elements such that the effective arrangement of processing elements can be dynamically altered such that the data path width matches the word length of the data word to be processed.

According to another aspect of the present invention, there is provided a circuit comprising a plurality of computational units, each having at least one processor element and extension circuitry for switchably enabling the transmission of signals between the computational units to enable the units to perform a parallel operation on a multiple bit word wherein at least one bit of said word is provided to each computational unit.

Advantageously, this arrangement enables any number of computational units, each of which is able to process at least one bit of data at a time, to be coupled together to parallel process a multiple bit word, for example whose length is greater than the word that can be parallel processed by an individual computational unit.

In one embodiment, the circuit comprises a first CU and a second CU, said first CU having a plurality of processor elements each having an arithmetic logic unit arranged together for performing parallel operations on multiple bit data, and wherein said extension circuitry is arranged to enable an MSB output from the MSB ALU of the first CU to be transmitted to the input of the LSB ALU of the second CU.

BRIEF DESCRIPTION OF FIGURES

FIG. 1 shows a schematic diagram of a data processor according to an embodiment of the present invention;
FIG. 2a is a block diagram illustrating computational units and connections therebetween according to one embodiment of the invention;
FIG. 2b is a schematic diagram of another embodiment of the invention;
FIG. 3 shows a schematic diagram of a data processor including extension circuitry, according to one embodiment of the invention;
FIG. 4 is a schematic diagram of a data processor according to an embodiment of the invention; and
FIG. 5 shows a schematic diagram of an array of computational units and various schemes for interconnecting the units, according to an embodiment of the present invention.

DETAILED DESCRIPTION OF EMBODIMENTS OF THE INVENTION

FIG. 1 shows a data processor according to an embodiment of the present invention. The data processor 1 comprises a computational unit 3 having a plurality of SIMD based processing elements 5, 7, 9, 11, each of which has access to a memory 13. Each processor element 5, 7, 9, 11 has an arithmetic logic unit (ALU) 15, 17, 19, 21, each having an input port 23 and an output port 25. The input port 23 of each ALU is directly coupled to the output port of the neighboring ALU to its right to enable data (e.g. a bit) to propagate from the output 25 to the input 23 of adjacent ALUs. Each processor element further comprises one or more registers 27 and a multiplexer 29 for switchably coupling one or more of a plurality of inputs to a register. In this embodiment, inputs to the multiplexer include the output from its local ALU, the outputs from its neighboring ALUs to its left and right, and an output from the memory 13.
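The coupling of each ALU input to the output of its right-hand neighbour can be pictured with a ripple-style addition in which each one-bit element handles one bit position. The following is a minimal sketch only: it assumes that the propagated bit is a carry (the description says only that a bit propagates), and the name `ripple_add` is hypothetical.

```python
def ripple_add(a_bits, b_bits, carry_in=0):
    """Add two words held one bit per processor element.

    a_bits and b_bits are lists ordered MSB first, matching a left-to-right
    arrangement of elements with the LSB element at the right.
    Returns (sum_bits, carry_out).
    """
    if len(a_bits) != len(b_bits):
        raise ValueError("operands must have the same number of bits")
    sum_bits = [0] * len(a_bits)
    carry = carry_in
    # Start at the right most (LSB) element and propagate towards the left.
    for i in reversed(range(len(a_bits))):
        total = a_bits[i] + b_bits[i] + carry
        sum_bits[i] = total & 1
        carry = total >> 1  # passed to the input of the left-hand neighbour
    return sum_bits, carry
```

For example, `ripple_add([0, 1, 1, 1], [0, 0, 0, 1])` models 7 + 1 on four one-bit elements.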
The data processor 1 has a control circuit 31 (which is also referred to herein as a boundary circuit) having a first input port 33 which is coupled to the output port 25 of the ALU 15 of the first (i.e. left most) processor element 5, a first output port 35 coupled to the input port 23 of the ALU 21 of the last (i.e. right most) processor element 11, a second input port 37 coupled to the output port 25 of the ALU 21 of the right most processor element 11, and a second output port 39 coupled to one of the input ports of the multiplexer 29 of the left most processor element 5.
In this embodiment, the control circuit 31 further includes a third input port 41 for receiving external data, for example a set/reset bit (SRB). The control circuit also has a third output port 43 for outputting data which is broadcast to all of the processor elements 5, 7, 9, 11. The broadcast data may for example comprise a set/reset bit, for example received at the third input port 41, a bit output from the first ALU 15, received at the first input port 33, or a bit output from the last ALU 21 and received at the second input port 37.
The control circuit 31 controls the transmission of data to one or more processor elements within the processor to enable the processor to operate on multiple bit data (or words). For example, the control circuit 31 may be arranged to enable a barrel shift, in which data is shifted from one processor element to its adjacent processor element, either to the right or to the left, and data from the end most processor element towards which the shift is directed is fed by the control circuit 31 to the processor element at the opposite end. Thus, for a left barrel shift, the control circuit 31 is configured to output the data received at the first input port 33 from the ALU of the left most processor element 5 to the input 23 of the ALU 21 of the right most processor element 11. Conversely, for a right barrel shift, the control circuit 31 is configured to pass the data received at its second input port 37 from the output 25 of the right most ALU 21 to the left most processor element 5. As mentioned above, the control circuit 31 may also broadcast data which is common to two or more processor elements to the appropriate processor elements, so that for example the data is transmitted to each processor element simultaneously.
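The barrel shift described above can be sketched in software as follows. This is an illustrative model only, not the circuit itself: the word is held as a Python list with index 0 standing for the left most processor element, and the function names are hypothetical.

```python
def barrel_shift_left(bits):
    """One left barrel shift; bits[0] is the left most processor element."""
    # The bit leaving the left most element is routed (in the description,
    # by the control circuit) to the right most element.
    return bits[1:] + bits[:1]


def barrel_shift_right(bits):
    """One right barrel shift; bits[-1] is the right most processor element."""
    # The bit leaving the right most element re-enters at the left most element.
    return bits[-1:] + bits[:-1]
```

A left shift followed by a right shift returns the original word, since no bit is lost at either end.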
Thus, the data processor 1 is configurable for processing multiple bit data. Each processor element may comprise a single bit processor, and the data processor may contain any number of one-bit processor elements, for example 2, 4, 8, 16, 32 or any other number.
In operation, the data processor may be arranged such that each of the processor elements receives a single bit of a multiple bit word, e.g. from the memory 13, or from any other source, such as a different memory or device, so that, for example, the left most processor element 5 receives the most significant bit (MSB) and the right most processor element receives the least significant bit (LSB) of the word to be processed. If read from the memory 13, the multiple bit data may be stored in the memory in such a way that the processor elements receive the data bits in parallel. For example, the memory may contain a plurality of memory segments, each having a read access port 8, which is coupleable to a respective processor element. Each data bit may be stored in a different segment and thereby read out of memory in parallel into the processor elements. After the data has been read into the processor elements, the processor elements are controlled to process the data as a multiple bit word (i.e. a word having an MSB and an LSB). In one embodiment, the data processor may be incorporated in a SIMD processor, and each processor element may be controlled in parallel by an array controller 45. The processor elements may be adapted to be capable of performing multiple step operations and able to store the intermediate results of these operations, thereby reducing the frequency of memory accesses. After a process has been completed, for example on one or more data words, the result of the process (e.g. a multi-bit word) may be written to memory, e.g. the local memory associated with each processor element, another portion of memory or output to another device. Write operations by the processor elements may be controlled in parallel, so that the word bits are output as a unitary word from the processor. Write operations from each processor element may be coordinated and controlled by the array controller 45.
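The storage of each bit of a word in a different memory segment, so that the bits can be read out in parallel, can be sketched as below. The layout (one Python list per segment, MSB in segment 0) and the function names are assumptions made for this illustration.

```python
def store_word(segments, row, word, width):
    """Write one bit of `word` into each memory segment at `row`, MSB in segment 0."""
    for pe in range(width):
        segments[pe][row] = (word >> (width - 1 - pe)) & 1


def load_word_bits(segments, row):
    """Return the bits that the processor elements would receive in parallel."""
    return [segment[row] for segment in segments]


def bits_to_int(bits):
    """Reassemble an MSB-first list of bits into an integer."""
    value = 0
    for bit in bits:
        value = (value << 1) | bit
    return value
```

A short usage example with four segments of four rows each:

```python
segments = [[0] * 4 for _ in range(4)]
store_word(segments, 2, 0b1010, 4)
bits = load_word_bits(segments, 2)   # one bit per processor element
```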
In contrast, in the SIMD processor architecture disclosed in U.S. Pat. No. 5,956,274 (Duncan Elliott), in which a processor element is provided under each column of memory, multiple bit data can only be processed serially by a single processor element, and therefore the data must be read from the memory in series, and the processing element can only process one bit of the serial data at a time. Thus, for write operations to memory, Elliott's processor either requires additional circuitry, such as a 2D array of registers to enable data to be turned before being written to memory, or requires the rotation of data into a single column to be performed by a number of processor elements equal to the number of bits in the data.
Returning to the present embodiment, the control circuit 31 has a write control input port 47 for receiving a write control signal, e.g. a write enable (IWE) signal from the array controller 45. The control circuit 31 may control write operations in response to both the write control signal from the array controller and another state associated with the data processor, for example a state recorded in the control circuit 31. The control circuit 31 has a write control output port 49 for outputting a write enable signal, which in this embodiment may be passed or broadcast to each of the write enable lines 12 (which may be coupled together by a line 51) associated with each I/O memory port 8 for enabling a bit of a multiple bit word to be output by a respective processor element to its respective I/O port for storage in the memory 13. In another embodiment, circuitry may be provided for directing data output by the processor elements to another part of memory, or to another device, and circuitry may enable the data to be selectively directed to one of the local memory and another destination, as disclosed in the applicant's copending applications, Attorney docket Nos. 79135-4 and 79135-5 filed on 4th Mar. 2002, the disclosures of which are incorporated herein by reference.
The data processor may be adapted such that the processor elements can be reconfigured from operating together as a multiple bit word processor to operating individually or separately as independent elements. In this embodiment, the boundary circuit, which passes signals to the processor elements required for the PEs to operate together on multiple bit words, would be conditioned or configured to enable the PEs to operate independently. To enable independent write operations from each PE, the write control circuitry, which controls multiple bit word write operations (i.e. when the processor elements are operating in parallel for multiple bit word processing), may be adapted to selectively enable each PE to perform independent write operations. Thus, instead of the write operations of all PEs being controlled by the same write enable signal, each PE would be controlled by a separate write control signal. In one embodiment, the data processor may be dynamically reconfigurable between a multiple independent PE processor and a multi-bit word parallel processor, so that, for example, the operation of the processor can be switched between these two operating modes between successive processes.
Embodiments of another aspect of the present invention provide a system for grouping SIMD based processing elements into a processing unit whose bit width can be varied to match that of the word being computed. As such it allows for the exchange of signals required for coordination and proper operation of the processing elements that are elements of the processing unit.
FIG. 2a is a schematic block diagram of an embodiment of the invention. A data processor 101 comprises a plurality of computational units 103, 105, each of which performs processing functions and contains at least one processing element (which may or may not be a SIMD based processing element). Each computational unit 103, 105 has a bit width equal to the number of processing elements times the bit width of the processing elements. A boundary circuit 107, 109 is connected to and associated with each computational unit 103, 105. An extension circuit 111 is located between and connected to the two computational units 103, 105 and the two boundary circuits 107, 109 associated with the computational units 103, 105. The extension circuit 111 is used to combine computational units to widen the effective data path. For example, the extension circuit allows two N-bit computational units 103, 105 to be combined such that a 2N bit wide processing unit is formed. Each boundary circuit 107, 109 provides for the distribution of signals to its associated computational unit 103, 105, as for example described above in connection with the embodiment shown in FIG. 1. The basic repeating unit of circuits is presented as a first grouping 113 and the minimum grouping required for the formation of a 2N bit wide computational unit is shown as a second grouping 115.
Another embodiment of the invention is presented in FIG. 2b. In this embodiment a memory such as a Random Access Memory (RAM) 117 is directly connected to the computational units 103, 105. The memory 117 is further connected to and is accessible through a bus 119. In this embodiment the computational units 103, 105 communicate directly with the memory 117 without having to use the bus 119.
A data processor according to another embodiment of the invention is illustrated in FIG. 3. The data processor 201 has first and second computational units (CU) 203, 205, each comprising a plurality of single bit processor elements 207, 209, 211, 213. The number of PEs in each computational unit may be selected depending on the application. For example, in one embodiment, each computational unit may contain 8 PEs so that each unit can parallel process 8 bit data. The number of computational units may also depend on the application. For example, if the processor is required to parallel process both 8 bit and 16 bit data, a minimum of two computational units would be required. (However, an 8-bit computational unit may be configured for processing 16-bit data, by processing one byte of the 16-bit data at a time.)
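The parenthetical remark above, that an 8-bit unit can handle 16-bit data one byte at a time, can be illustrated with an addition carried out in two 8-bit passes. This is a minimal sketch under stated assumptions: the low byte is processed first and the carry between passes is held in a variable standing in for a register. The description does not specify this procedure, and the function name is hypothetical.

```python
def add16_on_8bit_unit(a: int, b: int) -> int:
    """Add two 16-bit values using only 8-bit additions (result modulo 2**16)."""
    low = (a & 0xFF) + (b & 0xFF)        # first pass: low bytes
    carry = low >> 8                     # kept between the two passes
    high = ((a >> 8) & 0xFF) + ((b >> 8) & 0xFF) + carry  # second pass
    return ((high & 0xFF) << 8) | (low & 0xFF)
```

The two passes take place one after the other, whereas two coupled 8-bit units would process all sixteen bit positions in a single pass.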
Each processor element has an arithmetic logic unit (ALU) 215, 217, 219, 221 having an input port 223 and an output port 225, and one or more registers 227, which may provide data to one or more other inputs of the ALU. In this embodiment, the PEs in each CU are arranged in a one dimensional array and are arranged in an order corresponding to the bit order in a multiple bit word, so that the left most PE is in the MSB position and the right most PE is in the LSB position. The processor elements in the embodiment may be similar to and have any of the features of the processor elements of the embodiment shown in FIG. 1.
Each computational unit (203, 205) has an associated boundary circuit 229, 231 for controlling the transmission of data to the processor elements required for operation of the computational unit as a parallel processor. As for the embodiment described above and shown in FIG. 1, the outputs of both the MSB and LSB processor elements of each computational unit 203, 205 are coupleable to its respective boundary circuit 229, 231. Each boundary circuit is also coupleable to output data to the input 223 of its associated LSB PE 209, 213 and to output data to its associated MSB PE 207, 211. The boundary circuit can also broadcast data to a plurality of processor elements, e.g. via the O/P port 212, and may receive external data via an I/P port 214, for example from the array controller (not shown).
An extension circuit 233 is provided to switchably couple the first and second computational units 203, 205 and their associated boundary circuits together to combine their individual word length parallel processing capacity, for example from an individual capacity of 8 bits to a combined capacity of 16 bits. The extension circuitry comprises first, second, third and fourth selector switches 251, 253, 255, 257 (which may comprise multiplexers or any other suitable switch), each having first and second input ports 259, 261, an output port 263 and a control input port 204. The first input port 259 of the first selector switch 251 is coupled to the output port 225 of the MSB ALU 219 of the second CU 205, the second input port 261 is coupled to an output port 266 of the first boundary circuit 229, and the output port 263 of the first selector switch is coupled to the input port 223 of the LSB ALU 217 of the first CU 203, and in this embodiment is coupleable to an input of the LSB PE 209 of the first CU 203.
The first input port 259 of the second selector switch 253 is coupled to the output port of the LSB ALU 217 of the first CU 203, the second input port is coupled to an output 268 of the second boundary circuit 231, and the output port 263 of the second selector switch is coupled to an input 270 of the MSB processor element 211 of the second CU 205.
The first input port 259 of the third selector switch 255 is coupled to the output port 225 of the LSB ALU 217 of the first CU 203, the second input port 261 is coupled to the output port 225 of the LSB ALU 221 of the second CU 205, and the output port 263 is coupled to an input 272 of the first boundary circuit 229.
The first input port 259 of the fourth selector switch 257 is coupled to the output port 225 of the MSB ALU 219 of the second CU 205, the second input port is coupled to the output 225 of the MSB ALU 215 of the first CU 203, and the output port 263 is coupled to an input 274 of the second boundary circuit 231.
A control signal input 276 is provided for receiving control signals for controlling the selector switches 251, 253, 255, 257, from a controller, such as an array controller for controlling the computational units.
The extension circuit 233 has a first operating mode or state, in which the first and second CUs are decoupled and have their individual parallel processing capability, and a second operating mode or state, which couples the CUs together to combine their parallel processing capability, i.e. for parallel processing a word having a length which is the sum of the lengths of the words that they can parallel process individually.
In the first (i.e. decoupled) mode, the first selector switch 251 couples the output 266 of the first boundary circuit 229 to the input of the LSB ALU 217 of the first CU 203, the second selector switch 253 couples the output 268 of the second boundary circuit 231 to the input 270 of the MSB PE 211 of the second CU 205, the third selector switch 255 couples the output 225 of the LSB ALU 217 of the first CU 203 to an input 272 of the first boundary circuit 229, and the fourth selector switch 257 couples the output of the MSB ALU 219 of the second CU 205 to an input 274 of the second boundary circuit 231.
In the second, coupled mode, the first selector switch 251 couples the output 225 of the MSB ALU 219 of the second CU 205 to the input 223 of the LSB ALU 217 of the first CU 203, the second selector switch 253 couples the output port 225 of the LSB ALU 217 of the first CU 203 to an input 270 of the MSB processor element 211 of the second CU 205, the third selector switch 255 couples the output 225 of the LSB ALU 221 of the second computational unit to an input 272 of the first boundary circuit 229, and the fourth selector switch 257 couples the output 225 of the MSB ALU 215 of the first computational unit 203 to the input 274 of the second boundary circuit 231.
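The routing performed by the four selector switches in the two modes can be summarised as a lookup from each switch output to the source that drives it. The sketch below is an illustrative restatement of the two preceding paragraphs, not a circuit description. The function name, the mode constants and the descriptive strings are hypothetical labels, while the reference numerals are those used in the text.

```python
DECOUPLED = "decoupled"
COUPLED = "coupled"


def extension_routes(mode):
    """Return {switch output destination: selected source} for switches 251-257."""
    if mode == DECOUPLED:
        return {
            "CU1 LSB ALU 217 input 223": "boundary circuit 229 output 266",  # switch 251
            "CU2 MSB PE 211 input 270": "boundary circuit 231 output 268",   # switch 253
            "boundary circuit 229 input 272": "CU1 LSB ALU 217 output",      # switch 255
            "boundary circuit 231 input 274": "CU2 MSB ALU 219 output",      # switch 257
        }
    if mode == COUPLED:
        return {
            "CU1 LSB ALU 217 input 223": "CU2 MSB ALU 219 output",           # switch 251
            "CU2 MSB PE 211 input 270": "CU1 LSB ALU 217 output",            # switch 253
            "boundary circuit 229 input 272": "CU2 LSB ALU 221 output",      # switch 255
            "boundary circuit 231 input 274": "CU1 MSB ALU 215 output",      # switch 257
        }
    raise ValueError(f"unknown mode: {mode!r}")
```

Comparing the two dictionaries shows that every switch selects a different source when the mode changes, so a single control signal is enough to move between the decoupled and coupled arrangements in this model.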
Thus, in the coupled mode, the extension circuit provides the required connections to integrate the two arrays of processor elements of the first and second CUs into a parallel processor having the combined number of PEs. The MSB processor element 207 of the first CU 203 functions as the MSB PE of the extended processor, and the LSB processor element 213 of the second CU 205 becomes the LSB PE of the extended processor. The LSB processor element 209 of the first CU and the MSB processor element 211 of the second CU become adjacent intermediate PEs in the extended contiguous array of processor elements.
In coupled mode, a bit propagate bus is formed, via the first selector switch, between the output of the first ALU of the second CU and the input of the last ALU of the first CU to complete a continuous propagate chain through the series of ALUs of the extended processor, and the output of the first ALU of the first CU is coupled, via the fourth selector switch, to an input of the second boundary circuit. These two connections enable, for example, a left barrel shift in the extended processor, the bit from the MSB ALU 215 of the first CU being transmitted to the input of the LSB ALU of the second CU via the bus 281, the fourth selector switch 257 and the second boundary circuit.
In coupled mode, a connection is formed between the output 225 of the last ALU 217 of the first CU and the input 270 of the first PE of the second CU, via the second selector switch 253, to permit bit propagation therebetween, and a bus 283 is formed between the output 225 of the last ALU 221 of the second CU and an input 272 to the first boundary circuit 229, via the third selector switch 255. These connections provide the required connections, for example, for a right barrel shift, the bit output from the LSB ALU of the extended processor being transmitted to an input 278 of the MSB processor element via the bus 283, the third switch 255 and the first boundary circuit 229.
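The effect of coupling on a barrel shift can be sketched by comparing a left rotation of two units treated separately with a left rotation of the same two units treated as one extended word. The model below is illustrative only: each unit is an MSB-first Python list, the first unit holds the more significant bits, and the function names are hypothetical.

```python
def rotate_left(bits):
    """Left barrel shift of an MSB-first list of bits."""
    return bits[1:] + bits[:1]


def shift_units_left(cu1, cu2, coupled):
    """Left barrel shift of two computational units.

    cu1 holds the more significant bits. When coupled, the bit leaving the
    MSB of cu1 re-enters at the LSB of cu2 and the MSB of cu2 moves into the
    LSB of cu1; when decoupled, each unit wraps around on its own.
    """
    if coupled:
        combined = rotate_left(cu1 + cu2)
        return combined[:len(cu1)], combined[len(cu1):]
    return rotate_left(cu1), rotate_left(cu2)
```

With `cu1 = [1, 0, 0, 0]` and `cu2 = [0, 0, 0, 1]`, the coupled shift moves the leading 1 of the first unit into the last position of the second unit, whereas the decoupled shift keeps it inside the first unit.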
FIG. 4 shows a schematic diagram of a boundary circuit according to an embodiment of the present invention in more detail. Referring to FIG. 4, two boundary circuits 307, 309 are shown, together with their respective computational units 303, 305, and an extension circuit 311 for coupling the computational units and the boundary circuits together to form an extended parallel processor. The figure also shows part of two further extension circuits 315, 317, one to the left of the first CU and one to the right of the second CU, illustrating that the CUs may be part of an extended array of any number of computational units.
Each boundary circuit 307, 309 has first, second and third multiplexers 319, 321, 323, an AND gate 325 and a plurality of registers 327, 329, 331, 333, and 335.
Each boundary circuit is connected to an MSB bus 337, for carrying MSB signals either from the MSB ALU of its associated CU, if the CU is operating independently (or it functions as the left most CU of a coupled CU system and therefore carries the MSB of the extended processor), or the MSB bus carries the MSB from the output of the MSB ALU in a composite CU system. Similarly, each boundary circuit is connected to an LSB bus 339, for carrying LSB signals either from the LSB ALU of its associated CU, if the CU is operating independently (or it functions as the right most CU of a coupled CU system), or the LSB bus carries the LSB from the output of the LSB ALU in a composite CU system.
Each boundary circuit is connected to a common General Purpose Input (GPI) bit line 341, which carries SRB signals from an array controller for controlling operations of the CUs. Each CU is also connected to a common write control bus 343 for carrying write enable signals from the array controller, for controlling write operations from the CUs.
Each boundary circuit has a bus 345 connected between the output of the MSB ALU and the input of the LSB ALU, via the first selector switch 351 of the extension circuit, for carrying a signal designated IAO. This bus 345 is connected to the output of the first multiplexer 319, whose inputs are connected to the MSB, LSB, GPI buses, and an output of each of the five registers. The inputs of the five registers are also coupled for receiving MSB, LSB and GPI signals from MSB, LSB and SRB buses via the third multiplexer 323. Thus, the IAO signal can be any of the MSB or LSB of the local CU, the MSB or LSB of a composite CU system, or an SRB signal.
[0058] The second multiplexer 321 is coupled for receiving the GPI signal and signals from the third, fourth and fifth registers (which can latch the SRB signal, the local or composite system MSB and LSB signals), and can broadcast any of these signals (which may be referred to as a broadcast bit, BB) to all ALUs of the local CU simultaneously.
[0059] The three inputs to the AND gate are connected respectively to the WRITE ENABLE control signal bus 343, and the output of the first and second registers 327, 329, and the output of the AND gate is used to control write operations from the CU processor elements, for example to memory. It is to be noted that write operations are not only controlled by the array controller, but also by the local CU, via the boundary circuit, and in this embodiment are conditional on the content/state of both first and second registers.
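The write gating described above can be illustrated with a minimal sketch. The function and parameter names are hypothetical, and each signal is modelled as a single active-high bit, which is an assumption; the description does not state signal polarity.

```python
def write_allowed(write_enable: int, reg1: int, reg2: int) -> int:
    """Model the three-input AND gate of a boundary circuit.

    write_enable: bit from the array controller's write control bus.
    reg1, reg2:   bits held in the first and second boundary registers.
    All three are assumed active-high (an assumption for illustration).
    """
    return (write_enable & reg1 & reg2) & 1


# A write proceeds only if the controller and both local registers agree.
controller_only = write_allowed(1, 0, 1)   # local CU vetoes the write
all_agree = write_allowed(1, 1, 1)         # write goes ahead
```

In this model either local register can veto a write that the array controller has enabled, which matches the statement that writes are conditional on both registers.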
[0060] In this embodiment, additional logic 390 is provided for receiving the contents of the second register of each boundary circuit and performing a logical AND operation on the output of all boundary circuits. The output of this Global AND operation can be used to control further operations of the processor.
[0061] Similarly, in this embodiment, additional logic is provided for receiving the contents of the third register of each boundary circuit and performing a logical OR operation on the output of all boundary circuits. Again, the output of this Global OR operation can be used to control further operations of the processor. For example, the signal may be used to indicate that one of the CU's has reached a predetermined condition, and that further processing by the other CU's is not required. The output of the Global OR may then be used by the processor to terminate processing.
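The Global AND and Global OR reductions can be sketched as follows. This is an illustrative model only: the class and function names are hypothetical, and the registers are represented as single bits.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class BoundaryCircuit:
    reg2: int  # second register, contributes to the Global AND
    reg3: int  # third register, contributes to the Global OR


def global_and(circuits: Sequence[BoundaryCircuit]) -> int:
    """Return 1 only if the second register of every boundary circuit is 1."""
    return int(all(c.reg2 & 1 for c in circuits))


def global_or(circuits: Sequence[BoundaryCircuit]) -> int:
    """Return 1 if the third register of any boundary circuit is 1."""
    return int(any(c.reg3 & 1 for c in circuits))


# Three CUs: all have reg2 set, and only the middle one has raised reg3,
# for example to signal that it has reached a predetermined condition.
array = [BoundaryCircuit(1, 0), BoundaryCircuit(1, 1), BoundaryCircuit(1, 0)]
stop_requested = global_or(array)
all_ready = global_and(array)
```

A controller modelled this way could poll `stop_requested` to end processing early once any one CU reports its condition.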
[0062] The extension circuit may be controlled by the array controller, and may be controlled dynamically in response to the length of the word to be processed to extend or contract the number of CUs required to operate together.
[0063] Any number of CUs can be combined to extend the length (i.e. number of bits) of the word that can be processed. For example, a plurality of eight-bit computational units may be combined to form one or more 16-bit processors, one or more 32-bit processors, one or more 64-bit processors, or one or more 128-bit processors, etc. Each CU has at least one processor element, which may be a 1-bit processor element or a processor element of 2 or more bits, and may have any number of PE's. Different CUs may contain the same number of PEs, or a different number of PEs. Thus, any number of CU's having any number of PE's may be combined to enable a word of a given length to be parallel processed.
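As an illustration of how coupling extends the word length, the sketch below models each CU as a row of 1-bit processor elements performing a ripple-carry addition, with the bit leaving the MSB element of one CU fed to the LSB element of the next. Treating the inter-CU signal as an adder carry is an assumption made for this example, and all names are hypothetical.

```python
def to_bits(value, width):
    """Return the low `width` bits of value, least significant bit first."""
    return [(value >> i) & 1 for i in range(width)]


def from_bits(bits):
    """Rebuild an integer from a least-significant-first bit list."""
    return sum(bit << i for i, bit in enumerate(bits))


def cu_add(a_bits, b_bits, carry_in):
    """One CU: each 1-bit PE adds its pair of bits and passes the carry on."""
    out, carry = [], carry_in
    for a, b in zip(a_bits, b_bits):
        out.append(a ^ b ^ carry)
        carry = (a & b) | (carry & (a ^ b))
    return out, carry


def composite_add(a, b, cu_widths):
    """Add a and b on coupled CUs; cu_widths gives the PEs per CU, LSB CU first."""
    result, carry, shift = 0, 0, 0
    for width in cu_widths:
        s_bits, carry = cu_add(to_bits(a >> shift, width),
                               to_bits(b >> shift, width), carry)
        result |= from_bits(s_bits) << shift
        shift += width
    return result, carry


# Two 8-bit CUs coupled into one 16-bit unit.
total, carry_out = composite_add(0x12FF, 0x0001, [8, 8])
```

Because `cu_widths` is a list, the same model covers CUs with different numbers of PEs, for example `[4, 8, 4]` for a 16-bit word.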
[0064] FIG. 5 shows an array of computational units, in which two or more adjacent computational units may be combined into a composite computational unit through extension circuitry control signals (the extension circuitry is not shown in FIG. 5). In this embodiment, each CU is an 8-bit CU and can be controlled to allow the CU's to operate either individually as 8-bit parallel processors, or combined into 16-bit parallel processors or 32-bit parallel processors.
[0065] In operation, when the control signals CUX_SEL[2:0] are all 0, the CU's will operate in 8-bit mode. When CUX_SEL[2:1] are 0 and CUX_SEL[0] is 1, the circuit will operate in 16-bit mode. When CUX_SEL[2] is 0 and CUX_SEL[1:0] are 1, the circuit will operate in 32-bit mode. When CUX_SEL[2:0] are all 1, all the CUs will be operating together and the circuit will be in 256-bit mode. This embodiment is for illustrative purposes only and any other configurations are possible.
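The mode selection of this embodiment can be summarised as a small lookup. This is a sketch under stated assumptions: the 32-bit case is read as CUX_SEL[1] and CUX_SEL[0] both being 1, the description does not say how other CUX_SEL values behave (so they are rejected here), and the names `CUX_SEL_MODES` and `word_width` are hypothetical.

```python
# CUX_SEL[2:0] value -> operating word width in bits, per the embodiment.
CUX_SEL_MODES = {
    0b000: 8,    # every CU operates on its own
    0b001: 16,   # CUs coupled in pairs
    0b011: 32,   # CUs coupled in groups of four (assumed reading)
    0b111: 256,  # all CUs operate together
}


def word_width(cux_sel: int) -> int:
    """Return the word width selected by a 3-bit CUX_SEL value."""
    if cux_sel not in CUX_SEL_MODES:
        # Encodings not described in the text are treated as invalid.
        raise ValueError(f"unspecified CUX_SEL value: {cux_sel:#05b}")
    return CUX_SEL_MODES[cux_sel]


widths = [word_width(sel) for sel in (0b000, 0b001, 0b011, 0b111)]
```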
[0066] Modifications and changes to the embodiments described above will be apparent to those skilled in the art.

Claims

1. A circuit comprising a processor having a plurality of processor elements, each having an arithmetic logic unit, and a controller for controlling said processor elements, means for providing a respective bit of a multiple bit word to each of said processor elements, and transmission means for enabling signals to be transmitted between said arithmetic logic units, to enable said units to perform a parallel operation on the bits of said multiple bit word.

2. A circuit as claimed in claim 1, wherein each arithmetic logic unit has an output and an input, and said transmission means includes means for coupling the output of each ALU directly to the input of its adjacent ALU which processes a higher order bit.

3. A circuit as claimed in claim 1, wherein said transmission means includes means for coupling the output of the ALU for processing the most significant bit of the multiple bit word directly to the input of the ALU for processing the least significant bit of the multiple bit word.

4. A circuit as claimed in claim 1, wherein the processor element for processing the MSB of the multiple bit word has an input, and the transmission means includes coupling means for coupling the output of the ALU for processing the LSB directly to the input of the MSB processing element.

5. A circuit as claimed in claim 1, further comprising a memory coupleable to each of said processor elements.

6. A die containing a circuit as claimed in claim 1.

7. A circuit comprising a plurality of computational units, each having at least one processor element and extension means for switchably enabling the transmission of signals between said computational units to enable said units to perform a parallel operation on a multiple bit word wherein at least one bit of said word is provided to each computational unit.

8. A circuit as claimed in claim 7, wherein said extension means is capable of coupling the output of the processor element that processes the highest order bit of a computational unit to another computational unit that processes the next higher order bit.

9. A circuit as claimed in claim 8, wherein said extension means is capable of coupling said output to the input of the processor element that processes the next higher order bit of said other computational unit.

10. A circuit as claimed in claim 9, wherein said processor element that processes said highest order bit includes an arithmetic logic unit, and said output comprises the output of said arithmetic logic unit.

11. A circuit as claimed in claim 9, wherein said processor element that processes said higher order bit has an arithmetic logic unit, and said input comprises an input to said arithmetic logic unit.

12. A circuit as claimed in claim 7, wherein said extension means includes a selector switch for selectively coupling said input to one of said output and a port capable of receiving a bit from said other computational unit, or another computational unit.

13. A circuit as claimed in claim 7, wherein said extension means is capable of coupling the output of the processor element that processes the lowest order bit of a computational unit to another computational unit that processes the next lower order bit.

14. A circuit as claimed in claim 13, wherein said extension means is capable of coupling said output to the input of the processor element that processes the next lower order bit of said other computational unit.

15. A circuit as claimed in claim 14, wherein said processor element that processes said lowest order bit includes an arithmetic logic unit, and said output comprises the output of said arithmetic logic unit.

16. A circuit as claimed in claim 15, wherein said extension means includes a selector switch for selectively coupling said input to one of said output and a port for receiving another bit either from said other computational unit that processes the lower order bit or from another computational unit.

17. A circuit as claimed in claim 7, wherein said extension means is capable of coupling the output of the processor element that processes the most significant bit of the multi-bit word to at least one other computational unit that processes one or more bits of said multi-bit word.

18. A circuit as claimed in claim 17, wherein said extension means is capable of coupling the output of said processor element that processes the most significant bit of the multi-bit word to the computational unit that processes the least significant bit of the multi-bit word.

19. A circuit as claimed in claim 7, wherein said extension means includes a selector switch for selectively coupling one of the output of the processor element of a computational unit that processes the highest order bit of that computational unit and an output of the processor element that processes the highest order bit of another computational unit, the highest order bit of the other computational unit being higher than the highest order bit of the one computational unit, to an output port coupled to said one computational unit.

20. A circuit as claimed in claim 7, wherein said extension means is capable of coupling the output of the processor element that processes the least significant bit of the multi-bit word to at least one other computational unit that processes one or more bits of said multi-bit word.

21. A circuit as claimed in claim 20, wherein said extension means is capable of coupling the output of the processor element that processes the least significant bit of the multi-bit word to the computational unit that processes the most significant bit of the multiple bit word.

22. A circuit as claimed in claim 7, comprising a selector switch for selectively coupling one of the output of the processor element that processes the lowest order bit of a computational unit and the output of a processor element that processes the lowest order bit of another computational unit, wherein the lowest order bit of the other computational unit is lower than the one computational unit, to a port coupled to the one computational unit or to a computational unit that processes one or more higher order bits than the one computational unit.

23. A circuit as claimed in claim 7, comprising a first CU and a second CU, said first CU having a plurality of processor elements each having an arithmetic logic unit arranged together for performing parallel operations on multiple bit data, and wherein said extension means is arranged to enable a bit output from the MSB ALU of the first CU to be transmitted to the input of the LSB ALU of the second CU.

24. A circuit as claimed in claim 7, comprising a first CU and a second CU, said second CU having a plurality of processor elements, each having an arithmetic logic unit arranged together for performing parallel operations on multiple bit data, and wherein said extension means is arranged to enable a bit output from the MSB ALU of the second CU to be transmitted to the input of the LSB ALU of the first CU.

25. A circuit as claimed in claim 7, comprising a first CU and a second CU, the first CU having a plurality of processor elements, each having an arithmetic logic unit arranged together for performing parallel operations on multiple bit data, and wherein said extension means is arranged to enable a bit output from the LSB ALU of the first CU to be transmitted to the second CU.

26. A circuit as claimed in claim 25, wherein said second CU comprises a plurality of processing elements, and said extension means is arranged to enable a bit from said LSB ALU of said first CU to be transmitted to the input of the MSB ALU of said second CU.

27. A circuit as claimed in claim 7, comprising a first CU and a second CU, said first and second CUs having a plurality of processor elements, each having an arithmetic logic unit arranged together for performing parallel operations on multiple bit data, and wherein said extension means enables the output of the MSB ALU of the second CU to be coupled to the input of the LSB ALU of the first CU.