US20040254965A1 - Apparatus for variable word length computing in an array processor - Google Patents

Apparatus for variable word length computing in an array processor

Publication number
US20040254965A1
Authority
US
United States
Prior art keywords
bit
output
circuit
processes
processor
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Abandoned
Application number
US10/469,518
Inventor
Eric Giernalczyk
Malcolm Stewart
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
MtekVision Co Ltd
Original Assignee
ATSANA SEMICONDUCTOR CORP
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by ATSANA SEMICONDUCTOR CORP
Priority to US10/469,518
Publication of US20040254965A1
Assigned to ATSANA SEMICONDUCTOR CORP. reassignment ATSANA SEMICONDUCTOR CORP. ASSIGNMENT OF ASSIGNORS INTEREST (SEE DOCUMENT FOR DETAILS). Assignors: STEWART, MALCOLM, GIERNALCZYK, ERIC
Assigned to MTEKVISION CO., LTD. reassignment MTEKVISION CO., LTD. ASSIGNMENT OF ASSIGNORS INTEREST (SEE DOCUMENT FOR DETAILS). Assignors: ATSANA SEMICONDUCTOR CORP.

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F7/00 Methods or arrangements for processing data by operating upon the order or content of the data handled
    • G06F7/38 Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation
    • G06F7/48 Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using non-contact-making devices, e.g. tube, solid state device; using unspecified devices
    • G06F7/57 Arithmetic logic units [ALU], i.e. arrangements or devices for performing two or more of the operations covered by groups G06F7/483 – G06F7/556 or for performing logical operations
    • G06F7/575 Basic arithmetic logic units, i.e. devices selectable to perform either addition, subtraction or one of several logical operations, using, at least partially, the same circuitry
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F13/00 Interconnection of, or transfer of information or other signals between, memories, input/output devices or central processing units
    • G06F13/14 Handling requests for interconnection or transfer
    • G06F13/16 Handling requests for interconnection or transfer for access to memory bus
    • G06F13/1605 Handling requests for interconnection or transfer for access to memory bus based on arbitration
    • G06F13/1652 Handling requests for interconnection or transfer for access to memory bus based on arbitration in a multiprocessor architecture
    • G06F13/1663 Access to shared memory
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F15/00 Digital computers in general; Data processing equipment in general
    • G06F15/16 Combinations of two or more digital computers each having at least an arithmetic unit, a program unit and a register, e.g. for a simultaneous processing of several programs
    • G06F15/163 Interprocessor communication
    • G06F15/173 Interprocessor communication using an interconnection network, e.g. matrix, shuffle, pyramid, star, snowflake
    • G06F15/17337 Direct connection machines, e.g. completely connected computers, point to point communication networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F15/00 Digital computers in general; Data processing equipment in general
    • G06F15/16 Combinations of two or more digital computers each having at least an arithmetic unit, a program unit and a register, e.g. for a simultaneous processing of several programs
    • G06F15/163 Interprocessor communication
    • G06F15/173 Interprocessor communication using an interconnection network, e.g. matrix, shuffle, pyramid, star, snowflake
    • G06F15/17356 Indirect interconnection networks
    • G06F15/17368 Indirect interconnection networks non hierarchical topologies
    • G06F15/17375 One dimensional, e.g. linear array, ring
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F15/00 Digital computers in general; Data processing equipment in general
    • G06F15/76 Architectures of general purpose stored program computers
    • G06F15/80 Architectures of general purpose stored program computers comprising an array of processing units with common control, e.g. single instruction multiple data processors
    • G06F15/8007 Architectures of general purpose stored program computers comprising an array of processing units with common control, e.g. single instruction multiple data processors single instruction multiple data [SIMD] multiprocessors
    • G06F15/8015 One dimensional arrays, e.g. rings, linear arrays, buses
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30 Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/30003 Arrangements for executing specific machine instructions
    • G06F9/30007 Arrangements for executing specific machine instructions to perform operations on data operands
    • G06F9/30036 Instructions to perform operations on packed data, e.g. vector, tile or matrix operations
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F2207/00 Indexing scheme relating to methods or arrangements for processing data by operating upon the order or content of the data handled
    • G06F2207/38 Indexing scheme relating to groups G06F7/38 - G06F7/575
    • G06F2207/3804 Details
    • G06F2207/3808 Details concerning the type of numbers or the way they are handled
    • G06F2207/3812 Devices capable of handling different types of numbers
    • G06F2207/382 Reconfigurable for different fixed word lengths
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F2207/00 Indexing scheme relating to methods or arrangements for processing data by operating upon the order or content of the data handled
    • G06F2207/38 Indexing scheme relating to groups G06F7/38 - G06F7/575
    • G06F2207/3804 Details
    • G06F2207/3808 Details concerning the type of numbers or the way they are handled
    • G06F2207/3828 Multigauge devices, i.e. capable of handling packed numbers without unpacking them

Definitions

  • The invention generally relates to a processing apparatus and more particularly relates to a processing apparatus that contains a number of processing units capable of operating in parallel.
  • Multi-media data is an example of so-called parallel data.
  • Parallel data is data where the individual data are independent of one another. Such data can be processed in parallel as the manipulation of one datum does not require results from the manipulation of other data. It is also the case that multi-media data generally only requires simple manipulation. This implies that the complexity of the processor can be reduced with the absolute number of processors increasing to process the data in parallel. This has resulted in an evolution towards word-parallel computing, which offers a better balance between cycle time and instruction latency.
  • SIMD Processing is a processing paradigm that is suited to the processing of parallel multi-media data.
  • The concept of Single Instruction Multiple Data (SIMD) processing architectures has been known for some time.
  • These processing architectures have encountered problems including high power consumption. This high power consumption is mainly a result of the line resistance associated with the large number of interconnects in an array of SIMD processors. The same line resistance also reduces the speed of communications.
  • Interconnect resistance can also be incurred through the arrangement of SIMD processors and memory. As SIMD processing is generally performed with an array of processors, the length of the interconnect between a processor and the memory can differ from one processor to another.
  • One method of mitigating the above issues has been the tight integration of processors and memory as outlined in U.S. Pat. No. 5,956,274 issued on 21st Sep. 1999 to Duncan Elliot, et al. ('274 patent).
  • the '274 patent generally teaches the placement of processors directly adjacent to a memory array and more particularly teaches a configuration that reduces the memory column to processor ratio.
  • the arrangement taught in the above patent greatly reduces resistance and delay issues. Further the arrangement taught in the '274 patent reduces timing problems that are a result of uneven interconnect line length.
  • SIMD processing architectures are often implemented with 1-bit processors which can result in bit serial processing. This is the case of the '274 patent.
  • bit serial processing introduces its own processing problems including: realignment of the data for bit-serial processing (commonly referred to as corner-turning), non-uniform cycle execution time, and increased instruction latency.
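  • Corner-turning can be pictured as a bit-matrix transpose: words stored one per row are rearranged so that each row holds a single bit position across all words. The following Python sketch is illustrative only; the helper names `corner_turn` and `corner_unturn` and the fixed word width are assumptions and do not come from the patent.

```python
def corner_turn(words, width):
    """Transpose a list of `width`-bit words into bit planes.

    Plane i holds bit i (0 = LSB) of every word, the layout a
    bit-serial array would consume one plane at a time.
    """
    return [[(w >> i) & 1 for w in words] for i in range(width)]


def corner_unturn(planes):
    """Inverse transform: rebuild the words from the bit planes."""
    count = len(planes[0]) if planes else 0
    return [
        sum(planes[i][j] << i for i in range(len(planes)))
        for j in range(count)
    ]


# Three 3-bit words become three planes of three bits each.
planes = corner_turn([5, 3, 6], width=3)
```

A word-parallel data path avoids this step because each word is already presented with one bit per processor element.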
  • Another approach for converging on an optimal architecture is to limit the scope of applications, which reduces the flexibility of such a processor.
  • a desired architecture of SIMD processing includes a balance between the flexibility and the efficiency.
  • Processing units with a variable, dynamically re-configurable, data-path width would allow for greatly improved flexibility at minimal impact on the efficiency. It would be advantageous to be able to adjust the width of the data path of such a processing element to the width of the data word required by a given application, maintaining word-parallel instruction execution.
  • a circuit comprising a processor having a plurality of processor elements, each having an arithmetic logic unit, and a controller for controlling said processor elements, means for providing a respective bit of a multiple bit word to each of the processor elements, and transmission means for enabling signals to be transmitted between said arithmetic logic units, to enable the units to perform a parallel operation on the bits of the multiple bit word.
  • the processor provides an arrangement of processor elements which can operate together in parallel to process multiple bit data.
  • each arithmetic logic unit has an output and an input
  • said transmission means includes means for coupling the output of each ALU directly to the input of its adjacent ALU which processes a higher order bit.
  • the transmission means includes means for coupling the output of the ALU for processing the most significant bit of the multiple bit word directly to the input of the ALU for processing the least significant bit of the multiple bit word.
  • The processor element for processing the MSB of the multiple bit word has an input, and the transmission means includes coupling means for coupling the output of the ALU for processing the LSB directly to the input of the MSB processing element.
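  • As a rough illustration of ALUs chained from each bit to the next higher-order bit, the Python sketch below adds two words held one bit per processor element. It assumes that the signal passed between adjacent ALUs is an addition carry; the text only says that signals are transmitted between the units. All function names are hypothetical.

```python
def full_adder(a, b, carry_in):
    """One-bit ALU step: return (sum_bit, carry_out)."""
    total = a + b + carry_in
    return total & 1, total >> 1


def word_add(a_bits, b_bits, carry_in=0):
    """Add two words held one bit per processor element.

    Bits are listed LSB first. The carry out of each element feeds the
    element that handles the next higher-order bit (assumed signal).
    """
    result = []
    carry = carry_in
    for a, b in zip(a_bits, b_bits):
        s, carry = full_adder(a, b, carry)
        result.append(s)
    return result, carry


def to_bits(value, width):
    """Split an integer into `width` bits, LSB first."""
    return [(value >> i) & 1 for i in range(width)]


def from_bits(bits):
    """Rebuild an integer from bits listed LSB first."""
    return sum(bit << i for i, bit in enumerate(bits))


# 11 + 6 on a 4-bit array: the sum wraps and the MSB element emits a carry.
sum_bits, carry_out = word_add(to_bits(11, 4), to_bits(6, 4))
```

The loop visits the elements in order only because this is software; in the circuit the chain would settle combinationally.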
  • Another aspect of the invention provides a circuit and architecture of processing elements such that the effective arrangement of processing elements can be dynamically altered such that the data path width matches the word length of the data word to be processed.
  • a circuit comprising a plurality of computational units, each having at least one processor element and extension circuitry for switchably enabling the transmission of signals between the computational units to enable the units to perform a parallel operation on a multiple bit word wherein at least one bit of said word is provided to each computational unit.
  • this arrangement enables any number of computational units, each of which is able to process at least one bit of data at a time, to be coupled together to parallel process a multiple bit word, for example whose length is greater than the word that can be parallel processed by an individual computational unit.
  • the circuit comprises a first CU and a second CU, said first CU having a plurality of processor elements each having an arithmetic logic unit arranged together for performing parallel operations on multiple bit data, and wherein said extension circuitry is arranged to enable an MSB output from the MSB ALU of the first CU to be transmitted to the input of the LSB ALU of the second CU.
  • FIG. 1 shows a schematic diagram of a data processor according to an embodiment of the present invention
  • FIG. 2a is a block diagram illustrating computational units and connections therebetween according to one embodiment of the invention.
  • FIG. 2b is a schematic diagram of another embodiment of the invention.
  • FIG. 3 shows a schematic diagram of a data processor including extension circuitry, according to one embodiment of the invention
  • FIG. 4 is a schematic diagram of a data processor according to an embodiment of the invention.
  • FIG. 5 shows a schematic diagram of an array of computational units and various schemes for interconnecting the units, according to an embodiment of the present invention.
  • FIG. 1 shows a data processor according to an embodiment of the present invention.
  • the data processor 1 comprises a computational unit 3 having a plurality of SIMD based processing elements 5 , 7 , 9 , 11 , each of which has access to a memory 13 .
  • Each processor element 5 , 7 , 9 , 11 has an arithmetic logic unit (ALU) 15 , 17 , 19 , 21 each having an input port 23 and an output port 25 .
  • the input port 23 of each ALU is directly coupled to the output port of the neighboring ALU to its right to enable data (e.g. a bit) to propagate from the output 25 to the input 23 of adjacent ALUs.
  • Each processor element further comprises one or more registers 27 and a multiplexer 29 for switchably coupling one or more of a plurality of inputs to a register.
  • inputs to the multiplexer include the output from its local ALU, the outputs from its neighboring ALUs to its left and right, and an output from the memory 13 .
  • The data processor 1 has a control circuit 31 (which is also referred to herein as a boundary circuit) having a first input port 33 which is coupled to the output port 25 of the ALU 15 of the first (i.e. left most) processor element 5, a first output port 35 coupled to the input port 23 of the ALU 21 of the last (i.e. right most) processor element 11, a second input port 37 coupled to the output port 25 of the ALU 21 of the right most processor element 11, and a second output port 39 coupled to one of the input ports of the multiplexer 29 of the left most processor element 5.
  • control circuit 31 further includes a third input port 41 for receiving external data, for example a set/reset bit (SRB).
  • the control circuit also has a third output port 43 for outputting data which is broadcast to all of the processor elements 5 , 7 , 9 , 11 .
  • the broadcast data may for example comprise a set/reset bit, for example received at the third input port 41 , a bit output from the first ALU 15 , received at the first input port 33 , or a bit output from the last ALU 21 and received at the second input port 37 .
  • the control circuit 31 controls the transmission of data to one or more processor elements within the processor to enable the processor to operate on multiple bit data (or words).
  • the control circuit 31 may be arranged to enable a barrel shift, in which data is shifted from one processor element to its adjacent processor element, either to the right or to the left, and data from the end most processor element towards which the shift is directed is fed by the control circuit 31 to the processor element at the opposite end.
  • the control circuit 31 is configured to output the data received at the first input port 33 from the ALU of the left most processor element 5 to the input 23 of the ALU 21 of the right most processor element 11 .
  • control circuit 31 is configured to pass the data received at its second input port 37 from the output 25 of the right most ALU 21 to the left most processor element 5 .
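  • The barrel shift described above can be sketched as a one-position rotate in which the bit leaving the end element is re-entered at the opposite end, the role the text gives to control circuit 31. This is a behavioural Python model under that reading; the function name and the string-valued `direction` argument are assumptions.

```python
def barrel_shift(bits, direction):
    """Rotate a word held one bit per processor element by one position.

    `bits` is listed left to right (index 0 = left-most element).
    """
    if not bits:
        return []
    if direction == "left":
        wrapped = bits[0]              # output of the left-most element
        return bits[1:] + [wrapped]    # re-enters at the right-most element
    if direction == "right":
        wrapped = bits[-1]             # output of the right-most element
        return [wrapped] + bits[:-1]   # re-enters at the left-most element
    raise ValueError("direction must be 'left' or 'right'")
```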
  • control circuit 31 may also broadcast data which is common to two or more processor elements to the appropriate processor elements, so that for example the data is transmitted to each processor element simultaneously.
  • the data processor 1 is configurable for processing multiple bit data.
  • Each processor element may comprise a single bit processor, and the data processor may contain any number of one-bit processor elements, for example 2 , 4 , 8 , 16 , 32 or any other number.
  • the data processor may be arranged such that each of the processor elements receives a single bit of a multiple bit word, e.g. from the memory 13 , or from any other source, such as a different memory or device, so that, for example, the left most processor element 5 receives the most significant bit (MSB) and the right most processor element receives the least significant bit (LSB) of the word to be processed.
  • the multiple bit data may be stored in the memory in such a way that the processor elements receive the data bits in parallel.
  • the memory may contain a plurality of memory segments, each having a read access port 8 , which is coupleable to a respective processor element.
  • Each data bit may be stored in a different segment and thereby read out of memory in parallel into the processor elements.
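  • One way to picture the segmented storage is a memory with one segment per processor element, where a read at a single address returns one bit to every element at once. The class below is an assumed software model, not the patent's circuit; `SegmentedMemory` and its method names are hypothetical.

```python
class SegmentedMemory:
    """Memory split into one segment per processor element (assumed model)."""

    def __init__(self, width, depth):
        # segments[k][addr] holds bit k of word `addr`,
        # where k = 0 is the left-most (MSB) element.
        self.segments = [[0] * depth for _ in range(width)]

    def write_word(self, addr, value):
        """Scatter the bits of `value` across the segments, MSB first."""
        width = len(self.segments)
        for k in range(width):
            self.segments[k][addr] = (value >> (width - 1 - k)) & 1

    def read_parallel(self, addr):
        """Return the bit each processor element sees, MSB first."""
        return [segment[addr] for segment in self.segments]


mem = SegmentedMemory(width=4, depth=2)
mem.write_word(1, 0b1010)
```

Because each bit lives in its own segment, no realignment is needed before the elements operate on the word.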
  • the processor elements are controlled to process the data as a multiple bit word (i.e. a word having an MSB and an LSB).
  • the data processor may be incorporated in a SIMD processor, and each processor element may be controlled in parallel by an array controller 45 .
  • the processor elements may be adapted to be capable of performing multiple step operations and able to store the intermediate results of these operations, thereby reducing the frequency of memory accesses.
  • A multi-bit word may be written to memory, e.g. the local memory associated with each processor element, another portion of memory or output to another device.
  • Write operations by the processor elements may be controlled in parallel, so that the word bits are output as a unitary word from the processor.
  • Write operations from each processor element may be coordinated and controlled by the array controller 45 .
  • the control circuit 31 has a write control input port 47 for receiving a write control signal, e.g. a write enable (IWE) signal from the array controller 45 .
  • the control circuit 31 may control write operations in response to both the write control signal from the array controller and another state associated with the data processor, for example a state recorded in the control circuit 31 .
  • the control circuit 31 has a write control output port 49 for outputting a write enable signal, which in this embodiment may be passed or broadcast to each of the write enable lines 12 (which may be coupled together by a line 51 ) associated with each I/O memory port 8 for enabling a bit of a multiple bit word to be output by a respective processor element to its respective I/O port for storage in the memory 13 .
  • Circuitry may be provided for directing data output by the processor elements to another part of memory, or to another device, and circuitry may enable the data to be selectively directed to one of the local memory and another destination, as disclosed in the applicant's copending applications, Attorney docket Nos. 79135-4 and 79135-5 filed on 4th Mar. 2002, the disclosures of which are incorporated herein by reference.
  • the data processor may be adapted such that the processor elements can be reconfigured from operating together as a multiple bit word processor to operating individually or separately as independent elements.
  • the boundary circuit which passes signals to the processor elements required for the PEs to operate together on multiple bit words would be conditioned or configured to enable the PEs to operate independently.
  • the write control circuitry which controls multiple bit word write operations (i.e. when the processor elements are operating in parallel for multiple bit word processing) may be adapted to selectively enable each PE to perform independent write operations.
  • the data processor may be dynamically reconfigurable between a multiple independent PE processor, and a multi-bit word parallel processor, so that, for example the operation of the processor can be switched between these two operating modes between successive processes.
  • Embodiments of another aspect of the present invention provide a system for grouping SIMD based processing elements into a processing unit whose bit width can be varied to match that of the word being computed. As such it allows for the exchange of signals required for coordination and proper operation of the processing elements that are elements of the processing unit.
  • FIG. 2a is a schematic block diagram of an embodiment of the invention.
  • a data processor 101 comprises a plurality of computational units 103 , 105 , each of which performs processing functions and contains at least one processing element (which may or may not be a SIMD based processing element).
  • Each computational unit 103, 105 has a bit width equal to the number of its processing elements multiplied by the bit width of each processing element.
  • a boundary circuit 107 , 109 is connected to and associated with each computational unit 103 , 105 .
  • An extension circuit 111 is located between and connected to the two computational units 103 , 105 and the two boundary circuits 107 , 109 associated with the computational units 103 , 105 . The extension circuit 111 is used to combine computational units to widen the effective data path.
  • the extension circuit allows two N-bit computational units 103 , 105 to be combined such that a 2N bit wide processing unit is formed.
  • Each boundary circuit 107 , 109 provides for the distribution of signals to its associated computational unit 103 , 105 , as for example described above in connection with the embodiment shown in FIG. 1.
  • the basic repeating unit of circuits is presented as a first grouping 113 and the minimum grouping required for the formation of a 2N bit wide computational unit is shown as a second grouping 115 .
  • Another embodiment of the invention is presented in FIG. 2b.
  • a memory such as a Random Access Memory (RAM) 117 is directly connected to the computational units 103 , 105 .
  • the memory 117 is further connected to and is accessible through a bus 119 .
  • the computational units 103 , 105 communicate directly with the memory 117 without having to use the bus 119 .
  • the data processor 201 has first and second computational units (CU) 203 , 205 , each comprising a plurality of single bit processor elements 207 , 209 , 211 , 213 .
  • The number of PEs in each computational unit may be selected depending on the application. For example, in one embodiment, each computational unit may contain 8 PEs so that each unit can parallel process 8 bit data.
  • the number of computational units may also depend on the application. For example, if the processor is required to parallel process both 8 bit and 16 bit data, a minimum of two computational units would be required. (However, an 8-bit computational unit may be configured for processing 16-bit data, by processing one byte of the 16-bit data at a time).
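  • Processing 16-bit data one byte at a time on an 8-bit unit can be sketched as two passes through the same unit. The sketch assumes an addition in which the carry from the low byte is held and fed into the high-byte pass; the text states only that one byte is processed at a time. Function names are hypothetical.

```python
def add8(a, b, carry_in=0):
    """Model of an 8-bit computational unit: return (sum, carry_out)."""
    total = a + b + carry_in
    return total & 0xFF, total >> 8


def add16_on_8bit_unit(a, b):
    """Add two 16-bit values in two passes through one 8-bit unit."""
    # Pass 1: low bytes.
    low, carry = add8(a & 0xFF, b & 0xFF)
    # Pass 2: high bytes plus the carry saved from pass 1 (assumed).
    high, carry = add8((a >> 8) & 0xFF, (b >> 8) & 0xFF, carry)
    return (high << 8) | low, carry
```

Two coupled 8-bit units would produce the same result in a single pass, which is the trade-off the paragraph describes.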
  • Each processor element has an arithmetic logic unit (ALU) 215 , 217 , 219 , 221 having an input port 223 and an output port 225 , and one or more registers 227 , which may provide data to one or more other inputs of the ALU.
  • The PEs in each CU are arranged in a one dimensional array and are arranged in an order corresponding to the bit order in a multiple bit word, so that the left most PE is in the MSB position and the right most PE is in the LSB position.
  • the processor elements in the embodiment may be similar to and have any of the features of the processor elements of the embodiment shown in FIG. 1.
  • Each computational unit ( 203 , 205 ) has an associated boundary circuit 229 , 231 for controlling the transmission of data to the processor elements required for operation of the computational unit as a parallel processor.
  • the outputs of both the MSB and LSB processor elements of each computational unit 203 , 205 are coupleable to its respective boundary circuit 229 , 231 .
  • Each boundary circuit is also coupleable to output data to the input 223 of its associated LSB PE 209 , 213 and to output data to its associated MSB PE 207 , 211 .
  • the boundary circuit can also broadcast data to a plurality of processor elements e.g. via the O/P port 212 and may receive external data via an I/P port 214 , for example from the array controller (not shown).
  • An extension circuit 233 is provided to switchably couple the first and second computational units 203 , 205 and their associated boundary circuits together to combine their individual word length parallel processing capacity, for example from an individual capacity of 8 bits to a combined capacity of 16 bits.
  • the extension circuitry comprises first, second, third and fourth selector switches 251 , 253 , 255 , 257 , (which may comprise multiplexers or any other suitable switch) each having first and second input ports 259 , 261 , an output port 263 and a control input port 204 .
  • the first input port 259 of the first selector switch 251 is coupled to the output port 225 of the MSB ALU 219 of the second CU 205
  • the second input port 261 is coupled to an output port 266 of the first boundary circuit 229
  • the output port 263 of the first selector switch is coupled to the input port 223 of the LSB ALU 217 of the first CU 203, and in this embodiment is coupleable to an input of the LSB PE 209 of the first CU 203.
  • The first input port 259 of the second selector switch 253 is coupled to the output port 225 of the LSB ALU 217 of the first CU 203, the second input port 261 is coupled to an output 268 of the second boundary circuit 231, and the output port 263 of the second selector switch is coupled to an input 270 of the MSB processor element 211 of the second CU 205.
  • the first input port 259 of the third selector switch 255 is coupled to the output port 225 of the LSB ALU 217 of the first CU 203
  • the second input port 261 is coupled to the output port 225 of the LSB ALU 221 of the second CU 205
  • the output port 263 is coupled to an input 272 of the first boundary circuit 229 .
  • the first input port 259 of the fourth selector switch 257 is coupled to the output port 225 of the MSB ALU 219 of the second CU 205 , the second input port is coupled to the output 225 of the MSB ALU 215 of the first CU 203 , and the output port 263 is coupled to an input 274 of the second boundary circuit 231 .
  • a control signal input 276 is provided for receiving control signals for controlling the selector switches 251 , 253 , 255 , 257 , from a controller, such as an array controller for controlling the computational units.
  • The extension circuit 233 has a first operating mode or state, in which the first and second CUs are decoupled and have their individual parallel processing capability, and a second operating mode or state, which couples the CUs together to combine their parallel processing capability, i.e. for parallel processing a word whose length is the sum of the lengths of the words that they can parallel process individually.
  • In the first operating mode, the first selector switch 251 couples the output 266 of the first boundary circuit 229 to the input of the LSB ALU 217 of the first CU 203
  • the second selector switch 253 couples the output 268 of the second boundary circuit 231 to the input 270 of the MSB PE 211 of the second CU 205
  • the third selector switch 255 couples the output 225 of the LSB ALU 217 of the first CU 203 to an input 272 of the first boundary circuit 229
  • the fourth selector switch 257 couples the output of the MSB ALU 219 of the second CU 205 to an input 274 of the second boundary circuit 231 .
  • In the second operating mode, the first selector switch 251 couples the output 225 of the MSB ALU 219 of the second CU 205 to the input 223 of the LSB ALU 217 of the first CU 203
  • the second selector switch 253 couples the output port 225 of the LSB ALU 217 of the first CU 203 to an input 270 of the MSB processor element 211 of the second CU 205
  • the third selector switch 255 couples the output 225 of the LSB ALU 221 of the second computational unit to an input 272 of the first boundary circuit 229
  • the fourth selector switch 257 couples the output 225 of the MSB ALU 215 of the first computational unit 203 to the input 274 of the second boundary circuit 231 .
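  • The routing performed by the four selector switches can be summarised as a lookup table keyed by operating mode. The table below is transcribed from the connections listed above; the mode labels "decoupled" and "coupled", the string names for sources and destinations, and the `route` helper are assumptions made for this Python sketch.

```python
# Where each selector switch delivers its output.
DESTINATION = {
    251: "ALU 217 input 223 (LSB, first CU)",
    253: "PE 211 input 270 (MSB, second CU)",
    255: "boundary circuit 229 input 272",
    257: "boundary circuit 231 input 274",
}

# Which source each selector switch passes through in each mode.
SOURCE = {
    "decoupled": {  # first operating mode: CUs work independently
        251: "boundary circuit 229 output 266",
        253: "boundary circuit 231 output 268",
        255: "ALU 217 output (LSB, first CU)",
        257: "ALU 219 output (MSB, second CU)",
    },
    "coupled": {  # second operating mode: CUs form one wider unit
        251: "ALU 219 output (MSB, second CU)",
        253: "ALU 217 output (LSB, first CU)",
        255: "ALU 221 output (LSB, second CU)",
        257: "ALU 215 output (MSB, first CU)",
    },
}


def route(mode, signals):
    """Return {destination: bit} for the given mode.

    `signals` maps every source name to the bit it currently drives.
    """
    return {DESTINATION[sw]: signals[src] for sw, src in SOURCE[mode].items()}
```

In the decoupled mode each boundary circuit sees only the end bits of its own unit; in the coupled mode it sees the end bits of the extended word.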
  • the extension circuit provides the required connections to integrate the two arrays of processor elements of the first and second CUs into a parallel processor having the combined number of PEs.
  • the MSB processor element 207 of the first CU 203 functions as the MSB PE of the extended processor
  • the LSB processor element 213 of the second CU 205 becomes the LSB PE of the extended processor.
  • the LSB processor element 209 of the first CU and MSB processor element 211 of the second CU become adjacent intermediate PEs in the extended contiguous array of processor elements.
  • A bit propagate bus is formed, via the first selector switch, between the output of the first ALU of the second CU and the input of the last ALU of the first CU to complete a continuous propagate chain through the series of ALUs of the extended processor, and the output of the first ALU of the first CU is coupled, via the fourth selector switch, to an input of the second boundary circuit.
  • These two connections enable, for example, a left barrel shift in the extended processor, the bit from the MSB ALU 215 of the first CU being transmitted to the input of the LSB ALU of the second CU via the bus 281 , the fourth selector switch 257 and the second boundary circuit.
  • a connection is formed between the output 225 of the last ALU 217 of the first CU and the input 270 of the first PE of the second CU, via the second selector switch 253 , to permit bit propagation therebetween, and a bus 283 is formed between the output 225 of the last ALU 221 of the second CU and an input 272 to the first boundary circuit 229 , via the third selector switch 255 .
  • These connections provide the required connections, for example, for a right barrel shift, the bit output from the LSB ALU of the extended processor being transmitted to an input 278 of the MSB processor element via the bus 283 , the third switch 255 and the first boundary circuit 229 .
  • FIG. 4 shows a schematic diagram of a boundary circuit according to an embodiment of the present invention in more detail.
  • two boundary circuits 307 , 309 are shown, together with their respective computational units 303 , 305 , and an extension circuit 311 for coupling the computational units and the boundary circuits together to form an extended parallel processor.
  • the figure also shows part of two further extension circuits 315 , 317 , one to the left of the first CU and one to the right of the second CU, illustrating that the CUs may be part of an extended array of any number of computational units.
  • Each boundary circuit 307 , 309 has first, second and third multiplexers 319 , 321 , 323 , an AND gate 325 and a plurality of registers 327 , 329 , 331 , 333 , and 335 .
  • Each boundary circuit is connected to an MSB bus 337 , for carrying MSB signals either from the MSB ALU of its associated CU, if the CU is operating independently (or it functions as the left most CU of a coupled CU system and therefore carries the MSB of the extended processor), or the MSB bus carries the MSB from the output of the MSB ALU in a composite CU system.
  • each boundary circuit is connected to an LSB bus 339 , for carrying LSB signals either from the LSB ALU of its associated CU, if the CU is operating independently (or it functions as the right most CU of a coupled CU system), or the LSB bus carries the LSB from the output of the LSB ALU in a composite CU system.
  • Each boundary circuit is connected to a common General Purpose Input (GPI) bit line 341 , which carries SRB signals from an array controller for controlling operations of the CU's.
  • Each CU is also connected to a common write control bus 343 for carrying write enable signals from the array controller, for controlling write operations from the CU's.
  • Each boundary circuit has a bus 345 connected between the output of the MSB ALU and the input of the LSB ALU, via the first selector switch 351 of the extension circuit, for carrying a signal designated IAO.
  • This bus 345 is connected to the output of the first multiplexer 319 , whose inputs are connected to the MSB, LSB, GPI buses, and an output of each of the five registers.
  • the inputs of the five registers are also coupled for receiving MSB, LSB and GPI signals from MSB, LSB and SRB buses via the third multiplexer 323 .
  • the IAO signal can be any of the MSB or LSB of the local CU, the MSB or LSB of a composite CU system, or an SRB signal.
  • the second multiplexer 321 is coupled for receiving the GPI signal and signals from the third, fourth and fifth registers (which can latch the SRB signal, the local or composite system MSB and LSB signals), and can broadcast any of these signals (which may be referred to as a broadcast bit, BB) to all ALUs of the local CU simultaneously.
  • the three inputs to the AND gate are connected respectively to the WRITE ENABLE control signal bus 343 , and the output of the first and second registers 327 , 329 , and the output of the AND gate is used to control write operations from the CU processor elements, for example to memory. It is to be noted that write operations are not only controlled by the array controller, but also by the local CU, via the boundary circuit, and in this embodiment are conditional on the content/state of both first and second registers.
  • additional logic 390 is provided for receiving the contents of the second register of each boundary circuit and performing a logical AND operation on the output of all boundary circuits.
  • the output of this Global AND operation can be used to control further operations of the processor.
  • additional logic is provided for receiving the contents of the third register of each boundary circuit and performing a logical OR operation on the output of all boundary circuits.
  • the output of this Global OR operation can be used to control further operations of the processor.
  • the signal may be used to indicate that one of the CU's has reached a predetermined condition, and that further processing by the other CU's is not required.
  • the output of the Global OR may then be used by the processor to terminate processing.
  • the extension circuit may be controlled by the array controller, and may be controlled dynamically in response to the length of the word to be processed to extend or contract the number of CUs required to operate together.
  • Any number of CUs can be combined to extend the length (i.e. number of bits) of the word that can be processed.
  • a plurality of eight bit computational units may be combined to form one or more 16-bit processors, one or more 32-bit processors, one or more 64-bit processors, or one or more 128-bit processors, etc.
  • Each CU has at least one processor element, which may be a 1-bit processor element or a processor element of 2 or more bits, and may have any number of PEs. Different CUs may contain the same number of PEs, or different numbers of PEs. Thus, any number of CUs having any number of PEs may be combined to enable a word of a given length to be parallel processed.
  • FIG. 5 shows an array of computational units, in which two or more adjacent computational units may be combined into a composite computational unit through extension circuitry control signals (the extension circuitry is not shown in FIG. 5).
  • each CU comprises an 8-bit CU and can be controlled to allow the CU's to operate either individually as 8-bit parallel processors, or combined into 16-bit parallel processors or 32-bit parallel processors.
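The write gating and global status logic described in the boundary-circuit items above can be sketched in a few lines. This is a minimal behavioural model, not the circuit itself: the function names are hypothetical, and each register is assumed to hold a single bit represented as a Python boolean.

```python
def local_write_enable(array_write_enable, reg1, reg2):
    """Model of the three-input AND gate 325 (names are illustrative).

    The write is allowed only if the array controller asserts WRITE ENABLE
    and both the first and second registers of the boundary circuit are set.
    """
    return bool(array_write_enable and reg1 and reg2)


def global_and(second_registers):
    """AND over the second register of every boundary circuit."""
    return all(second_registers)


def global_or(third_registers):
    """OR over the third register of every boundary circuit."""
    return any(third_registers)


# One CU has cleared its first register, so its write is suppressed locally
# even though the array controller asserts WRITE ENABLE for every CU.
enables = [local_write_enable(True, r1, r2)
           for r1, r2 in [(True, True), (False, True), (True, True)]]
```

In this model a single set third register is enough to raise the Global OR, which matches the use described above of signalling that one CU has reached a predetermined condition.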

Abstract

A computational unit comprises a processor having a plurality of processing elements, each having an arithmetic logic unit, and a controller for controlling the processor elements. The processor can provide a respective bit of a multiple bit word to each of the processor elements and enables signals to be transmitted between the arithmetic logic units to enable the units to perform a parallel operation on the bits of the multiple bit word. Extension circuitry is provided for selectively coupling one or more computational units together to combine their parallel processing capability.

Description

    FIELD OF THE INVENTION
  • The invention generally relates to a processing apparatus and more particularly relates to a processing apparatus that contains a number of processing units capable of operating in parallel. [0001]
  • BACKGROUND
  • Designing a modern microprocessor is a complex task that demands careful balance between cycle times, instruction set architecture, instruction latency, otherwise known as cycle-per-instruction, and finally die area costs. Many traditional microprocessors are designed to execute a single instruction at a time. The processor executes instructions in a serial fashion. This paradigm generally implies a single processing core. The performance of such microprocessors has been improved by two basic approaches. The first of these is the data path width. Over time the data path width has increased from the conceptual 1-bit Turing Machine, to some of the latest 128-bit processors. Second, the performance of the processor has also been improved by increasing the rate at which instructions are executed i.e. the clock frequency has been increased. This increase has taken the logic from 33 MHz to 1.4 GHz in approximately ten years. While the above developments have provided considerable increases in the performance of “serial” microprocessors there are tasks to which they are not well suited. [0002]
  • One task to which serial processors are not well suited is the manipulation of multi-media data. Multi-media data is an example of so-called parallel data. Parallel data is data where the individual data are independent of one another. Such data can be processed in parallel as the manipulation of one datum does not require results from the manipulation of other data. It is also the case that multi-media data generally only requires simple manipulation. This implies that the complexity of the processor can be reduced with the absolute number of processors increasing to process the data in parallel. This has resulted in an evolution towards word-parallel computing, which offers a better balance between cycle time and instruction latency. [0003]
  • One of the major shortcomings with such an approach is the fact that for certain types of processing, namely, multimedia applications, very wide data paths are very often unutilized. To design around this, SIMD extensions were introduced, which divide the existing data path into a number of narrower data paths, so that an instruction can be executed on a number of data samples concurrently. One widely known example of this is the MMX processing unit in Intel's Pentium processor, which is applied to a 64-bit data path. [0004]
  • Single Instruction Multiple Data (SIMD) processing is a processing paradigm that is suited to the processing of parallel multi-media data. The concept of SIMD processing architectures has been known for some time. However, historically these processing architectures have encountered problems including high power consumption. This high power consumption is mainly a result of the line resistance associated with the large number of interconnects associated with an array of SIMD processors. The resistance associated with this power consumption also reduces the speed of communications. [0005]
  • Interconnect resistance can also be incurred through the arrangement of SIMD processors and memory. As SIMD processing is generally performed with an array of processors there can be differences in the interconnect length and the memory. One method of mitigating the above issues has been the tight integration of processors and memory as outlined in U.S. Pat. No. 5,956,274 issued on 21st Sep. 1999 to Duncan Elliott, et al. ('274 patent). [0006]
  • The '274 patent generally teaches the placement of processors directly adjacent to a memory array and more particularly teaches a configuration that reduces the memory column to processor ratio. The arrangement taught in the above patent greatly reduces resistance and delay issues. Further the arrangement taught in the '274 patent reduces timing problems that are a result of uneven interconnect line length. [0007]
  • SIMD processing architectures are often implemented with 1-bit processors which can result in bit serial processing. This is the case of the '274 patent. However, bit serial processing introduces its own processing problems including: realignment of the data for bit-serial processing (commonly referred to as corner-turning), non-uniform cycle execution time, and increased instruction latency. Another approach for converging on an optimal architecture is limiting the scope of applications, which reduces the flexibility of such a processor. [0008]
  • A desired architecture of SIMD processing includes a balance between flexibility and efficiency. Processing units with a variable, dynamically re-configurable, data-path width would allow for greatly improved flexibility at minimal impact on the efficiency. It would be advantageous to be able to adjust the width of the data path of such a processing element to the width of the data word required by a given application, while maintaining word-parallel instruction execution. [0009]
  • It is also often the case in SIMD processing that a word whose length is greater than the bit width of the processor must be aligned such that processing occurs in a serial manner with the word now being processed serially through a processing element. This requirement once again forces the processing to be limited by the throughput of the processing element. [0010]
  • Therefore there is a need for a means for creating SIMD based processing units whose data path width can be varied to match the word length of the data word to be processed. [0011]
  • SUMMARY OF INVENTION
  • According to one aspect of the present invention, there is provided a circuit comprising a processor having a plurality of processor elements, each having an arithmetic logic unit, and a controller for controlling said processor elements, means for providing a respective bit of a multiple bit word to each of the processor elements, and transmission means for enabling signals to be transmitted between said arithmetic logic units, to enable the units to perform a parallel operation on the bits of the multiple bit word. [0012]
  • Advantageously, the processor provides an arrangement of processor elements which can operate together in parallel to process multiple bit data. [0013]
  • In one embodiment, each arithmetic logic unit has an output and an input, and said transmission means includes means for coupling the output of each ALU directly to the input of its adjacent ALU which processes a higher order bit. [0014]
  • In another embodiment, the transmission means includes means for coupling the output of the ALU for processing the most significant bit of the multiple bit word directly to the input of the ALU for processing the least significant bit of the multiple bit word. [0015]
  • In another embodiment, the processor element for processing the MSB of the multiple bit word has an input, and the transmission means includes coupling means for coupling the output of the ALU for processing the LSB directly to the input of the MSB processing element. [0016]
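One use of the coupling between each ALU output and the input of the adjacent higher-order ALU can be illustrated with a ripple-carry addition. This is a sketch under stated assumptions: the text does not say that the propagated signal is a carry, and the function name `ripple_add` and the MSB-first list layout are chosen here only for illustration.

```python
def ripple_add(a_bits, b_bits):
    """Add two equal-length bit lists, MSB first, one bit per processor element.

    Assumption: the bit passed from each ALU to its higher-order neighbour
    is treated as a carry. The source only states that the ports are coupled.
    """
    carry = 0
    out = [0] * len(a_bits)
    # Walk from the LSB element (right most) to the MSB element (left most).
    for i in reversed(range(len(a_bits))):
        total = a_bits[i] + b_bits[i] + carry
        out[i] = total & 1   # result bit kept by this element
        carry = total >> 1   # bit handed to the adjacent higher-order ALU
    return out, carry


result, carry_out = ripple_add([0, 1, 1, 0], [0, 0, 1, 1])  # 6 + 3
```

The loop is sequential only because this is a software model; in the described hardware the same chain exists as direct connections between neighbouring ALUs.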
  • Another aspect of the invention provides a circuit and architecture of processing elements such that the effective arrangement of processing elements can be dynamically altered such that the data path width matches the word length of the data word to be processed. [0017]
  • According to another aspect of the present invention, there is provided a circuit comprising a plurality of computational units, each having at least one processor element and extension circuitry for switchably enabling the transmission of signals between the computational units to enable the units to perform a parallel operation on a multiple bit word wherein at least one bit of said word is provided to each computational unit. [0018]
  • Advantageously, this arrangement enables any number of computational units, each of which is able to process at least one bit of data at a time, to be coupled together to parallel process a multiple bit word, for example whose length is greater than the word that can be parallel processed by an individual computational unit. [0019]
  • In one embodiment, the circuit comprises a first CU and a second CU, said first CU having a plurality of processor elements each having an arithmetic logic unit arranged together for performing parallel operations on multiple bit data, and wherein said extension circuitry is arranged to enable an MSB output from the MSB ALU of the first CU to be transmitted to the input of the LSB ALU of the second CU. [0020]
  • BRIEF DESCRIPTION OF FIGURES
  • FIG. 1 shows a schematic diagram of a data processor according to an embodiment of the present invention; [0021]
  • FIG. 2a is a block diagram illustrating computational units and connections therebetween according to one embodiment of the invention; [0022]
  • FIG. 2b is a schematic diagram of another embodiment of the invention; [0023]
  • FIG. 3 shows a schematic diagram of a data processor including extension circuitry, according to one embodiment of the invention; [0024]
  • FIG. 4 is a schematic diagram of a data processor according to an embodiment of the invention, and [0025]
  • FIG. 5 shows a schematic diagram of an array of computational units and various schemes for interconnecting the units, according to an embodiment of the present invention. [0026]
  • DETAILED DESCRIPTION OF EMBODIMENTS OF THE INVENTION
  • FIG. 1 shows a data processor according to an embodiment of the present invention. The data processor 1 comprises a computational unit 3 having a plurality of SIMD based processing elements 5, 7, 9, 11, each of which has access to a memory 13. Each processor element 5, 7, 9, 11 has an arithmetic logic unit (ALU) 15, 17, 19, 21 each having an input port 23 and an output port 25. The input port 23 of each ALU is directly coupled to the output port of the neighboring ALU to its right to enable data (e.g. a bit) to propagate from the output 25 to the input 23 of adjacent ALUs. Each processor element further comprises one or more registers 27 and a multiplexer 29 for switchably coupling one or more of a plurality of inputs to a register. In this embodiment, inputs to the multiplexer include the output from its local ALU, the outputs from its neighboring ALUs to its left and right, and an output from the memory 13. [0027]
  • The data processor 1 has a control circuit 31 (which is also referred to herein as a boundary circuit) having a first input port 33 which is coupled to the output port 25 of the ALU 15 of the first (i.e. left most) processor element 5, a first output port 35 coupled to the input port 23 of the ALU 21 of the last (i.e. right most) processor element 11, a second input port 37 coupled to the output port 25 of the ALU 21 of the right most processor element 11, and a second output port 39 coupled to one of the input ports of the multiplexer 29 of the left most processor element 5. [0028]
  • In this embodiment, the control circuit 31 further includes a third input port 41 for receiving external data, for example a set/reset bit (SRB). The control circuit also has a third output port 43 for outputting data which is broadcast to all of the processor elements 5, 7, 9, 11. The broadcast data may for example comprise a set/reset bit, for example received at the third input port 41, a bit output from the first ALU 15, received at the first input port 33, or a bit output from the last ALU 21 and received at the second input port 37. [0029]
  • The control circuit 31 controls the transmission of data to one or more processor elements within the processor to enable the processor to operate on multiple bit data (or words). For example, the control circuit 31 may be arranged to enable a barrel shift, in which data is shifted from one processor element to its adjacent processor element, either to the right or to the left, and data from the end most processor element towards which the shift is directed is fed by the control circuit 31 to the processor element at the opposite end. Thus, for a left barrel shift, the control circuit 31 is configured to output the data received at the first input port 33 from the ALU of the left most processor element 5 to the input 23 of the ALU 21 of the right most processor element 11. Conversely, for a right barrel shift, the control circuit 31 is configured to pass the data received at its second input port 37 from the output 25 of the right most ALU 21 to the left most processor element 5. As mentioned above, the control circuit 31 may also broadcast data which is common to two or more processor elements to the appropriate processor elements, so that for example the data is transmitted to each processor element simultaneously. [0030]
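The barrel shift described in this paragraph can be modelled as a rotation of a list holding one bit per processor element. The sketch below is illustrative only: the function names are hypothetical, and index 0 is assumed to be the left most (MSB) element.

```python
def barrel_shift_left(bits):
    """Rotate one position to the left; bits[0] is the left most (MSB) element.

    The bit leaving the left most element stands for the data received at the
    control circuit's first input port and re-injected at the right most element.
    """
    wrapped = bits[0]
    return bits[1:] + [wrapped]


def barrel_shift_right(bits):
    """Rotate one position to the right; the right most bit wraps to the left."""
    wrapped = bits[-1]
    return [wrapped] + bits[:-1]


word = [1, 0, 0, 1, 0, 1, 1, 0]
shifted_left = barrel_shift_left(word)
shifted_right = barrel_shift_right(word)
```

Applying one function and then the other returns the original word, which is a quick check that no bit is lost at either end of the array.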
  • Thus, the data processor 1 is configurable for processing multiple bit data. Each processor element may comprise a single bit processor, and the data processor may contain any number of one-bit processor elements, for example 2, 4, 8, 16, 32 or any other number. [0031]
  • In operation, the data processor may be arranged such that each of the processor elements receives a single bit of a multiple bit word, e.g. from the memory 13, or from any other source, such as a different memory or device, so that, for example, the left most processor element 5 receives the most significant bit (MSB) and the right most processor element receives the least significant bit (LSB) of the word to be processed. If read from the memory 13, the multiple bit data may be stored in the memory in such a way that the processor elements receive the data bits in parallel. For example, the memory may contain a plurality of memory segments, each having a read access port 8, which is coupleable to a respective processor element. Each data bit may be stored in a different segment and thereby read out of memory in parallel into the processor elements. After the data has been read into the processor elements, the processor elements are controlled to process the data as a multiple bit word (i.e. a word having an MSB and an LSB). In one embodiment, the data processor may be incorporated in a SIMD processor, and each processor element may be controlled in parallel by an array controller 45. The processor elements may be adapted to be capable of performing multiple step operations and able to store the intermediate results of these operations, thereby reducing the frequency of memory accesses. After a process has been completed, for example on one or more data words, the result of the process (e.g. a multi-bit word) may be written to memory, e.g. the local memory associated with each processor element, another portion of memory or output to another device. Write operations by the processor elements may be controlled in parallel, so that the word bits are output as a unitary word from the processor. Write operations from each processor element may be coordinated and controlled by the array controller 45. [0032]
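The bit-per-segment layout described in this paragraph may be pictured as splitting a word into a list in which entry i is the bit held by memory segment i and delivered to processor element i. The helper names below are hypothetical, and the MSB-first ordering follows the example in the text.

```python
def word_to_segments(word, width):
    """Split `word` into `width` bits, MSB first.

    Entry i models the bit stored in memory segment i, which is read in
    parallel into processor element i (entry 0 feeds the left most element).
    """
    return [(word >> (width - 1 - i)) & 1 for i in range(width)]


def segments_to_word(bits):
    """Reassemble the unitary word written back by the processor elements."""
    value = 0
    for bit in bits:
        value = (value << 1) | bit
    return value


segments = word_to_segments(0b1011, 4)
round_trip = segments_to_word(segments)
```

Because every element receives its bit in the same read, no corner-turning step is needed before the word is processed.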
  • In contrast, in the SIMD processor architecture disclosed in U.S. Pat. No. 5,956,274 (Duncan Elliott), in which a processor element is provided under each column of memory, multiple bit data can only be processed serially by a single processor element, and therefore the data must be read from the memory in series, and the processing element can only process one bit of the serial data at a time. Thus, for write operations to memory, Elliott's processor either requires additional circuitry, such as a 2D array of registers to enable data to be turned before being written to memory, or requires the rotation of data into a single column to be performed by a number of processor elements equal to the number of bits in the data. [0033]
  • Returning to the present embodiment, the control circuit 31 has a write control input port 47 for receiving a write control signal, e.g. a write enable (IWE) signal from the array controller 45. The control circuit 31 may control write operations in response to both the write control signal from the array controller and another state associated with the data processor, for example a state recorded in the control circuit 31. The control circuit 31 has a write control output port 49 for outputting a write enable signal, which in this embodiment may be passed or broadcast to each of the write enable lines 12 (which may be coupled together by a line 51) associated with each I/O memory port 8 for enabling a bit of a multiple bit word to be output by a respective processor element to its respective I/O port for storage in the memory 13. In another embodiment, circuitry may be provided for directing data output by the processor elements to another part of memory, or to another device, and circuitry may enable the data to be selectively directed to one of the local memory and another destination, as disclosed in the applicant's copending applications, Attorney docket Nos. 79135-4 and 79135-5 filed on 4th Mar. 2002, the disclosures of which are incorporated herein by reference. [0034]
  • The data processor may be adapted such that the processor elements can be reconfigured from operating together as a multiple bit word processor to operating individually or separately as independent elements. In this embodiment, the boundary circuit, which passes signals to the processor elements required for the PEs to operate together on multiple bit words would be conditioned or configured to enable the PEs to operate independently. To enable independent write operations from each PE, the write control circuitry, which controls multiple bit word write operations (i.e. when the processor elements are operating in parallel for multiple bit word processing) may be adapted to selectively enable each PE to perform independent write operations. Thus, instead of the write operations of all PEs being controlled by the same write enable signal, each PE would be controlled by a separate write control signal. In one embodiment, the data processor may be dynamically reconfigurable between a multiple independent PE processor, and a multi-bit word parallel processor, so that, for example the operation of the processor can be switched between these two operating modes between successive processes. [0035]
  • Embodiments of another aspect of the present invention provide a system for grouping SIMD based processing elements into a processing unit whose bit width can be varied to match that of the word being computed. As such it allows for the exchange of signals required for coordination and proper operation of the processing elements that are elements of the processing unit. [0036]
  • FIG. 2a is a schematic block diagram of an embodiment of the invention. A data processor 101 comprises a plurality of computational units 103, 105, each of which performs processing functions and contains at least one processing element (which may or may not be a SIMD based processing element). Each computational unit 100 has a bit width equal to the number of processing elements times the bit width of the processing elements. A boundary circuit 107, 109 is connected to and associated with each computational unit 103, 105. An extension circuit 111 is located between and connected to the two computational units 103, 105 and the two boundary circuits 107, 109 associated with the computational units 103, 105. The extension circuit 111 is used to combine computational units to widen the effective data path. For example, the extension circuit allows two N-bit computational units 103, 105 to be combined such that a 2N bit wide processing unit is formed. Each boundary circuit 107, 109 provides for the distribution of signals to its associated computational unit 103, 105, as for example described above in connection with the embodiment shown in FIG. 1. The basic repeating unit of circuits is presented as a first grouping 113 and the minimum grouping required for the formation of a 2N bit wide computational unit is shown as a second grouping 115. [0037]
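The widening of the effective data path can be sketched as grouping adjacent computational units whenever the extension circuit between them is in its coupling state. The function `effective_widths` and the list-of-flags representation are assumptions made for this illustration; the source does not prescribe how the coupling state is encoded.

```python
def effective_widths(cu_widths, coupled):
    """Return the width of each processing unit formed from adjacent CUs.

    cu_widths: bit width of each CU, left to right.
    coupled:   coupled[i] is True when the extension circuit between
               CU i and CU i + 1 joins them (hypothetical encoding).
    """
    assert len(coupled) == len(cu_widths) - 1
    groups = [cu_widths[0]]
    for width, joined in zip(cu_widths[1:], coupled):
        if joined:
            groups[-1] += width   # extend the current processing unit
        else:
            groups.append(width)  # start a new, independent unit
    return groups


pairs = effective_widths([8, 8, 8, 8], [True, False, True])
single = effective_widths([8, 8, 8, 8], [True, True, True])
```

With two N-bit units and one asserted flag the function returns a single 2N entry, which is the case described in the paragraph above.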
  • Another embodiment of the invention is presented in FIG. 2b. In this embodiment a memory such as a Random Access Memory (RAM) 117 is directly connected to the computational units 103, 105. The memory 117 is further connected to and is accessible through a bus 119. In this embodiment the computational units 103, 105 communicate directly with the memory 117 without having to use the bus 119. [0038]
  • A data processor according to another embodiment of the invention is illustrated in FIG. 3. The data processor 201 has first and second computational units (CU) 203, 205, each comprising a plurality of single bit processor elements 207, 209, 211, 213. The number of PEs in each computational unit may be selected depending on the application. For example, in one embodiment, each computational unit may contain 8 PEs so that each unit can parallel process 8 bit data. The number of computational units may also depend on the application. For example, if the processor is required to parallel process both 8 bit and 16 bit data, a minimum of two computational units would be required. (However, an 8-bit computational unit may be configured for processing 16-bit data, by processing one byte of the 16-bit data at a time). [0039]
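The parenthetical remark about handling 16-bit data one byte at a time on a single 8-bit computational unit can be illustrated with an addition performed in two passes. This is only a sketch: the function name is hypothetical, and holding the carry between the two passes is an assumption, since the text does not describe how the passes are linked.

```python
def add16_on_8bit_unit(a, b):
    """Add two 16-bit values using two passes through an 8-bit wide unit.

    Assumption: the carry out of the low-byte pass is held and fed into
    the high-byte pass. The result wraps modulo 2**16.
    """
    mask = 0xFF
    low = (a & mask) + (b & mask)                           # first pass: low bytes
    carry = low >> 8
    high = ((a >> 8) & mask) + ((b >> 8) & mask) + carry    # second pass: high bytes
    return ((high & mask) << 8) | (low & mask)


total = add16_on_8bit_unit(0x12FF, 0x0001)
```

The two passes are what the coupled 16-bit configuration avoids: with two computational units joined, both bytes are processed in the same step.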
  • Each processor element has an arithmetic logic unit (ALU) 215, 217, 219, 221 having an input port 223 and an output port 225, and one or more registers 227, which may provide data to one or more other inputs of the ALU. In this embodiment, the PEs in each CU are arranged in a one dimensional array and are arranged in an order corresponding to the bit order in a multiple bit word, so that the left most PE is in the MSB position and the right most PE is in the LSB position. The processor elements in the embodiment may be similar to and have any of the features of the processor elements of the embodiment shown in FIG. 1. [0040]
  • Each computational unit (203, 205) has an associated boundary circuit 229, 231 for controlling the transmission of data to the processor elements required for operation of the computational unit as a parallel processor. As for the embodiment described above and shown in FIG. 1, the outputs of both the MSB and LSB processor elements of each computational unit 203, 205 are coupleable to its respective boundary circuit 229, 231. Each boundary circuit is also coupleable to output data to the input 223 of its associated LSB PE 209, 213 and to output data to its associated MSB PE 207, 211. The boundary circuit can also broadcast data to a plurality of processor elements e.g. via the O/P port 212 and may receive external data via an I/P port 214, for example from the array controller (not shown). [0041]
  • An extension circuit 233 is provided to switchably couple the first and second computational units 203, 205 and their associated boundary circuits together to combine their individual word length parallel processing capacity, for example from an individual capacity of 8 bits to a combined capacity of 16 bits. The extension circuitry comprises first, second, third and fourth selector switches 251, 253, 255, 257 (which may comprise multiplexers or any other suitable switch), each having first and second input ports 259, 261, an output port 263 and a control input port 204. The first input port 259 of the first selector switch 251 is coupled to the output port 225 of the MSB ALU 219 of the second CU 205, the second input port 261 is coupled to an output port 266 of the first boundary circuit 229, and the output port 263 of the first selector switch is coupled to the input port 223 of the LSB ALU 217 of the first CU 203, and in this embodiment is coupleable to an input of the LSB PE 209 of the first CU 203. [0042]
  • The first input port 259 of the second selector switch 253 is coupled to the output port of the LSB ALU 217 of the first CU 203, the second port is coupled to an output 268 of the second boundary circuit 231, and the output port 263 of the second selector switch is coupled to an input 270 of the MSB processor element 211 of the second CU 205. [0043]
  • The first input port 259 of the third selector switch 255 is coupled to the output port 225 of the LSB ALU 217 of the first CU 203, the second input port 261 is coupled to the output port 225 of the LSB ALU 221 of the second CU 205, and the output port 263 is coupled to an input 272 of the first boundary circuit 229. [0044]
  • The first input port 259 of the fourth selector switch 257 is coupled to the output port 225 of the MSB ALU 219 of the second CU 205, the second input port is coupled to the output 225 of the MSB ALU 215 of the first CU 203, and the output port 263 is coupled to an input 274 of the second boundary circuit 231. [0045]
  • A control signal input 276 is provided for receiving control signals for controlling the selector switches 251, 253, 255, 257, from a controller, such as an array controller for controlling the computational units. [0046]
  • The extension circuit 233 has a first operating mode or state, in which the first and second CUs are decoupled and have their individual parallel processing capability, and a second operating mode or state, which couples the CUs together to combine their parallel processing capability, i.e. for parallel processing a word having a length which is the sum of the lengths of the words that they can parallel process individually. [0047]
  • [0048] In the first (i.e. decoupled) mode, the first selector switch 251 couples the output 266 of the first boundary circuit 229 to the input 223 of the LSB ALU 217 of the first CU 203, the second selector switch 253 couples the output 268 of the second boundary circuit 231 to the input 270 of the MSB PE 211 of the second CU 205, the third selector switch 255 couples the output 225 of the LSB ALU 217 of the first CU 203 to an input 272 of the first boundary circuit 229, and the fourth selector switch 257 couples the output of the MSB ALU 219 of the second CU 205 to an input 274 of the second boundary circuit 231.
  • [0049] In the second, coupled mode, the first selector switch 251 couples the output 225 of the MSB ALU 219 of the second CU 205 to the input 223 of the LSB ALU 217 of the first CU 203, the second selector switch 253 couples the output port 225 of the LSB ALU 217 of the first CU 203 to an input 270 of the MSB processor element 211 of the second CU 205, the third selector switch 255 couples the output 225 of the LSB ALU 221 of the second computational unit to an input 272 of the first boundary circuit 229, and the fourth selector switch 257 couples the output 225 of the MSB ALU 215 of the first computational unit 203 to the input 274 of the second boundary circuit 231.
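The two operating modes can be summarized as a routing table. The following Python sketch is illustrative only and is not part of the disclosed circuit: the names `Mode` and `route` and the signal labels are hypothetical, and each selector switch is modeled as an ideal two-way multiplexer whose selection depends only on the operating mode.

```python
from enum import Enum


class Mode(Enum):
    DECOUPLED = 0  # first mode: each CU works on its own word
    COUPLED = 1    # second mode: the two CUs form one extended processor


def route(mode):
    """Return {destination: source} for the four selector switches.

    Signal labels are hypothetical; cu1/cu2 stand for the first CU 203
    and the second CU 205, boundary1/boundary2 for circuits 229 and 231.
    """
    coupled = mode is Mode.COUPLED
    return {
        # switch 251 drives the input of the LSB ALU 217 of the first CU
        "cu1_lsb_alu_in": "cu2_msb_alu_out" if coupled else "boundary1_out",
        # switch 253 drives input 270 of the MSB PE 211 of the second CU
        "cu2_msb_pe_in": "cu1_lsb_alu_out" if coupled else "boundary2_out",
        # switch 255 drives input 272 of the first boundary circuit 229
        "boundary1_in": "cu2_lsb_alu_out" if coupled else "cu1_lsb_alu_out",
        # switch 257 drives input 274 of the second boundary circuit 231
        "boundary2_in": "cu1_msb_alu_out" if coupled else "cu2_msb_alu_out",
    }
```

Calling `route(Mode.DECOUPLED)` returns only connections that stay within one CU and its own boundary circuit, whereas `route(Mode.COUPLED)` returns connections that cross between the two CUs.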
  • [0050] Thus, in the coupled mode, the extension circuit provides the required connections to integrate the two arrays of processor elements of the first and second CUs into a parallel processor having the combined number of PEs. The MSB processor element 207 of the first CU 203 functions as the MSB PE of the extended processor, the LSB processor element 213 of the second CU 205 becomes the LSB PE of the extended processor. The LSB processor element 209 of the first CU and MSB processor element 211 of the second CU become adjacent intermediate PEs in the extended contiguous array of processor elements.
  • [0051] In coupled mode, a bit propagate bus is formed, via the first selector switch, between the output of the first ALU of the second CU and the input of the last ALU of the first CU to complete a continuous propagate chain through the series of ALUs of the extended processor, and the output of the first ALU of the first CU is coupled, via the fourth selector switch, to an input of the second boundary circuit. These two connections enable, for example, a left barrel shift in the extended processor, the bit from the MSB ALU 215 of the first CU being transmitted to the input of the LSB ALU of the second CU via the bus 281, the fourth selector switch 257 and the second boundary circuit.
  • [0052] In coupled mode, a connection is formed between the output 225 of the last ALU 217 of the first CU and the input 270 of the first PE of the second CU, via the second selector switch 253, to permit bit propagation therebetween, and a bus 283 is formed between the output 225 of the last ALU 221 of the second CU and an input 272 to the first boundary circuit 229, via the third selector switch 255. These connections provide the required connections, for example, for a right barrel shift, the bit output from the LSB ALU of the extended processor being transmitted to an input 278 of the MSB processor element via the bus 283, the third switch 255 and the first boundary circuit 229.
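The effect of the wrap-around connections on a barrel shift can be illustrated behaviorally. The Python sketch below is an assumption-laden model rather than a description of the hardware: the function names are hypothetical, each CU is represented as a list of bits written MSB first, and the first CU is taken to hold the high-order half of the extended word, as stated for the coupled mode.

```python
def rotate_left(bits):
    """Rotate a bit list (MSB first) left by one: the MSB wraps to the LSB."""
    return bits[1:] + bits[:1]


def rotate_right(bits):
    """Rotate a bit list (MSB first) right by one: the LSB wraps to the MSB."""
    return bits[-1:] + bits[:-1]


def barrel_shift(cu1, cu2, coupled, direction="left"):
    """Shift two CUs by one bit position, either as one word or separately.

    cu1 holds the high-order bits and cu2 the low-order bits (assumption).
    Returns the new contents of (cu1, cu2).
    """
    rotate = rotate_left if direction == "left" else rotate_right
    if coupled:
        # Extended processor: the bit leaving one end of the combined
        # word re-enters at the other end, crossing the CU boundary.
        word = rotate(cu1 + cu2)
        return word[:len(cu1)], word[len(cu1):]
    # Decoupled: each CU wraps its own end bit through its own boundary circuit.
    return rotate(cu1), rotate(cu2)
```

With `cu1 = [1, 0, 0, 0, 0, 0, 0, 0]` and `cu2 = [0, 0, 0, 0, 0, 0, 0, 1]`, a coupled left shift moves the leading 1 of `cu1` into the LSB position of `cu2`, while a decoupled left shift keeps it inside `cu1`.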
  • [0053] FIG. 4 shows a schematic diagram of a boundary circuit according to an embodiment of the present invention in more detail. Referring to FIG. 4, two boundary circuits 307, 309 are shown, together with their respective computational units 303, 305, and an extension circuit 311 for coupling the computational units and the boundary circuits together to form an extended parallel processor. The figure also shows part of two further extension circuits 315, 317, one to the left of the first CU and one to the right of the second CU, illustrating that the CUs may be part of an extended array of any number of computational units.
  • [0054] Each boundary circuit 307, 309 has first, second and third multiplexers 319, 321, 323, an AND gate 325 and a plurality of registers 327, 329, 331, 333, and 335.
  • [0055] Each boundary circuit is connected to an MSB bus 337, for carrying MSB signals either from the MSB ALU of its associated CU, if the CU is operating independently (or it functions as the leftmost CU of a coupled CU system and therefore carries the MSB of the extended processor), or the MSB bus carries the MSB from the output of the MSB ALU in a composite CU system. Similarly, each boundary circuit is connected to an LSB bus 339, for carrying LSB signals either from the LSB ALU of its associated CU, if the CU is operating independently (or it functions as the rightmost CU of a coupled CU system), or the LSB bus carries the LSB from the output of the LSB ALU in a composite CU system.
  • [0056] Each boundary circuit is connected to a common General Purpose Input (GPI) bit line 341, which carries SRB signals from an array controller for controlling operations of the CU's. Each CU is also connected to a common write control bus 343 for carrying write enable signals from the array controller, for controlling write operations from the CU's.
  • [0057] Each boundary circuit has a bus 345 connected between the output of the MSB ALU and the input of the LSB ALU, via the first selector switch 351 of the extension circuit, for carrying a signal designated IAO. This bus 345 is connected to the output of the first multiplexer 319, whose inputs are connected to the MSB, LSB, GPI buses, and an output of each of the five registers. The inputs of the five registers are also coupled for receiving MSB, LSB and GPI signals from MSB, LSB and SRB buses via the third multiplexer 323. Thus, the IAO signal can be any of the MSB or LSB of the local CU, the MSB or LSB of a composite CU system, or an SRB signal.
  • [0058] The second multiplexer 321 is coupled for receiving the GPI signal and signals from the third, fourth and fifth registers (which can latch the SRB signal, the local or composite system MSB and LSB signals), and can broadcast any of these signals (which may be referred to as a broadcast bit, BB) to all ALUs of the local CU simultaneously.
  • [0059] The three inputs to the AND gate are connected respectively to the WRITE ENABLE control signal bus 343, and the output of the first and second registers 327, 329, and the output of the AND gate is used to control write operations from the CU processor elements, for example to memory. It is to be noted that write operations are not only controlled by the array controller, but also by the local CU, via the boundary circuit, and in this embodiment are conditional on the content/state of both first and second registers.
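The write gating performed by the AND gate reduces to a three-input conjunction. The sketch below is a minimal Python illustration; the function name `write_allowed` and its parameter names are hypothetical and stand for the WRITE ENABLE signal and the contents of the first and second registers.

```python
def write_allowed(write_enable, reg1, reg2):
    """Model the three-input AND gate of a boundary circuit.

    A write from the CU's processor elements proceeds only when the
    array controller asserts write_enable AND both local registers
    (first and second) are set.
    """
    return bool(write_enable and reg1 and reg2)
```

Under this model, a CU can locally suppress a write that the array controller has enabled by clearing either register, which is one way to express conditional execution per CU.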
  • [0060] In this embodiment, additional logic 390 is provided for receiving the contents of the second register of each boundary circuit and performing a logical AND operation on the output of all boundary circuits. The output of this Global AND operation can be used to control further operations of the processor.
  • [0061] Similarly, in this embodiment, additional logic is provided for receiving the contents of the third register of each boundary circuit and performing a logical OR operation on the output of all boundary circuits. Again, the output of this Global OR operation can be used to control further operations of the processor. For example, the signal may be used to indicate that one of the CUs has reached a predetermined condition, and that further processing by the other CUs is not required. The output of the Global OR may then be used by the processor to terminate processing.
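The Global AND and Global OR reductions can be sketched as follows. This is an illustrative Python model only: the class `BoundaryState` and the functions `global_and` and `global_or` are hypothetical names, and only the second and third registers of each boundary circuit are represented.

```python
from dataclasses import dataclass


@dataclass
class BoundaryState:
    reg2: bool  # second register: feeds the Global AND
    reg3: bool  # third register: feeds the Global OR


def global_and(boundaries):
    """True only if the second register of every boundary circuit is set."""
    return all(b.reg2 for b in boundaries)


def global_or(boundaries):
    """True if the third register of at least one boundary circuit is set."""
    return any(b.reg3 for b in boundaries)
```

For example, a controller modeled on this sketch could stop issuing instructions as soon as `global_or(...)` becomes true, matching the early-termination use described above.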
  • [0062] The extension circuit may be controlled by the array controller, and may be controlled dynamically in response to the length of the word to be processed to extend or contract the number of CUs required to operate together.
  • [0063] Any number of CUs can be combined to extend the length (i.e. number of bits) of the word that can be processed. For example, a plurality of eight-bit computational units may be combined to form one or more 16-bit processors, one or more 32-bit processors, one or more 64-bit processors, or one or more 128-bit processors, etc. Each CU has at least one processor element, which may be a 1-bit processor element or a processor element of 2 or more bits, and may have any number of PEs. Different CUs may contain the same number of PEs, or different numbers of PEs. Thus, any number of CUs having any number of PEs may be combined to enable a word of a given length to be parallel processed.
  • [0064] FIG. 5 shows an array of computational units, in which two or more adjacent computational units may be combined into a composite computational unit through extension circuitry control signals (the extension circuitry is not shown in FIG. 5). In this embodiment, each CU comprises an 8-bit CU and can be controlled to allow the CU's to operate either individually as 8-bit parallel processors, or combined into 16-bit parallel processors or 32-bit parallel processors.
  • [0065] In operation, the computational units are configured by a three-bit control signal CUX_SEL[2:0]. When all bits of CUX_SEL[2:0] are 0, the CUs operate in 8-bit mode. When CUX_SEL[2:1] are 0 and CUX_SEL[0] is 1, the circuit operates in 16-bit mode. When CUX_SEL[2] is 0 and both bits of CUX_SEL[1:0] are 1, the circuit operates in 32-bit mode. When all bits of CUX_SEL[2:0] are 1, all the CUs operate together and the circuit is in 256-bit mode. This embodiment is for illustrative purposes only and any other configurations are possible.
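The CUX_SEL settings listed above can be written as a small lookup. The Python sketch below is an assumption: it treats CUX_SEL as a 3-bit integer with bit 0 as the least significant bit, the names `CUX_SEL_MODES` and `word_length` are hypothetical, and combinations that the description does not mention are rejected rather than guessed.

```python
# Word length in bits for each CUX_SEL[2:0] value named in the description.
CUX_SEL_MODES = {
    0b000: 8,    # all bits 0: CUs operate individually
    0b001: 16,   # CUX_SEL[0] = 1
    0b011: 32,   # CUX_SEL[1:0] = 1
    0b111: 256,  # all bits 1: all CUs operate together
}


def word_length(cux_sel):
    """Return the operating word length for a CUX_SEL value.

    Raises ValueError for encodings the description does not define.
    """
    try:
        return CUX_SEL_MODES[cux_sel]
    except KeyError:
        raise ValueError(f"undefined CUX_SEL value: {cux_sel:#05b}") from None
```

For instance, `word_length(0b011)` gives 32, while `word_length(0b010)` raises `ValueError` because that combination is not described.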
  • [0066] Modifications and changes to the embodiments described above will be apparent to those skilled in the art.

Claims (27)

1. A circuit comprising a processor having a plurality of processor elements, each having an arithmetic logic unit, and a controller for controlling said processor elements, means for providing a respective bit of a multiple bit word to each of said processor elements, and transmission means for enabling signals to be transmitted between said arithmetic logic units, to enable said units to perform a parallel operation on the bits of said multiple bit word.
2. A circuit as claimed in claim 1, wherein each arithmetic logic unit has an output and an input, and said transmission means includes means for coupling the output of each ALU directly to the input of its adjacent ALU which processes a higher order bit.
3. A circuit as claimed in claim 1, wherein said transmission means includes means for coupling the output of the ALU for processing the most significant bit of the multiple bit word directly to the input of the ALU for processing the least significant bit of the multiple bit word.
4. A circuit as claimed in claim 1, wherein the processor element for processing the MSB of the multiple bit word has an input, and the transmission means includes coupling means for coupling the output of the ALU for processing the LSB directly to the input of the MSB processing element.
5. A circuit as claimed in claim 1, further comprising a memory coupleable to each of said processor elements.
6. A die containing a circuit as claimed in claim 1.
7. A circuit comprising a plurality of computational units, each having at least one processor element and extension means for switchably enabling the transmission of signals between said computational units to enable said units to perform a parallel operation on a multiple bit word wherein at least one bit of said word is provided to each computational unit.
8. A circuit as claimed in claim 7, wherein said extension means is capable of coupling the output of the processor element that processes the highest order bit of a computational unit to another computational unit that processes the next higher order bit.
9. A circuit as claimed in claim 8, wherein said extension means is capable of coupling said output to the input of the processor element that processes the next higher order bit of said other computational unit.
10. A circuit as claimed in claim 9, wherein said processor element that processes said highest order bit includes an arithmetic logic unit, and said output comprises the output of said arithmetic logic unit.
11. A circuit as claimed in claim 9, wherein said processor element that processes said higher order bit has an arithmetic logic unit, and said input comprises an input to said arithmetic logic unit.
12. A circuit as claimed in claim 7, wherein said extension means includes a selector switch for selectively coupling said input to one of said output and a port capable of receiving a bit from said other computational unit, or another computational unit.
13. A circuit as claimed in claim 7, wherein said extension means is capable of coupling the output of the processor element that processes the lowest order bit of a computational unit to another computational unit that processes the next lower order bit.
14. A circuit as claimed in claim 13, wherein said extension means is capable of coupling said output to the input of the processor element that processes the next lower order bit of said other computational unit.
15. A circuit as claimed in claim 14, wherein said processor element that processes said lowest order bit includes an arithmetic logic unit, and said output comprises the output of said arithmetic logic unit.
16. A circuit as claimed in claim 15, wherein said extension means includes a selector switch for selectively coupling said input to one of said output and a port for receiving another bit either from said other computational unit that processes the lower order bit or from another computational unit.
17. A circuit as claimed in claim 7, wherein said extension means is capable of coupling the output of the processor element that processes the most significant bit of the multi-bit word to at least one other computational unit that processes one or more bits of said multi-bit word.
18. A circuit as claimed in claim 17, wherein said extension means is capable of coupling the output of said processor element that processes the most significant bit of the multi-bit word to the computational unit that processes the least significant bit of the multi-bit word.
19. A circuit as claimed in claim 7, wherein extension means includes a selector switch for selectively coupling one of the output of the processor element of a computational unit that processes the highest order bit of that computational unit and an output of the processor element that processes the highest order bit of another computational unit, the highest order bit of the other computational unit being higher than the highest order bit of the one computational unit, to an output port coupled to said one computational unit.
20. A circuit as claimed in claim 7, wherein said extension means is capable of coupling the output of the processor element that processes the least significant bit of the multi-bit word to at least one other computational unit that processes one or more bits of said multi-bit word.
21. A circuit as claimed in claim 20, wherein said extension means is capable of coupling the output of the processor element that processes the least significant bit of the multi-bit word to the computational unit that processes the most significant bit of the multiple bit word.
22. A circuit as claimed in claim 7, comprising a selector switch for selectively coupling one of the output of the processor element that processes the lowest order bit of a computational unit and the output of a processor element that processes the lowest order bit of another computational unit, wherein the lowest order bit of the other computational unit is lower than the one computational unit, to a port coupled to the one computational unit or to a computational unit that processes one or more higher order bits than the one computational unit.
23. A circuit as claimed in claim 7, comprising a first CU and a second CU, said first CU having a plurality of processor elements each having an arithmetic logic unit arranged together for performing parallel operations on multiple bit data, and wherein said extension means is arranged to enable a bit output from the MSB ALU of the first CU to be transmitted to the input of the LSB ALU of the second CU.
24. A circuit as claimed in claim 7, comprising a first CU and a second CU, said second CU having a plurality of processor elements, each having an arithmetic logic unit arranged together for performing parallel operations on multiple bit data, and wherein said extension means is arranged to enable a bit output from the MSB ALU of the second CU to be transmitted to the input of the LSB ALU of the first CU.
25. A circuit as claimed in claim 7, comprising a first CU and a second CU, the first CU having a plurality of processor elements, each having an arithmetic logic unit arranged together for performing parallel operations on multiple bit data, and wherein said extension means is arranged to enable a bit output from the LSB ALU of the first CU to be transmitted to the second CU.
26. A circuit as claimed in claim 25, wherein said second CU comprises a plurality of processing elements, and said extension means is arranged to enable a bit from said LSB ALU of said first CU to be transmitted to the input of the MSB ALU of said second CU.
27. A circuit as claimed in claim 7, comprising a first CU and a second CU, said first and second CUs having a plurality of processor elements, each having an arithmetic logic unit arranged together for performing parallel operations on multiple bit data, and wherein said extension means enables the output of the MSB ALU of the second CU to be coupled to the input of the LSB ALU of the first CU.
US10/469,518 2001-03-02 2002-03-04 Apparatus for variable word length computing in an array processor Abandoned US20040254965A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US10/469,518 US20040254965A1 (en) 2001-03-02 2002-03-04 Apparatus for variable word length computing in an array processor

Applications Claiming Priority (3)

Application Number Priority Date Filing Date Title
US27230101P 2001-03-02 2001-03-02
US10/469,518 US20040254965A1 (en) 2001-03-02 2002-03-04 Apparatus for variable word length computing in an array processor
PCT/CA2002/000279 WO2002071240A2 (en) 2001-03-02 2002-03-04 Apparatus for variable word length computing in an array processor

Publications (1)

Publication Number Publication Date
US20040254965A1 true US20040254965A1 (en) 2004-12-16

Family

ID=23039227

Family Applications (2)

Application Number Title Priority Date Filing Date
US10/469,518 Abandoned US20040254965A1 (en) 2001-03-02 2002-03-04 Apparatus for variable word length computing in an array processor
US11/623,786 Expired - Lifetime US7272691B2 (en) 2001-03-02 2007-01-17 Interconnect switch assembly with input and output ports switch coupling to processor or memory pair and to neighbor ports coupling to adjacent pairs switch assemblies

Family Applications After (1)

Application Number Title Priority Date Filing Date
US11/623,786 Expired - Lifetime US7272691B2 (en) 2001-03-02 2007-01-17 Interconnect switch assembly with input and output ports switch coupling to processor or memory pair and to neighbor ports coupling to adjacent pairs switch assemblies

Country Status (7)

Country Link
US (2) US20040254965A1 (en)
EP (3) EP1384160A2 (en)
AT (1) ATE404923T1 (en)
AU (3) AU2002240742A1 (en)
CA (3) CA2478570A1 (en)
DE (1) DE60228223D1 (en)
WO (3) WO2002071246A2 (en)

Families Citing this family (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7471643B2 (en) * 2002-07-01 2008-12-30 Panasonic Corporation Loosely-biased heterogeneous reconfigurable arrays
US7461234B2 (en) * 2002-07-01 2008-12-02 Panasonic Corporation Loosely-biased heterogeneous reconfigurable arrays
US7421691B1 (en) * 2003-12-23 2008-09-02 Unisys Corporation System and method for scaling performance of a data processing system
EP2551769A4 (en) * 2010-03-25 2013-11-27 Fujitsu Ltd Multi-core processor system, memory controller control method and memory controller control program
US10142124B2 (en) * 2012-05-24 2018-11-27 Infineon Technologies Ag System and method to transmit data over a bus system
US9798550B2 (en) * 2013-01-09 2017-10-24 Nxp Usa, Inc. Memory access for a vector processor
JP6308095B2 (en) * 2014-10-08 2018-04-11 富士通株式会社 Arithmetic circuit and control method of arithmetic circuit
US9971541B2 (en) 2016-02-17 2018-05-15 Micron Technology, Inc. Apparatuses and methods for data movement
DE102016003362A1 (en) * 2016-03-18 2017-09-21 Giesecke+Devrient Currency Technology Gmbh Device and method for evaluating sensor data for a document of value
US10268389B2 (en) * 2017-02-22 2019-04-23 Micron Technology, Inc. Apparatuses and methods for in-memory operations
US10318168B2 (en) 2017-06-19 2019-06-11 Micron Technology, Inc. Apparatuses and methods for simultaneous in data path compute operations
US20220383446A1 (en) * 2021-05-28 2022-12-01 MemComputing, Inc. Memory graphics processing unit

Family Cites Families (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US4807184A (en) * 1986-08-11 1989-02-21 Ltv Aerospace Modular multiple processor architecture using distributed cross-point switch
FR2623310B1 (en) 1987-11-16 1990-02-16 Commissariat Energie Atomique DEVICE FOR PROCESSING DATA RELATING TO IMAGE ELEMENTS
US5058053A (en) * 1988-03-31 1991-10-15 International Business Machines Corporation High performance computer system with unidirectional information flow
JPH01253059A (en) * 1988-04-01 1989-10-09 Kokusai Denshin Denwa Co Ltd <Kdd> Parallel signal processing system
US5056000A (en) 1988-06-21 1991-10-08 International Parallel Machines, Inc. Synchronized parallel processing with shared memory
EP0463721A3 (en) * 1990-04-30 1993-06-16 Gennum Corporation Digital signal processing device
US5325500A (en) * 1990-12-14 1994-06-28 Xerox Corporation Parallel processing units on a substrate, each including a column of memory
KR940002573B1 (en) 1991-05-11 1994-03-25 삼성전자 주식회사 Optical disk recording playback device and method
FR2686175B1 (en) * 1992-01-14 1996-12-20 Andre Thepaut MULTIPROCESSOR DATA PROCESSING SYSTEM.
US5937202A (en) * 1993-02-11 1999-08-10 3-D Computing, Inc. High-speed, parallel, processor architecture for front-end electronics, based on a single type of ASIC, and method use thereof
US5590356A (en) * 1994-08-23 1996-12-31 Massachusetts Institute Of Technology Mesh parallel computer architecture apparatus and associated methods
JPH0877002A (en) * 1994-08-31 1996-03-22 Sony Corp Parallel processor device
WO2000062182A2 (en) 1999-04-09 2000-10-19 Clearspeed Technology Limited Parallel data processing apparatus
FR2795840B1 (en) 1999-07-02 2001-08-31 Commissariat Energie Atomique NETWORK OF PARALLEL PROCESSORS WITH FAULT TOLERANCE OF THESE PROCESSORS, AND RECONFIGURATION PROCEDURE APPLICABLE TO SUCH A NETWORK
US6779128B1 (en) * 2000-02-18 2004-08-17 Invensys Systems, Inc. Fault-tolerant data transfer
GB2370381B (en) 2000-12-19 2003-12-24 Picochip Designs Ltd Processor architecture

Patent Citations (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5247689A (en) * 1985-02-25 1993-09-21 Ewert Alfred P Parallel digital processor including lateral transfer buses with interrupt switches to form bus interconnection segments
US4831519A (en) * 1985-12-12 1989-05-16 Itt Corporation Cellular array processor with variable nesting depth vector control by selective enabling of left and right neighboring processor cells
US4789957A (en) * 1986-03-28 1988-12-06 Texas Instruments Incorporated Status output for a bit slice ALU
US5129092A (en) * 1987-06-01 1992-07-07 Applied Intelligent Systems,Inc. Linear chain of parallel processors and method of using same
US5956274A (en) * 1990-10-18 1999-09-21 Mosaid Technologies Incorporated Memory device with multiple processors having parallel access to the same memory area
US5546343A (en) * 1990-10-18 1996-08-13 Elliott; Duncan G. Method and apparatus for a single instruction operating multiple processors on a memory chip
US5878241A (en) * 1990-11-13 1999-03-02 International Business Machine Partitioning of processing elements in a SIMD/MIMD array processor
US5581773A (en) * 1992-05-12 1996-12-03 Glover; Michael A. Massively parallel SIMD processor which selectively transfers individual contiguously disposed serial memory elements
US5726928A (en) * 1994-01-28 1998-03-10 Goldstar Electron Co., Ltd. Arithmetic logic unit circuit with reduced propagation delays
US5797027A (en) * 1996-02-22 1998-08-18 Sharp Kubushiki Kaisha Data processing device and data processing method
US6266760B1 (en) * 1996-04-11 2001-07-24 Massachusetts Institute Of Technology Intermediate-grain reconfigurable processing device
US6044448A (en) * 1997-12-16 2000-03-28 S3 Incorporated Processor having multiple datapath instances
US20020124038A1 (en) * 2000-08-18 2002-09-05 Masahiro Saitoh Processor for processing variable length data

Cited By (30)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20070153822A1 (en) * 2002-11-01 2007-07-05 Zarlink Semiconductor V.N. Inc. Media Access Control Device for High Efficiency Ethernet Backplane
US9003422B2 (en) 2004-05-19 2015-04-07 Synopsys, Inc. Microprocessor architecture having extendible logic
US8719837B2 (en) 2004-05-19 2014-05-06 Synopsys, Inc. Microprocessor architecture having extendible logic
US7757048B2 (en) * 2005-04-29 2010-07-13 Mtekvision Co., Ltd. Data processor apparatus and memory interface
US20070011411A1 (en) * 2005-04-29 2007-01-11 Mtekvision Co., Ltd. Data processor apparatus and memory interface
US20100146315A1 (en) * 2005-06-09 2010-06-10 Qualcomm Incorporated Software Selectable Adjustment of SIMD Parallelism
EP2290527A3 (en) * 2005-06-09 2011-03-16 Qualcomm Incorporated Microprocessor with automatic selection of SIMD parallelism
WO2006135554A3 (en) * 2005-06-09 2007-12-13 Qualcomm Inc Microprocessor with automatic selection of simd parallelism
EP1894091A2 (en) * 2005-06-09 2008-03-05 QUALCOMM Incorporated Microprocessor with automatic selection of simd parallelism
US20060282646A1 (en) * 2005-06-09 2006-12-14 Dockser Kenneth A Software selectable adjustment of SIMD parallelism
EP1894091A4 (en) * 2005-06-09 2008-08-13 Qualcomm Inc Microprocessor with automatic selection of simd parallelism
JP2008544350A (en) * 2005-06-09 2008-12-04 クゥアルコム・インコーポレイテッド Microprocessor with automatic selection of SIMD parallel processing
US7694114B2 (en) * 2005-06-09 2010-04-06 Qualcomm Incorporated Software selectable adjustment of SIMD parallelism
US8799627B2 (en) 2005-06-09 2014-08-05 Qualcomm Incorporated Software selectable adjustment of SIMD parallelism
US20060282826A1 (en) * 2005-06-09 2006-12-14 Dockser Kenneth A Microprocessor with automatic selection of SIMD parallelism
WO2006135554A2 (en) 2005-06-09 2006-12-21 Qualcomm Incorporated Microprocessor with automatic selection of simd parallelism
US7836284B2 (en) * 2005-06-09 2010-11-16 Qualcomm Incorporated Microprocessor with automatic selection of processing parallelism mode based on width data of instructions
KR101006030B1 (en) * 2005-06-09 2011-01-06 퀄컴 인코포레이티드 Microprocessor with automatic selection of simd parallelism
US8122231B2 (en) * 2005-06-09 2012-02-21 Qualcomm Incorporated Software selectable adjustment of SIMD parallelism
US7971042B2 (en) 2005-09-28 2011-06-28 Synopsys, Inc. Microprocessor system and method for instruction-initiated recording and execution of instruction sequences in a dynamically decoupleable extended instruction pipeline
US20070186082A1 (en) * 2006-02-06 2007-08-09 Boris Prokopenko Stream Processor with Variable Single Instruction Multiple Data (SIMD) Factor and Common Special Function
US20070226455A1 (en) * 2006-03-13 2007-09-27 Cooke Laurence H Variable clocked heterogeneous serial array processor
US8656143B2 (en) 2006-03-13 2014-02-18 Laurence H. Cooke Variable clocked heterogeneous serial array processor
US20100138633A1 (en) * 2006-03-13 2010-06-03 Cooke Laurence H Variable clocked heterogeneous serial array processor
US9329621B2 (en) 2006-03-13 2016-05-03 Laurence H. Cooke Variable clocked serial array processor
US9823689B2 (en) 2006-03-13 2017-11-21 Laurence H. Cooke Variable clocked serial array processor
US8532288B2 (en) * 2006-12-01 2013-09-10 International Business Machines Corporation Selectively isolating processor elements into subsets of processor elements
US20080130874A1 (en) * 2006-12-01 2008-06-05 International Business Machines Corporation Runtime configurability for crt & non crt mode
US20220100255A1 (en) * 2018-07-29 2022-03-31 Redpine Signals, Inc. Unit Element for performing Multiply-Accumulate Operations
US11640196B2 (en) * 2018-07-29 2023-05-02 Ceremorphic, Inc. Unit element for performing multiply-accumulate operations

Also Published As

Publication number Publication date
EP1384158A2 (en) 2004-01-28
US7272691B2 (en) 2007-09-18
EP1381957A2 (en) 2004-01-21
ATE404923T1 (en) 2008-08-15
WO2002071246A3 (en) 2003-05-15
CA2478570A1 (en) 2002-09-12
AU2002252863A1 (en) 2002-09-19
WO2002071239A2 (en) 2002-09-12
US20070118721A1 (en) 2007-05-24
CA2478573A1 (en) 2002-09-12
EP1384158B1 (en) 2008-08-13
AU2002240742A1 (en) 2002-09-19
CA2478573C (en) 2010-05-25
CA2478571A1 (en) 2002-09-12
WO2002071246A2 (en) 2002-09-12
DE60228223D1 (en) 2008-09-25
WO2002071240A3 (en) 2003-05-30
EP1384160A2 (en) 2004-01-28
AU2002238325A1 (en) 2002-09-19
WO2002071239A3 (en) 2003-04-10
WO2002071240A2 (en) 2002-09-12

Similar Documents

Publication Publication Date Title
US20040254965A1 (en) Apparatus for variable word length computing in an array processor
US6496918B1 (en) Intermediate-grain reconfigurable processing device
US4748585A (en) Processor utilizing reconfigurable process segments to accomodate data word length
US7464251B2 (en) Method and apparatus for configuring arbitrary sized data paths comprising multiple context processing elements
US3537074A (en) Parallel operating array computer
US6009451A (en) Method for generating barrel shifter result flags directly from input data
US6006321A (en) Programmable logic datapath that may be used in a field programmable device
US4901268A (en) Multiple function data processor
US6108760A (en) Method and apparatus for position independent reconfiguration in a network of multiple context processing elements
US6754809B1 (en) Data processing apparatus with indirect register file access
US6839831B2 (en) Data processing apparatus with register file bypass
US10340920B1 (en) High performance FPGA addition
JPS6330647B2 (en)
JPH01232463A (en) Data processor system and video processor system equipped therewith
SG44642A1 (en) A massively multiplexed superscalar harvard architecture computer
WO2001031418A2 (en) Wide connections for transferring data between PE's of an n-dimensional mesh-connected simd array while transferring operands from memory
US6150836A (en) Multilevel logic field programmable device
Mirsky Coarse-Grain Reconfigurable Computing
US7545196B1 (en) Clock distribution for specialized processing block in programmable logic device
US6728741B2 (en) Hardware assist for data block diagonal mirror image transformation
US11016822B1 (en) Cascade streaming between data processing engines in an array
US5192882A (en) Synchronization circuit for parallel processing
US6728863B1 (en) Wide connections for transferring data between PE's of an N-dimensional mesh-connected SIMD array while transferring operands from memory
US7395294B1 (en) Arithmetic logic unit
WO2006083768A2 (en) Same instruction different operation (sido) computer with short instruction and provision of sending instruction code through data

Legal Events

Date Code Title Description
AS Assignment

Owner name: ATSANA SEMICONDUCTOR CORP., CANADA

Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:GIERNALCZYK, ERIC;STEWART, MALCOLM;REEL/FRAME:015704/0974;SIGNING DATES FROM 20020620 TO 20020624

AS Assignment

Owner name: MTEKVISION CO., LTD., KOREA, REPUBLIC OF

Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNOR:ATSANA SEMICONDUCTOR CORP.;REEL/FRAME:016703/0945

Effective date: 20050817

STCB Information on status: application discontinuation

Free format text: ABANDONED -- FAILURE TO RESPOND TO AN OFFICE ACTION