US20060026578A1 - Programmable processor architecture hierarchical compilation - Google Patents
Programmable processor architecture hierarchical compilation
- Publication number
- US20060026578A1 (U.S. application Ser. No. 11/195,429)
- Authority
- US
- United States
- Prior art keywords
- block
- code
- sub
- processor
- scenario
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Abandoned
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/30—Arrangements for executing machine instructions, e.g. instruction decode
- G06F9/38—Concurrent instruction execution, e.g. pipeline, look ahead
- G06F9/3877—Concurrent instruction execution, e.g. pipeline, look ahead using a slave processor, e.g. coprocessor
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F1/00—Details not covered by groups G06F3/00 - G06F13/00 and G06F21/00
- G06F1/26—Power supply means, e.g. regulation thereof
- G06F1/32—Means for saving power
- G06F1/3203—Power management, i.e. event-based initiation of a power-saving mode
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F15/00—Digital computers in general; Data processing equipment in general
- G06F15/76—Architectures of general purpose stored program computers
- G06F15/78—Architectures of general purpose stored program computers comprising a single central processing unit
- G06F15/7807—System on chip, i.e. computer system on a single chip; System in package, i.e. computer system on one or more chips in a single package
- G06F15/781—On-chip cache; Off-chip memory
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F15/00—Digital computers in general; Data processing equipment in general
- G06F15/76—Architectures of general purpose stored program computers
- G06F15/78—Architectures of general purpose stored program computers comprising a single central processing unit
- G06F15/7839—Architectures of general purpose stored program computers comprising a single central processing unit with memory
- G06F15/7842—Architectures of general purpose stored program computers comprising a single central processing unit with memory on one IC chip (single chip microcontrollers)
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F15/00—Digital computers in general; Data processing equipment in general
- G06F15/76—Architectures of general purpose stored program computers
- G06F15/78—Architectures of general purpose stored program computers comprising a single central processing unit
- G06F15/7867—Architectures of general purpose stored program computers comprising a single central processing unit with reconfigurable architecture
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F15/00—Digital computers in general; Data processing equipment in general
- G06F15/76—Architectures of general purpose stored program computers
- G06F15/80—Architectures of general purpose stored program computers comprising an array of processing units with common control, e.g. single instruction multiple data processors
- G06F15/8053—Vector processors
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/30—Arrangements for executing machine instructions, e.g. instruction decode
- G06F9/30003—Arrangements for executing specific machine instructions
- G06F9/30007—Arrangements for executing specific machine instructions to perform operations on data operands
- G06F9/3001—Arithmetic instructions
- G06F9/30014—Arithmetic instructions with variable precision
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/30—Arrangements for executing machine instructions, e.g. instruction decode
- G06F9/30003—Arrangements for executing specific machine instructions
- G06F9/30007—Arrangements for executing specific machine instructions to perform operations on data operands
- G06F9/30032—Movement instructions, e.g. MOVE, SHIFT, ROTATE, SHUFFLE
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/30—Arrangements for executing machine instructions, e.g. instruction decode
- G06F9/30003—Arrangements for executing specific machine instructions
- G06F9/30007—Arrangements for executing specific machine instructions to perform operations on data operands
- G06F9/30036—Instructions to perform operations on packed data, e.g. vector, tile or matrix operations
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/30—Arrangements for executing machine instructions, e.g. instruction decode
- G06F9/38—Concurrent instruction execution, e.g. pipeline, look ahead
- G06F9/3824—Operand accessing
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/30—Arrangements for executing machine instructions, e.g. instruction decode
- G06F9/38—Concurrent instruction execution, e.g. pipeline, look ahead
- G06F9/3824—Operand accessing
- G06F9/3826—Bypassing or forwarding of data results, e.g. locally between pipeline stages or within a pipeline stage
- G06F9/3828—Bypassing or forwarding of data results, e.g. locally between pipeline stages or within a pipeline stage with global bypass, e.g. between pipelines, between clusters
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/30—Arrangements for executing machine instructions, e.g. instruction decode
- G06F9/38—Concurrent instruction execution, e.g. pipeline, look ahead
- G06F9/3885—Concurrent instruction execution, e.g. pipeline, look ahead using a plurality of independent parallel functional units
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/30—Arrangements for executing machine instructions, e.g. instruction decode
- G06F9/38—Concurrent instruction execution, e.g. pipeline, look ahead
- G06F9/3885—Concurrent instruction execution, e.g. pipeline, look ahead using a plurality of independent parallel functional units
- G06F9/3893—Concurrent instruction execution, e.g. pipeline, look ahead using a plurality of independent parallel functional units controlled in tandem, e.g. multiplier-accumulator
- G06F9/3895—Concurrent instruction execution, e.g. pipeline, look ahead using a plurality of independent parallel functional units controlled in tandem, e.g. multiplier-accumulator for complex operations, e.g. multidimensional or interleaved address generators, macros
Definitions
- This invention relates generally to the field of processors and more particularly, to processors having low power consumption, high performance, low die area, and flexibly and scalably employed in multimedia and communications applications.
- One of the emerging, if not current, requirements of processors is low power consumption combined with the ability to execute code sufficient to process multiple applications. Current power consumption is on the order of sub-hundreds of milliwatts per application, whereas the goal is to remain under sub-hundreds of milliwatts while executing multiple applications. Another requirement of processors is low cost. Due to the wide utilization of processors in consumer products, the processor must be inexpensive to manufacture; otherwise, its use in most common consumer electronics is not pragmatic.
- DSPs: digital signal processors
- ASICs: application specific integrated circuits
- RISC and Super Scalar processors have been the most widely accepted architectural solution for all general purpose computing. They are often enhanced with application specific accelerators for solving certain specialized problems within the context of a general solution.
- Examples include: ARM series, ARC series, StrongARM series, and MIPS series.
- VLIW: Very Long Instruction Word
- VLIW architectures eliminated some of the inefficiencies found in RISC and Super Scalar architectures to create a fairly general solution in the digital signal processing space. Parallelism was significantly increased. The onus of scheduling was transferred from hardware to software to save area.
- Examples include: TI 64xx, TI 55xx, StarCore SC140, ADI SHARC series.
- One embodiment of the present invention includes a heterogeneous, high-performance, scalable processor having at least one W-type sub-processor capable of processing W bits or greater in parallel, W being an integer value, and at least one N-type sub-processor capable of processing N bits in parallel, N being an integer value smaller than W.
- a scenario compiler is included in a hierarchical flow of compilation and used with other compilation and assembler blocks to generate binary code based on different types of codes to allow for efficient processing based on the sub-processors while maintaining low power consumption when the binary code is executed.
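The hierarchical compilation flow described above — assembly-coded kernels rolled up into C-coded applications, which are rolled up into a scenario that a scenario compiler turns into binary code — can be sketched as a data structure walk. This is a minimal illustration in C; the type and function names (`Kernel`, `Application`, `Scenario`, `compile_units`) are hypothetical, not the patent's actual toolchain.

```c
/* Sketch of the hierarchical compilation flow, assuming one translation
 * unit per kernel (assembly), one per application (C), and one for the
 * scenario descriptor itself. All names are illustrative. */
#define MAX_UNITS 16

typedef struct { const char *name; } Kernel;   /* assembly-coded kernel */

typedef struct {                               /* C-coded application */
    const char *name;
    Kernel kernels[MAX_UNITS];
    int n_kernels;
} Application;

typedef struct {                               /* scenario descriptor */
    Application apps[MAX_UNITS];
    int n_apps;
} Scenario;

/* Count the translation units the hierarchy contributes to the final
 * binary: the scenario, each application, and each kernel. */
int compile_units(const Scenario *s)
{
    int units = 1;                      /* the scenario descriptor */
    for (int a = 0; a < s->n_apps; a++) {
        units += 1;                     /* the application's C code */
        units += s->apps[a].n_kernels;  /* its assembly kernels */
    }
    return units;
}
```

The point of the sketch is only the hierarchy itself: each level is compiled or assembled separately and linked into per-sub-processor binaries, as the scenario compiler of FIG. 4 does.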
- FIG. 1 shows an application 10 with reference to a digital product 12 including an embodiment of the present invention
- FIG. 2 shows an exemplary integrated circuit 20 including a heterogenous, high-performance, scalable processor 22 coupled to a memory controller and direct memory access (DMA) circuit 24 in accordance with an embodiment of the present invention.
- DMA direct memory access
- FIG. 3 shows, in conceptual form, an architecture 300 including software architecture 302 in combination with some of the hardware components 304 of the circuit 20 of FIG. 1 .
- FIG. 4 shows, in conceptual form, the process of hierarchical software compilation, in block flow form, in accordance with a method of the present invention.
- FIG. 5 shows, in conceptual form, the process of and apparatus for latency verification used in assembly coding and included within the block 434 of FIG. 4 in accordance with a method and apparatus of the present invention.
- a sub-processor (“CoolProcessor”) employing logic “macro-functional-units” (function-oriented dedicated logic), replacing the classical fixed datapath, complex execution unit, and register set used in general purpose CPUs and DSP engines, and also replacing the identical processing element used in homogeneous multiprocessors (MSs).
- the sub-processor employs a heterogeneous interconnect between heterogeneous processors, designed to match multimedia and communications applications.
- a processor 22 comprises a plurality of sub-processors.
- the four sub-processors are split into two categories.
- the letter “W” designates CoolW sub-processors, capable of handling operands requiring wide datapaths.
- the CoolW sub-processor will support a wider range of data bit widths.
- the sub-processor is also capable of executing 64-bit IEEE-standard floating-point instructions. Its performance is greater than 49 MFLOPS at 150 MHz.
- the floating-point instruction set includes addition, subtraction, and multiplication.
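The performance figure above implies a sustained rate of well under one floating-point result per cycle: 49 MFLOPS at 150 MHz is roughly one result every three cycles. A back-of-envelope check, with a hypothetical helper name:

```c
/* Flops per cycle from a MFLOPS rating and a clock in MHz; both are in
 * millions, so the units cancel. Name is illustrative only. */
double flops_per_cycle(double mflops, double clock_mhz)
{
    return mflops / clock_mhz;
}
```

For the figures in the text, `flops_per_cycle(49.0, 150.0)` is about 0.33, i.e. approximately one floating-point result every three cycles.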
- the letter “N” indicates a CoolN sub-processor serving narrow datapaths, such as required for average-quality imaging and finite-field operations in communications.
- Each sub-processor comprises a heterogeneous, software-programmable datapath connecting compute engines (in the CoolW sub-processor type) or compute engines (in the CoolN sub-processor type).
- the internal compute engines are referred to as MFUs. Multiple instances of the MFUs are nonuniformly distributed between the two types of sub-processors.
- a control circuit within each sub-processor operates as an engine and is a high language programmable controller for the sub-processor.
- the control circuit is aided by a core sequencer underscoring the hard-wired nature of the MFUs: each unit is aimed at executing efficiently only a fraction of the overall job.
- a rather large instruction memory, per sub-processor, holds code for the control circuit, internal interconnects, I/O, and MFUs requiring it.
- Sufficient shared buffer memory is provided to store operands and results of complex computations that make average demands on operand life.
- a general purpose processor (referred to as “ARM926” in the provisional application) runs system software and generic applications (applications other than multimedia and communications).
- the GPP includes its own instruction and data memory or cache.
- the interconnect is based on the Sonics “smart” SoC bus.
- An SoC architecture can include any number of sub-processors but the number of sub-processors defines the number of threads, as will be apparent shortly.
- In FIG. 1 , an application 10 is shown with reference to a digital product 12 including an embodiment of the present invention.
- FIG. 1 is intended to provide the reader with a perspective regarding some, but not necessarily all, of the advantages of a product, which includes an embodiment of the present invention relative to those available in the marketplace.
- the product 12 is a converging product in that it incorporates all of the applications that need to be executed by today's mobile phone device 14 , digital camera device 16 , digital recording or music device 18 and PDA device 20 .
- the product 12 is capable of executing one or more of the functions of the devices 14 - 20 simultaneously yet utilizing less power.
- the product 12 is typically battery-operated and therefore consumes little power even when executing multiple of the applications executed by the devices 14 - 20 . It is also capable of executing code to effectuate operations in conformance with a multitude of applications including but not limited to: H264, MPEG4, UWB, Bluetooth, 2G/2.5G/3G/4G, GPS, MP3 and Security.
- FIG. 2 shows an exemplary integrated circuit 20 including a heterogenous, high-performance, scalable processor 22 coupled to a memory controller and direct memory access (DMA) circuit 24 in accordance with an embodiment of the present invention.
- the processor 22 is coupled to the interface circuit 26 through a general purpose bus 30 , to the interface circuit 28 through a general purpose bus 31 , and further coupled to a general purpose processor 32 through the bus 31 .
- the circuit 20 is further shown to include a clock, reset and power management 34 for generating a clock utilized by the remaining circuits of the circuit 20 , a reset signal utilized in the same manner, and circuitry for managing power for the same.
- JTAG: Joint Test Action Group
- the interface circuit 26 shown coupled to the bus 30 and interface circuit 28 , shown coupled to the bus 31 , include the blocks 40 - 66 , which are generally known to those of ordinary skill in the art and used by current processors.
- the processor 22 , which is a heterogeneous multi-processor, is shown to include shared data memory 70 , shared data memory 72 , a CoolW sub-processor (or block) 74 , a CoolW sub-processor (or block) 76 , a CoolN sub-processor (or block) 78 and a CoolN sub-processor (or block) 80 .
- Each of the blocks 74 - 80 has associated therewith an instruction memory, for example, the CoolW block 74 has associated therewith an instruction memory 82 , the CoolW block 76 has associated therewith an instruction memory 84 , CoolN block 78 has associated therewith an instruction memory 86 and the CoolN block 80 has associated therewith an instruction memory 88 .
- each of the blocks 74 - 80 has associated therewith a control block.
- the block 74 has associated therewith a control block 90
- the block 76 has associated therewith a control block 92
- the block 78 has associated therewith a control block 94
- the block 80 has associated therewith a control circuit 96 .
- the blocks 74 and 76 are designed to generally operate efficiently for 16, 24, 32 and 64-bit operations or applications
- the blocks 78 and 80 are designed to generally operate efficiently for 1, 4, or 8-bit operations or applications.
- the blocks 74 - 80 are essentially sub-processors and the CoolW blocks 74 and 76 are wide (or W) type of blocks, whereas, the CoolN blocks 78 and 80 are narrow (or N) type of blocks. Wide and narrow refers to the relative number of parallel bits processed or routed within a sub-processor and that gives the heterogeneous characteristic of the processor 22 .
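The width split described above amounts to a simple dispatch rule: wide operands go to a CoolW block, narrow operands to a CoolN block. A minimal sketch in C, assuming the exact bit widths listed in the text; the enum and function names are illustrative, not an API from the patent:

```c
/* Hypothetical dispatch rule implied by the text: 16/24/32/64-bit
 * operations map to a CoolW (wide) block, 1/4/8-bit operations to a
 * CoolN (narrow) block. */
typedef enum { COOL_W, COOL_N, COOL_UNSUPPORTED } BlockType;

BlockType block_for_width(int bits)
{
    switch (bits) {
    case 16: case 24: case 32: case 64: return COOL_W;
    case 1:  case 4:  case 8:           return COOL_N;
    default:                            return COOL_UNSUPPORTED;
    }
}
```

Confining an application to the block type matched to its operand width is what gives the processor 22 its heterogeneous character while keeping utilization high.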
- the circuit 24 is coupled directly to one of the sub-processors, i.e. one of the blocks 74 - 80 resulting in the lowest latency path through the sub-processor to which it is coupled. In FIG. 2 , the circuit 24 is shown directly coupled to the block 76 although it may be coupled to any of the blocks 74 , 78 or 80 . Higher priority agents or tasks may be assigned to the block which is directly coupled to the circuit 24 .
- Complicated applications requiring great processing power are not scattered in the circuit 20 , rather, they are grouped or confined to a particular sub-processor or block for processing, which substantially improves power consumption by eliminating or at least reducing wire (metal) or routing lengths thereby reducing wire capacitance. Additionally, utilization is increased and activity is reduced contributing to lower power consumption.
- the circuit 20 is an example of a system on chip (or SoC) offering Quasi-Adiabatic Programmable sub-Processors for multimedia and communications applications. Two types of sub-processors are included, as previously indicated: W type and N type.
- W type or Wide type processor is designed for high Power, Price, Performance efficiency in applications requiring 16, 24, 32 and 64-bits of processing.
- N type or Narrow type processor is designed for high efficiency in applications requiring 8, 4 and 1-bit of processing. While these bit numbers are used in the embodiments of the present invention, by way of figures and description, other numbers of bits may be readily employed.
- Such applications include, but are not limited to, fast Fourier transform (FFT) or inverse FFT (IFFT), Adaptive finite impulse response (FIR) filters, Discrete Cosine transform (DCT) or inverse DCT (IDCT), Real/Complex FIR filter, IIR filter, Root Raised Cosine (RRC) filter, Color Space Converter, 3D Bilinear Texture Mapping, Gouraud Shading, Golay Correlation, Bilinear Interpolation, Median/Row/Column Filter, Alpha Blending, Higher-Order Surface Tessellation, Vertex Shade (Trans/Light), Triangle Setup, Full-Screen Anti-aliasing and Quantization.
- Applications suited to N type sub-processors, such as the blocks 78 and 80 , include, but are not limited to, Variable Length Codec, Viterbi Codec, Turbo Codec, Cyclic Redundancy Check, Walsh Code Generator, Interleaver/De-Interleaver, LFSR, Scrambler, De-spreader, Convolution Encoder, Reed-Solomon Codec, Scrambling Code Generator, and Puncturing/De-puncturing.
- Both W and N type sub-processors are capable of keeping net activity and the resulting energy per transition low while maintaining high performance with increased utilization in comparison with existing architectural approaches like RISC, Reconfigurable, Superscalar, VLIW and Multi-processor approaches.
- the sub-processor architecture of the processor 22 reduces die size resulting in an optimal processing solution and includes a novel architecture referred to as “Quasi-Adiabatic” or “COOL” architecture.
- Programmable processors in accordance therewith are referred to as Quasi-Adiabatic Programmable or COOL Processors.
- Quasi-Adiabatic Programmable or COOL Processors optimize data path, control, memory and functional unit granularity to match a finite subset of applications, as described previously. The way in which this is accomplished will be clear relative to a discussion and presentation of figures relating to the different units or blocks or circuits and their inter-operations of the processor 22 , as presented below.
- Adiabatic processes waste no heat and transfer all the energy used into useful work. Due to the non-adiabatic nature of existing standard processes, circuit design, and logic cell library design techniques, one cannot ever make a truly adiabatic processor. However, among the different possible processor architectures, some may be closer to adiabatic than others.
- the various embodiments of the present invention show a class of processor architectures which are significantly closer to adiabatic than prior-art architectures while nevertheless remaining programmable. They are referred to as “Quasi-Adiabatic Programmable Processors”.
- the integrated circuit 20 allows as many applications as can be supported by the resources within the processor 22 to be executed together or concurrently and the number of such applications far exceeds that which is supported by current processors.
- Examples of applications that can be simultaneously or concurrently executed by the integrated circuit 20 include but are not limited to downloading an application from a wireless device while decoding a movie that has been received, thus, a movie can be downloaded and decoded simultaneously.
- Due to achieving simultaneous application execution on the integrated circuit 20 which has a small die size or silicon real estate as compared to the number of applications it supports, costs of manufacturing the integrated circuit are significantly lower than that which is required for multiple devices of FIG. 1 .
- the processor 22 offers a single programmable framework to a user to implement multiple functions, such as multimedia complex applications. Of important value is the ability of the integrated circuit 20 and namely, the processor 22 , to support future standards adopted by the industry, which are expected to be of greater complexity than that of today's standards.
- Each of the blocks 74 - 80 can execute only one sequence (or stream) of programs at a given time.
- a sequence of programs refers to a function associated with a particular application.
- FFT is a type of sequence.
- different sequences may be dependent on one another.
- an FFT program once completed may store its results in the memory 70 and the next sequence, may then use the stored result.
- Different sequences sharing information in this manner, or being dependent upon each other in this manner, are referred to as a stream flow.
- the memories 70 and 72 each include 8 blocks of 16 kilobytes of memory, however, in other embodiments, different size memory may be utilized.
- the instruction memories 82 , 84 , 86 and 88 are used to store instructions for execution by the blocks 74 - 80 , respectively.
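The stream flow and memory sizes just described can be sketched concretely: a producer sequence (e.g. an FFT) writes its results into the shared data memory and a dependent sequence reads them back, with the memory organized as 8 blocks of 16 kilobytes as stated for the memories 70 and 72. The function names and block-aligned layout are illustrative assumptions, not the patent's actual interface:

```c
#include <stdint.h>
#include <string.h>

/* Shared data memory modeled after the sizes given for memories 70/72:
 * 8 blocks of 16 KB, 128 KB total. Layout is an assumption. */
#define MEM_BLOCKS   8
#define BLOCK_BYTES  (16 * 1024)
#define SHARED_BYTES (MEM_BLOCKS * BLOCK_BYTES)

static uint8_t shared_mem[SHARED_BYTES];

/* Producer sequence (e.g. an FFT) stores its result at a block-aligned
 * offset in the shared memory. */
void sequence_store(int block, const uint8_t *data, size_t len)
{
    memcpy(&shared_mem[block * BLOCK_BYTES], data, len);
}

/* Dependent sequence loads the stored result for its own processing. */
void sequence_load(int block, uint8_t *out, size_t len)
{
    memcpy(out, &shared_mem[block * BLOCK_BYTES], len);
}
```

With this layout, SHARED_BYTES works out to 131,072 bytes (128 KB) per shared memory.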
- FIG. 3 shows, in conceptual form, an architecture 300 including software architecture 302 in combination with some of the hardware components 304 of the circuit 20 of FIG. 1 .
- the hardware components 304 includes the processor 32 , the circuit 26 and circuit 28 and the processor 22 , as described and shown with respect to previous figures.
- a hardware abstraction layer or low level drivers 306 and an operating systems driver 308 cause interfacing or communication between the hardware components 304 and the software architecture 302 .
- the software architecture 302 is further shown to include a CoolBios (basic input output system) 310 coupled to the hardware components 304 and to a scenario 312 , which is for causing multiple applications 314 to be executed, each application 314 including kernels 316 for execution of computationally-intense functions, such as fast Fourier transforms (FFTs), DCTs, Finite Impulse Response (FIR) filtering and others known in the industry.
- the software architecture 302 is further shown to include a system level software changes scenarios 318 , which is shown to communicate with an operating systems interface (OSI) 322 and an operating system 320 .
- the operating system 320 is further shown to communicate with the scenario 312 , applications 314 , and kernels 316 .
- the kernels 316 are engines for execution of computationally intensive code, generally in assembly, or low level code.
- Each of the applications 314 includes many kernels, such as the kernels 316 : discrete cosine transform (DCT), variable length coding (VLC), conditional encoding (CE), cyclic redundancy coding (CRC), down sampling (DS), motion estimation (ME), motion compensation (MC), etc., that consume most of the compute time in an application.
- the scenario-level software 310 contains hooks to quasi-statically change the execution pattern of applications contained within that scenario.
- the scenario 318 causes scenarios to be changed while running on the hardware 304 .
- each of the kernels 316 is written in assembly code for executing an FFT or other computationally-intensive functions while the scenario 312 and each of the applications 314 are in a higher level language, such as “C” for reasons that will become apparent shortly.
- the combination of assembly and a higher level language being executed on a CoolW or CoolN sub-processor and a control block included therein, per the hardware architecture of FIG. 2 , causes simultaneous or concurrent execution of applications in a hierarchical manner while maintaining low power consumption.
- the CoolBios 310 includes a set of software functions that allow input and output communication with the processor 22 and eliminates the need for a full operating system running on the processor 22 .
- the hardware component 304 and software architecture 302 provide an environment to load and execute a multi-application scenario.
- a “scenario”, as referred to herein, is a set of applications, such as the applications 314 , executing concurrently. Some examples of each of the applications 314 , as shown in FIG. 3 , include but are not limited to JPEG, MP3, H.264 and 802.11g.
- a scenario 312 interfaces with the operating system 320 and higher-level software through the OSI 322 and the drivers 308 .
- the software architecture 302 and the hardware components 304 of FIG. 3 allow an operating system (OS) to be loaded onto the processor 32 and the drivers 308 to ultimately allow a scenario 312 to be loaded for causing multiple applications to be executed concurrently.
- the scenario 312 includes overhead information, in its header, to cause turning on or off each of the different applications 314 .
- the JPEG application can be turned off while the remaining applications, such as MP3, H.264 and 802.11g remain on. This effectively aids in reducing power consumption, as the need for power is reduced when an application that is not currently being used is turned off.
- Remaining processing power, i.e., that which is not currently being used, may be devoted to executing a new application with some limitations, as are now discussed.
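The per-application on/off flags carried in the scenario header may be sketched, purely illustratively, as a bitmask in C. The patent does not disclose the header layout; the struct, field, and application names below are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical sketch: one bit per application in the scenario header.
 * A bitmask is one plausible encoding of the on/off overhead
 * information described above; the layout is assumed, not disclosed. */
enum { APP_JPEG = 0, APP_MP3 = 1, APP_H264 = 2, APP_80211G = 3 };

typedef struct {
    uint32_t active_mask;   /* bit i set => application i is on */
} scenario_header;

/* Turn an application off, e.g. to reduce power when it is unused. */
static void scenario_app_off(scenario_header *h, int app) {
    h->active_mask &= ~(1u << app);
}

/* Turn an application back on. */
static void scenario_app_on(scenario_header *h, int app) {
    h->active_mask |= (1u << app);
}

/* Query whether an application is currently on. */
static int scenario_app_is_on(const scenario_header *h, int app) {
    return (int)((h->active_mask >> app) & 1u);
}
```

With this encoding, turning off JPEG while MP3, H.264 and 802.11g stay on is a single bit clear, mirroring the power-saving behavior described above.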
- the scenario 312 is pre-compiled and quasi-statically scheduled, which refers to turning applications on or off.
- the pre-compiled and scheduled scenario 312 , which is in binary form, is then stored in one of the sub-processors, such as the sub-processor 74 . Turning off an application prevents “choking” of the system; that is, bandwidth is improved.
- the system level software changes scenarios 318 causes changing of the scenario 312 , which, as previously-stated, may be done dynamically.
- the code in the latter is written in “C” or other high level code.
- the scenario 312 is written in scenario descriptive language (SDL), which is a unique and proprietary language with all rights reserved by 3Plus1 Technology, Inc. of Saratoga, Calif.
- the hierarchical software/compiler characteristics of the architecture 300 are shown, in conceptual form, and in reference to a software tools hierarchical column 340 , a hierarchical level column 342 and a hardware hierarchical levels column 344 .
- Each of the pieces of software of the software architecture 302 is taken through a different tool so as to avoid a flat methodology.
- the drivers 306 and 308 are used as tools for the general purpose processor (GPP) 32 on the highest level of the tool column 340 while, in the next level of the hierarchical tools, a scenario compiler 348 is used, by an application programmer, to allocate resources to be executed on one or more particular sub-processors.
- the kernels 316 are then advantageously partitioned. An application is divided into smaller portions or threads, switching from one kernel to another.
- the number of threads is limited to the number of sub-processors.
- Applications are handed from one kernel to another as follows: the kernel 316 that is currently operating finishes a particular function, saves the result of the function in shared memory and signals completion of its function; another kernel 316 then utilizes the stored information in shared memory to perform another function.
- Synchronization code is used for this hand-off, which is done by the scenario 312 ; the particular tool is the scenario compiler 348 and the process is automated. Thus, synchronization and control code are generated automatically due to the presence of the thread.
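The hand-off just described can be sketched, in a minimal single-threaded C form, as one kernel storing its result in shared memory and raising a completion flag that a second kernel checks before consuming the data. The function and struct names are illustrative, and the squaring stands in for a real kernel such as a DCT or FFT.

```c
#include <assert.h>

/* Illustrative shared-memory hand-off between two kernels.
 * On the real hardware, generated synchronization code would
 * gate kernel B on the completion signal; here a plain flag
 * models that signal in a single thread. */
typedef struct {
    int result;     /* data produced by the upstream kernel     */
    int done;       /* completion flag set by the upstream kernel */
} shared_buffer;

/* Upstream kernel: compute, store result, signal completion. */
static void kernel_a(shared_buffer *shm, int input) {
    shm->result = input * input;  /* stand-in for a DCT/FFT computation */
    shm->done = 1;                /* signal completion of the function  */
}

/* Downstream kernel: consume only after completion is signaled. */
static int kernel_b(const shared_buffer *shm) {
    if (!shm->done)               /* on real hardware this would block  */
        return -1;
    return shm->result + 1;       /* next stage uses the stored result  */
}
```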
- a controller/compiler 350 is used to compile a high level language being employed, such as “C”, which includes two parts, an optimizing assembler 352 and a low level assembler 354 .
- the goal is to allow the programmer to write mostly C or high level code, rather than assembly, as the former is easier. This is easily allowed for given the sub-processor and hierarchical architecture of the present invention.
- the compiler 350 is optimized for each sub-processor, such as CoolW or CoolN. That is, high level code, written by a user or programmer is compiled, pursuant to certain rules, for storage and execution by a sub-processor and a control block located therein, as previously shown and discussed.
- multiple applications can be executed; for example, a digital camera application and a PDA application can be executed in a single device simultaneously.
- the ability to do so results in foregoing the dynamic ability to change or add a scenario, as might be done in a personal computer, but this limitation is completely tolerable because, in mobile handheld device applications, a device that is to be used with a certain scenario need not normally be quickly programmed to include another scenario.
- a manufacturer introduces a product, such as a PDA
- this is compiled along with other applications, such as a digital camera or MP3, etc.
- a pre-compiled binary code is created using the hierarchical software tools compilation and the sub-processor-based hardware architecture of the present invention.
- Such pre-compiled code and multiple applications make up a scenario. While another scenario may be pre-compiled, it is a rare occurrence due to the reluctance of the manufacturer to quickly introduce another product. Given time, another product is likely to be introduced, warranting another scenario, but the time to switch to another scenario is far from urgent.
- column 342 states which part or component of the hardware is utilized for the corresponding tool of column 340 . That is, viewed in a row, each location of the row within column 342 corresponds to a like-location in column 340 .
- the GPP IDE 346 is handled by the hardware 304
- the scenario compiler 348 is handled by the processor 22
- the compiler 350 is handled by a control block of one of the sub-processors
- the assemblers 352 and 354 are handled by one or more of the sub-processors, such as the sub-processor 72 .
- the particular hardware hierarchical levels are correspondingly enumerated in column 344 by reference.
- at the low level assembler 354 , scheduling is done and all of the hardware components are available, whereas the optimizing assembler 352 includes more restrictions, because it operates at a higher level, but is able to schedule more. Area and power are saved by less scheduling.
- the hierarchical flow of column 340 and the hardware architecture of the processor of FIG. 2 allow for an efficient, low power and flexible processing tool. In the prior art, while the assembler 354 and the GPP IDE 346 are used, the remainder of the column 340 is not.
- a hierarchical compilation involves partitioning the application code into a general purpose processor component that allows interaction between the processor 32 and all other hardware components.
- This general-purpose processor component is mainly to allow switching between different scenarios.
- Scenarios are compiled to run on a combination of multiple sub-processors that communicate through shared memory.
- the scenario compiler 348 is the tool that schedules the coarse grain data dependency graph wherein kernels and control code in one or more applications communicate with each other and with the controlling general-purpose processor. Dependencies are resolved to determine trigger conditions based upon which synchronization code is generated to evaluate these conditions at run-time.
- the compiler 350 targets the subset of a sub-processor or the control block located therein (such as the control block 90 ) that executes application control code and the scenario control and synchronization code.
- the optimizing assembler 352 and the low-level assembler 354 target functions written in sub-processor assembly. They incorporate many scheduling techniques often found in higher level compilers such as register allocation and software pipelining. The binary software objects generated by these assemblers execute either on a CoolW or CoolN processor.
- the scenario compiler 348 schedules the correct operation of the applications' functions and allocates data resources.
- the scenario compiler uses scheduling algorithms from the existing art to create the schedule.
- the scenario compiler 348 emits the source code (a compilable program, written in C) that implements the scenario scheduler.
- the scenario scheduler implements, in software, the schedule chosen by the scenario compiler. That is, it manages application resources (data—placed into shared and external memory—and functions) that are partitioned among the multiple processor cores contained within the target device.
- the scheduler ensures the correct sequencing and synchronization of functions and data that are in use within each of the multiple processors.
- the scenario compiler 348 also generates code to correctly access peripherals and DMA controllers as referenced by SDL-specific language features based on information about the target heterogeneous multi-processor(s) provided to the scenario compiler 348 .
- the SDL allows for a collection of functionality used in the present invention.
- the Scenario Description Language (SDL) is a language created for the purpose of creating high-level, abstract descriptions of scenarios and the applications contained within.
- SDL is compact, human-readable, and scalable.
- SDL provides language syntax and semantics to describe: the flow of data into and out of the sub-processors and between functions executing on the sub-processor; the amount of storage required to stream data through the applications executing on the sub-processor; the priority of each application to facilitate the creation of a functionally correct schedule that satisfies latency requirements; the amount of data (and its type) produced and consumed by each function; the maximum (worst case) execution time of each function, which is used in the creation of the schedule; and the placement of each function onto W- or N-type sub-processors.
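The per-function information SDL is said to capture can be mirrored, purely for illustration, as a C structure. SDL itself is proprietary and its syntax is not shown in this document; every name below is hypothetical, and only the fields come from the enumeration above.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical C mirror of the per-function data SDL describes:
 * data amounts, worst-case execution time, priority, and W/N
 * placement. Field names are assumptions, not SDL syntax. */
typedef enum { SUBPROC_W, SUBPROC_N } subproc_type;

typedef struct {
    const char  *name;            /* function (kernel) name            */
    int          bytes_produced;  /* amount of data produced per call  */
    int          bytes_consumed;  /* amount of data consumed per call  */
    int          wcet_cycles;     /* worst-case execution time         */
    int          priority;        /* application priority for schedule */
    subproc_type placement;       /* W- or N-type sub-processor        */
} sdl_function_desc;

/* One use the compiler could make of these fields: the storage
 * required to stream data through a chain of functions. */
static int sdl_stream_storage(const sdl_function_desc *f, size_t n) {
    int total = 0;
    for (size_t i = 0; i < n; i++)
        total += f[i].bytes_produced;
    return total;
}
```

A scenario compiler holding such records for every kernel has exactly the inputs the text lists: it can check latency budgets against `wcet_cycles`, size shared memory from the byte counts, and place each function on a W- or N-type sub-processor.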
- FIG. 4 shows, in conceptual form, the process of hierarchical software compilation, in block flow form, in accordance with a method of the present invention.
- the blocks that are shown shaded, such as blocks 402 , 412 , 422 , 420 , 424 , 428 , 440 , 430 - 438 , are based on a sub-processor and need to be executed thereby, whereas the remaining blocks, having no shading, are irrespective of sub-processors.
- FIG. 4 shows the way in which software is compiled for use by the processor 22 .
- the scenario compiler 348 of FIG. 3 is the scenario compiler block 418 of FIG. 4 , and the scenario description block 416 , the adjust scenario description and optional optimization block 408 , and the processor-specific data block 422 serve as support for the scenario compiler block 418 .
- the controller/compiler block 424 is the same as 350 of FIG. 3 .
- the optimizing assembler 432 , fed by the assembly code block 430 , is the same as the optimizing assembler 352 of FIG. 3 , and the low level assembler and linker block 434 of FIG. 4 is the same as the low level assembler 354 of FIG. 3 .
- the description of these blocks will not be repeated.
- the block 416 allows the programmer to meet his/her design goals without having to optimize either the high level code or the assembly code. Having the SDL allows for allocating a function from one block to another block at a high level.
- the block 416 serves as a street map.
- the adjust partitioning and kernels of FIG. 21 of the “PROGRAMMABLE PROCESSOR ARCHITECTURE” patent application incorporated herein by reference is the same as the block 408 .
- the scenario description block 416 serves as input to the scenario compiler block 418 , as does the block 422 .
- the output of the block 418 serves as input to the block 420 and the block 408 serves as input to the block 416 .
- the block 416 describes inter-dependencies between the kernels 316 and applications 314 of FIG. 3 .
- the SDL is used by the block 416 .
- the optimizing block 410 is used to optimize high level code and assembly code.
- the block 410 provides input to the existing assembly code block 412 and the existing high level code block 414 .
- the scenario compiler block 418 receives two sets of information, one is a bottom-up set of information and another is a top-down set of information.
- An example of the former is the kernels 316 , i.e., FFTs, DCTs, etc., provided through the processor-specific data block 422 to the block 418 .
- An example of the latter is provided by the block 416 , which is programmed in SDL to serve as control code and for defining inter-dependencies of the kernels and requirement of the application.
- An example of the requirements of the application is the length of time that can be used for processing a frame of information and other time-related requirements.
- the block 418 once provided with the foregoing top-down and bottom-up information, performs a best match process in the form of a schedule.
- the schedule, for example, provides information regarding the inter-dependencies of the sub-processors, execution of which requires synchronization code for the control circuit of a sub-processor.
- the schedule information and synchronization information are provided by the block 420 , which receives input from the block 418 .
- the output of the block 420 is provided as input to the block 424 . Having the block 420 receiving its input from the block 418 is generally not performed by prior art techniques due to their design/hardware limitations.
- the hardware architecture based on sub-processors, as shown in previous figures and the referenced patent document, allows for scheduling and synchronization after the block 418 performs its operation. This allows for the control circuit and each sub-processor to be the same as the other and for the code to be transportable.
- the non-native compilation and simulation block 428 is for compiling in the absence of a processor; that is, during development, while the hardware is not yet ready, compilation is performed in a “non-native” environment, whereas the native simulation block 440 operates in a native environment.
- the block 428 allows for both assembly and high level code compilation while a native compiler or the actual compiler to be ultimately employed is not yet ready.
- an off-the-shelf compiler, i.e., non-native, may be employed and combined with assembly code for simulation. This is sub-processor specific.
- the time consumed executing the kernels 316 and the time consumed executing control code compete with each other.
- the output of the existing assembly functions block 402 serves as input to the existing assembly code 412 , which also receives input from the block 410 .
- the output of the block 412 serves as input to the block 432 , which also receives input from the block 430 .
- the output of the block 432 serves as input to the block 434 and the block 434 serves as input to the block 436 , which provides input to the native simulation block 440 and the implementation complete block 438 .
- the block 440 provides input to a decision block 442 , which determines whether or not design goals have been met; if so, the block 438 is performed and, if not, either the code is optimized by the block 410 or the scenario description is adjusted by the block 408 .
- the output of the blocks 412 , 414 and 420 are all provided to the block 428 . That is, assembly code, high level code and schedule and synchronization code are all provided to the block 428 for execution or simulation.
- the output of the block 428 is provided to a decision block 426 for determining whether or not design goals are met; if so, the process is exited, otherwise, the scenario may be adjusted by the block 408 .
- the output of the block 414 is provided to the block 424 for compilation thereof and the output of the latter is provided to the block 430 .
- optimization is done on a partition-basis. That is, high level code is optimized separately from assembly code and from SDL. For example, assembly code is optimized by the block 432 , high level code is optimized by the block 410 and SDL is optimized by the block 416 . This is a divide and conquer approach allowing advantageous optimization of each type of code that is not attainable without such a division of code.
- the block 424 receives high level code and compiles the same, but outputs assembly code to the block 430 , which is optimized by the block 432 .
- the output of the block 432 is provided to the block 434 for creation of still further low level code and the output of the block 434 is provided to the block 436 for generation of binary object code to be used by a sub-processor.
- the assembly code that is written by the programmer is provided from the block 412 to the block 432 for assembling.
- FIG. 5 shows, in conceptual form, the process of and apparatus for latency verification used in assembly coding and included within the block 434 of FIG. 4 in accordance with a method and apparatus of the present invention.
- instructions 502 are shown to be provided to the block 434 , which will also be referred to as the latency verification block. That is, the current instruction is referred to as “instruction n”, the instruction previous to the latter is referred to as “instruction n−1”, the instruction previous to that is referred to as “instruction n−2”, the instruction previous to that is referred to as “instruction n−3”, and so on. That is, an n−1 instruction is delayed by a program cycle from an n instruction, and so on.
- the block 434 performs various functions, shown in FIG. 5 , in blocks or diamond shapes. Such functions include identifying all instructions directly contributing to instruction n at 508 , which is then used to determine the latency rules relevant to the instructions at 506 , and this is provided to a decision block at 510 for identifying the earliest instruction. If the earliest instruction is identifiable, the process continues to 514 , checking the latency rule against the register value latencies, and if any error results, the process continues to 516 , at which time an error is reported. If no error is detected at 514 , the next instruction is processed at 518 . The process of FIG. 5 is done during compilation and serves as a check for the programmer.
- a database of rules is used to determine what the actual latencies are.
- this determination is made because previous instructions are not necessarily known. For example, the first instruction of a sub-routine is one where its previous instruction is not necessarily known.
- worst case possibilities are determined.
- latency rules are checked against the register value latencies. A latency is basically a delayed or previous instruction. That is, the programmer's annotation is compared to the rules for latency and, if there is a mismatch, an error is reported at 516 . An example of the programmer's annotation is discussed hereinbelow.
- a computer processor implements strict read-after-write behavior for a register when an instruction that reads the register always obtains the value written by the most recent previously executed instruction that writes the register.
- For a register with strict read-after-write behavior, it is sometimes necessary to delay the execution of an instruction that reads the register by one or more clock cycles from when it would otherwise execute. An instruction so delayed is said to be stalled for the one or more clock cycles of delay.
- For some processors, particularly those designed for high-performance digital signal processing and related embedded systems, the complexity of the processor has been reduced by not implementing strict read-after-write behavior for all registers. For such a processor, when an instruction reads a register it may not obtain the value written by the most recently executed instruction to write the register, but instead an older value of the register.
- the specific behavior is always deterministic and can be documented as a set of latency rules; for some processors, these rules are quite complex taken together.
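The deterministic, non-strict behavior just described can be modeled, as a sketch only, by a register whose writes become visible a fixed number of cycles late. The two-cycle latency and all names below are illustrative, not taken from any particular processor's rules.

```c
#include <assert.h>

/* Illustrative model of non-strict read-after-write behavior:
 * a write becomes visible only WRITE_LATENCY cycles later, so an
 * earlier read deterministically returns the older register value.
 * The latency value of 2 is an assumption for the sketch. */
#define WRITE_LATENCY 2

typedef struct {
    int value;          /* currently visible register value     */
    int pending;        /* value written but not yet visible    */
    int cycles_left;    /* cycles until the pending write lands */
} latent_reg;

/* Issue a write; it will not be readable for WRITE_LATENCY cycles. */
static void reg_write(latent_reg *r, int v) {
    r->pending = v;
    r->cycles_left = WRITE_LATENCY;
}

/* Advance one clock cycle; commit the pending write when due. */
static void reg_tick(latent_reg *r) {
    if (r->cycles_left > 0 && --r->cycles_left == 0)
        r->value = r->pending;
}

/* Read the register; may return a value older than the last write. */
static int reg_read(const latent_reg *r) {
    return r->value;
}
```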
- assembly language programmers have invariably been burdened with observing the latency rules entirely on their own without any automated verification from the programming tools that they are applying the rules correctly. When the latency rules are complex, programmers naturally make avoidable errors that may not be discovered until program testing.
- an assembly language program is annotated by the programmer as follows:
- a syntactic annotation is made in the program text to indicate which previous register value the programmer expects the instruction to obtain for the register.
- the lack of an annotation is either an error or indicates a default assumption.
- the default assumption could be that the expected value obtained for a register is the value written by the most recent previously executed instruction that writes the register (i.e., the same as strict read-after-write behavior).
- an annotation is required. For example, if the programmer expects the value obtained for a register to be the value written by the n-th earlier instruction that writes the register (n>1), the annotation could be that a distinctive character be repeated n−1 times adjacent to the register denotation in the instruction. If, for example, this distinctive character is the dollar sign ($), then the assembly language instruction is annotated accordingly.
- Block 502 has an instruction n together with a sequence of earlier instructions, as previously discussed above. Accordingly, and given the hooks placed into the compiler and assembler of the present invention, programming is made simpler and more programmer-friendly with or without reducing power.
- Block 434 determines whether the latency annotations are correct for instruction n for this path, while block 520 performs the other usual functions of an assembler for instruction n.
- At block 508 , the earlier instructions that contribute to the inputs of instruction n are identified.
- Block 506 determines, from the complete set of latency rules, those rules that are relevant to the interaction between each earlier instruction that contributes to the inputs of instruction n and instruction n itself. Where the instructions that may precede instruction n are unknown (for example, at the entrance to a subroutine), worst-case assumptions must be made (blocks 510 and 512 ). Finally, wherever the annotation made by the programmer does not match the behavior of the actual processor, as determined by the latency rules, an error is reported (blocks 514 and 516 ).
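The comparison performed at blocks 514 and 516 can be sketched in C: the number of dollar-sign characters attached to a register operand encodes which earlier write the programmer expects (n−1 dollars for the n-th earlier write), and this is checked against the n dictated by the latency rules. The operand syntax (`r3$$`) and function names are illustrative only; the document does not show the actual instruction format.

```c
#include <assert.h>

/* Count '$' annotations on a register operand and recover the
 * expected n: no annotation means strict read-after-write (n == 1),
 * and n-1 dollar signs mean the n-th earlier write is expected. */
static int annotated_n(const char *operand) {
    int dollars = 0;
    for (const char *p = operand; *p; p++)
        if (*p == '$')
            dollars++;
    return dollars + 1;
}

/* Mirror of blocks 514/516: returns 0 when the programmer's
 * annotation matches the n determined by the latency rules,
 * nonzero when an error should be reported. */
static int check_latency(const char *operand, int rule_n) {
    return annotated_n(operand) != rule_n;
}
```

At a subroutine entrance, where the preceding instructions are unknown, `rule_n` would be set from the worst-case assumptions of blocks 510 and 512 before this check runs.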
Abstract
Description
- This application claims the benefit of U.S. Provisional Patent Application No. 60/598,417, entitled “Quasi-Adiabatic Programmable Processor Architecture” and filed on Aug. 2, 2004 and is a continuation-in-part of U.S. patent application Ser. No. 11/180,068, filed on Jul. 12, 2005 and entitled “PROGRAMMABLE PROCESSOR ARCHITECTURE”, the disclosures of both of which are incorporated herein by reference as though set forth in full.
- 1. Field of the Invention
- This invention relates generally to the field of processors and more particularly, to processors having low power consumption, high performance, low die area, and flexibly and scalably employed in multimedia and communications applications.
- 2. Description of the Prior Art
- With the advent of the popularity of consumer gadgets, such as cell or mobile phones, digital cameras, iPods and personal digital assistants (PDAs), many new standards for communication with these gadgets have been adopted by the industry at large. Some of these standards include H264, MPEG4, UWB, Bluetooth, 2G/2.5G/3G/4G, GPS, MP3 and Security. However, an emerging problem is the use of different standards dictating communications of and between different gadgets, requiring tremendous development effort. One of the reasons for the foregoing problem is that no processor or sub-processor, currently available in the marketplace, is easily programmable for use by all digital devices and conforming to the various mandated standards. It is only a matter of time before this problem grows, as new trends in consumer electronics warrant even more standards adopted by the industry in the future.
- One of the emerging, if not current, requirements of processors is low power consumption yet the ability to cause execution of code sufficient to process multiple applications. Current power consumption is on the order of sub-hundreds of milliwatts per application, whereas the goal is to be under sub-hundreds of milliwatts for executing multiple applications. Another requirement of processors is low cost. Due to the wide utilization of processors in consumer products, the processor must be inexpensive to manufacture; otherwise, its use in most common consumer electronics is not pragmatic.
- To provide specific examples of current processor problems, the problems associated with RISCs, which are used in some consumer products, microprocessors, which are used in other consumer products, digital signal processors (DSPs), which are used in yet other consumer products, application specific integrated circuits (ASICs), which are used in still other consumer products, and some of the other well-known processors, each exhibiting a unique problem, are briefly described below. These problems, along with the advantages of each, are outlined below in a “Cons” section discussing the disadvantages thereof and a “Pros” section discussing the benefits thereof.
- A. RISC/Super Scalar Processors
- RISC and Super Scalar processors have been the most widely accepted architectural solution for all general purpose computing. They are often enhanced with application specific accelerators for solving certain specialized problems within the context of a general solution.
- Examples include: ARM series, ARC series, StrongARM series, and MIPS series.
- Pros:
-
- Industry wide acceptance has led to a more matured tool chain and wide software choices
- A robust programming model has resulted from a very efficient automatic code generator used to generate binaries from high level languages like C.
- Processors in the category are very good general purpose solutions.
- Moore's Law can be effectively used for increasing performance.
- Cons:
-
- The general purpose nature of the architecture does not leverage common/specific characteristics of a set or sub-set of applications for better price, power and performance.
- They consume moderate to high amounts of power with respect to the amount of computation provided.
- Performance increase is mostly achieved at the expense of pipeline latency which adversely affects several multimedia and communication algorithms.
- Complicated hardware scheduler, sophisticated control mechanisms and significantly reduced restrictions for more efficient automatic code generation for general algorithms have made this category of solutions less area efficient.
- B. Very Long Instruction Word (VLIW) and DSPs
- VLIW architectures eliminated some of the inefficiencies found in RISC and Super Scalar architectures to create a fairly general solution in the digital signal processing space. Parallelism was significantly increased. The onus of scheduling was transferred from hardware to software to save area.
- Examples include: TI 64xx, TI 55xx, StarCore SC140, ADI SHARC series.
- Pros:
-
- Restricting the solution to the signal processing space improved 3P in comparison with RISC and Super Scalar architectures
- VLIW architectures provide higher level of parallelism relative to RISC and superscalar architectures.
- An efficient tool chain and industry wide acceptance was generated fairly rapidly.
- Automatic code generation and programmability are showing significant improvements as more processors designed for signal processing fall into this category.
- Cons:
-
- Although the problem solving capability is reduced to the digital signal processing space, it is too broad for a general solution like a VLIW machine to have efficient 3P.
- Control is both expensive and power consuming especially for primitive control code in many multimedia and communication applications.
- Several power and area inefficient techniques were used to make automatic code generation easy. Strong reliance on these techniques by the software community is carrying forward this inefficiency from generation to generation.
- VLIW architectures are not well suited for processing serial code.
- C. Reconfigurable Computing
- Several efforts in industry and academia over the last 10 years were focused towards making a flexible solution with ASIC like price, power and performance characteristics. Many have challenged existing and matured laws and design paradigms with little industry success. Most of the attempts have been in the direction of creating solutions based on coarser grain FPGA like architectures.
- Pros:
-
- Some designs restricted to a specific application while providing needed flexibility within that application proved to be price, power, performance competitive
- Research showed that such restricted yet flexible solutions can be created to address many application hotspots.
- Cons:
-
- Several designs in this space did not provide an efficient and easy programming solution and therefore were not widely accepted by a community adept in programming DSPs.
- Automatic code generation from higher level languages like C was either virtually impossible or highly inefficient for many of the designs.
- 3P advantage was lost when an attempt was made to combine heterogeneous applications using one type of interconnect and one level of granularity. Degree of utilization of the provided parallelism suffered heavily.
- Reconfiguration overhead was significant in 3P for most designs.
- In many cases, the external interface was complicated because the proprietary reconfigurable fabric did not match industry standard system design methodologies.
- Reconfigurable machines are uni-processors and rely heavily on a tightly integrated RISC even for processing primitive control.
- D. Array of Processors
- Some recent approaches are focused on making reconfigurable systems better suited to process heterogeneous applications. Solutions in this direction connect multiple processors optimized for either one or a set of applications to create a processor array fabric.
- Pros:
-
- Different processors optimized for different sets of applications when connected together using an efficient fabric can help solve a wide range of problems.
- The uniform scaling model allows any number of processors to be connected together as performance requirements increase.
- Complex algorithms can be efficiently partitioned.
- Cons:
-
- Although performance requirements may be adequately answered, power and price inefficiencies are too high.
- The programming model varies from processor to processor. This makes the job of the application developer much harder.
- Uniform scaling of multiple processors is a very expensive and power consuming resource. This has shown to display some non-determinism that may be detrimental to the performance of the entire system.
- The programming model at the system level suffers from complexity of communicating data, code and control information without any shared memory resources—since shared memory is not uniformly scalable.
- Extensive and repetitive glue logic required to connect different types of processors to a homogeneous network adds to the area inefficiencies, increases power and adds to the latency.
- In light of the foregoing, there is a need for a low-power, inexpensive, efficient, high-performance, flexibly programmable, heterogeneous processor for allowing execution of one or more multimedia applications simultaneously.
- Briefly, one embodiment of the present invention includes a heterogeneous, high-performance, scalable processor having at least one W-type sub-processor capable of processing W bits or greater in parallel, W being an integer value, and at least one N-type sub-processor capable of processing N bits in parallel, N being an integer value smaller than W. A scenario compiler is included in a hierarchical flow of compilation and used with other compilation and assembler blocks to generate binary code based on different types of codes to allow for efficient processing based on the sub-processors while maintaining low power consumption when the binary code is executed.
-
FIG. 1 shows an application 10 with reference to a digital product 12 including an embodiment of the present invention -
FIG. 2 shows an exemplary integrated circuit 20 including a heterogeneous, high-performance, scalable processor 22 coupled to a memory controller and direct memory access (DMA) circuit 24 in accordance with an embodiment of the present invention. -
FIG. 3 shows, in conceptual form, an architecture 300 including software architecture 302 in combination with some of the hardware components 304 of the circuit 20 of FIG. 2 . -
FIG. 4 shows, in conceptual form, the process of hierarchical software compilation, in block flow form, in accordance with a method of the present invention. -
FIG. 5 shows, in conceptual form, the process of and apparatus for latency verification used in assembly coding and included within the block 434 of FIG. 4 in accordance with a method and apparatus of the present invention. - A sub-processor (“CoolProcessor”) is provided employing logic “macro-functional-units” (function-oriented dedicated logic), replacing the classical fixed datapath, complex execution unit, and register set used in general purpose CPUs and DSP engines, and also replacing the identical processing element used in homogeneous multiprocessors (MPs).
- As shown and described below with reference to
FIG. 2 , the sub-processor employs a heterogeneous interconnect between heterogeneous processors, designed to match multimedia and communications applications. - One embodiment of the present invention employs four sub-processors (referred to as “black boxes” or “processor” in the provisional application No. 60/598,417, entitled “Quasi-Adiabatic Programmable Processor Architecture”). In this patent document, a
processor 22 comprises a plurality of sub-processors. The four sub-processors are split into two categories. The letter “W” designates CoolW sub-processors, capable of handling operands requiring wide datapaths. The CoolW sub-processor, however, will support a wider range of data bits. The sub-processor is also capable of executing 64-bit IEEE-standard floating-point instructions. Its performance is greater than 49 MFLOPS at 150 MHz. The floating-point instruction set includes addition, subtraction, and multiplication. - The letter “N” indicates a CoolN sub-processor serving narrow datapaths, such as required for average-quality imaging and finite-field operations in communications. Each sub-processor comprises a heterogeneous software programmable datapath connecting compute engines (in the CoolW sub-processor type) or compute engines (in the CoolN sub-processor type). The internal compute engines are referred to as MFUs (macro-functional-units). Multiple instances of the MFUs are nonuniformly distributed between the two types of sub-processors.
- A control circuit within each sub-processor operates as an engine and is a high-level-language programmable controller for the sub-processor. The control circuit is aided by a core sequencer, underscoring the hard-wired nature of the MFUs: each unit is aimed at executing efficiently only a fraction of the overall job. A rather large instruction memory, per sub-processor, holds code for the control circuit, internal interconnects, I/O, and MFUs requiring it. Sufficient shared buffer memory is provided to store operands and results of complex computations that make average demands on operand life.
- A general purpose processor (GPP, referred to as “ARM926” in the provisional application) runs system software and generic applications (applications other than multimedia and communications). The GPP includes its own instruction and data memory or cache.
- The interconnect is based on the Sonics “smart” SoC bus. An SoC architecture can include any number of sub-processors but the number of sub-processors defines the number of threads, as will be apparent shortly.
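The W/N partitioning described above lends itself to a small illustration. The following sketch is not from the specification: the helper name is hypothetical and the width sets simply reflect the bit widths this document attributes to each sub-processor type.

```python
# Illustrative sketch: route a kernel to a W- or N-type sub-processor by
# the operand width it requires (widths per the description in this text).
W_WIDTHS = {16, 24, 32, 64}   # CoolW: wide datapaths, e.g. DSP kernels
N_WIDTHS = {1, 4, 8}          # CoolN: narrow datapaths, e.g. finite fields

def sub_processor_type(operand_bits):
    """Return 'W' or 'N' for the sub-processor type suited to operand_bits."""
    if operand_bits in W_WIDTHS:
        return "W"
    if operand_bits in N_WIDTHS:
        return "N"
    raise ValueError("unsupported operand width: %d" % operand_bits)
```

Under this sketch, a 32-bit FFT would land on a CoolW block, while 4-bit finite-field arithmetic for communications would land on a CoolN block.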
- Referring now to
FIG. 1 , an application 10 is shown with reference to a digital product 12 including an embodiment of the present invention. FIG. 1 is intended to provide the reader with a perspective regarding some, but not necessarily all, of the advantages of a product which includes an embodiment of the present invention, relative to those available in the marketplace. - Accordingly, the
product 12 is a converging product in that it incorporates all of the applications that need to be executed by today's mobile phone device 14, digital camera device 16, digital recording or music device 18 and PDA device 20. The product 12 is capable of executing one or more of the functions of the devices 14-20 simultaneously yet utilizing less power. - The
product 12 is typically battery-operated and therefore consumes little power even when executing multiple ones of the applications executed by the devices 14-20. It is also capable of executing code to effectuate operations in conformance with a multitude of applications including but not limited to: H264, MPEG4, UWB, Bluetooth, 2G/2.5G/3G/4G, GPS, MP3 and Security. -
FIG. 2 shows an exemplary integrated circuit 20 including a heterogeneous, high-performance, scalable processor 22 coupled to a memory controller and direct memory access (DMA) circuit 24 in accordance with an embodiment of the present invention. Further shown in FIG. 2 , the processor 22 is coupled to the interface circuit 26 through a general purpose bus 30 and to the interface circuit 28 through a general purpose bus 31, and is further coupled, through the bus 31, to a general purpose processor 32. The circuit 20 is further shown to include a clock reset and power management 34 for generating a clock utilized by the remaining circuits of the circuit 20, a reset signal utilized in the same manner, and circuitry for managing power by the same. There is further included in the circuit 20 a Joint Test Action Group (JTAG) circuit 36. JTAG is used as a standard for testing chips. - The
interface circuit 26, shown coupled to the bus 30, and interface circuit 28, shown coupled to the bus 31, include the blocks 40-66, which are generally known to those of ordinary skill in the art and used by current processors. - The
processor 22, which is a heterogeneous multi-processor, is shown to include shared data memory 70, shared data memory 72, a CoolW sub-processor (or block) 74, a CoolW sub-processor (or block) 76, a CoolN sub-processor (or block) 78 and a CoolN sub-processor (or block) 80. Each of the blocks 74-80 has associated therewith an instruction memory; for example, the CoolW block 74 has associated therewith an instruction memory 82, the CoolW block 76 has associated therewith an instruction memory 84, the CoolN block 78 has associated therewith an instruction memory 86 and the CoolN block 80 has associated therewith an instruction memory 88. Similarly, each of the blocks 74-80 has associated therewith a control block. The block 74 has associated therewith a control block 90, the block 76 has associated therewith a control block 92, the block 78 has associated therewith a control block 94 and the block 80 has associated therewith a control circuit 96. - The blocks 74-80 are essentially sub-processors, and the CoolW blocks 74 and 76 are wide (or W) type of blocks, whereas the CoolN blocks 78 and 80 are narrow (or N) type of blocks. Wide and narrow refer to the relative number of parallel bits processed or routed within a sub-processor, and that gives the heterogeneous characteristic of the
processor 22. Furthermore, the circuit 24 is coupled directly to one of the sub-processors, i.e. one of the blocks 74-80, resulting in the lowest latency path through the sub-processor to which it is coupled. In FIG. 2 , the circuit 24 is shown directly coupled to the block 76, although it may be coupled to any of the blocks 74-80.
- Complicated applications requiring great processing power are not scattered in the
circuit 20; rather, they are grouped or confined to a particular sub-processor or block for processing, which substantially improves power consumption by eliminating or at least reducing wire (metal) or routing lengths, thereby reducing wire capacitance. Additionally, utilization is increased and activity is reduced, contributing to lower power consumption. - The
circuit 20 is an example of a system on chip (SoC) offering Quasi-Adiabatic Programmable sub-Processors for multimedia and communications applications. Two types of sub-processors are included, as previously indicated: W type and N type. The W type or Wide type processor is designed for high Power, Price, Performance efficiency in applications requiring 16, 24, 32 and 64 bits of processing. The N type or Narrow type processor is designed for high efficiency in applications requiring 8, 4 and 1 bit of processing. While these bit numbers are used in the embodiments of the present invention, by way of figures and description, other numbers of bits may be readily employed. - Different applications require different performance or processing capabilities and are thus executed by a different type of block or sub-processor. Take, for instance, applications that are typically executed by DSPs: they would generally be processed by W type sub-processors, such as the
blocks 74 and 76 of FIG. 2 , because they characteristically include commonly occurring DSP kernels. Such applications include, but are not limited to, fast Fourier transform (FFT) or inverse FFT (IFFT), adaptive finite impulse response (FIR) filters, discrete cosine transform (DCT) or inverse DCT (IDCT), real/complex FIR filter, IIR filter, Root Raised Cosine (RRC) filter, Color Space Converter, 3D Bilinear Texture Mapping, Gouraud Shading, Golay Correlation, Bilinear Interpolation, Median/Row/Column Filter, Alpha Blending, Higher-Order Surface Tessellation, Vertex Shade (Trans/Light), Triangle Setup, Full-Screen Anti-aliasing and Quantization. - Other commonly occurring DSP kernels can be executed by N type sub-processors, such as
blocks 78 and 80 of FIG. 2 . - Both W and N type sub-processors are capable of keeping net activity and the resulting energy per transition low while maintaining high performance with increased utilization, in comparison with existing architectural approaches like RISC, Reconfigurable, Superscalar, VLIW and Multi-processor approaches. The sub-processor architecture of the
processor 22 reduces die size resulting in an optimal processing solution and includes a novel architecture referred to as “Quasi-Adiabatic” or “COOL” architecture. Programmable processors in accordance therewith are referred to as Quasi-Adiabatic Programmable or COOL Processors. - Quasi-Adiabatic Programmable or COOL Processors optimize data path, control, memory and functional unit granularity to match a finite subset of applications, as described previously. The way in which this is accomplished will be clear relative to a discussion and presentation of figures relating to the different units or blocks or circuits and their inter-operations of the
processor 22, as presented below. - “Quasi-Adiabatic Programmable” stands for Concurrent Applications of heterOgeneous intercOnnect and functionaL units (COOL) Processors. In terms of thermodynamics, Adiabatic Processes do not waste heat and transfer all of the used energy to performing useful work. Due to the non-adiabatic nature of existing standard processes, circuit design, and logic cell library design techniques, one cannot ever make an Adiabatic Processor. However, among the different possible processor architectures, some may be closer to Adiabatic. The various embodiments of the present invention show a class of processor architectures which are significantly closer to Adiabatic as compared to the architectures of the prior art, while they are, nevertheless, programmable. They are referred to as “Quasi-Adiabatic Programmable Processors”.
- The
integrated circuit 20 allows as many applications as can be supported by the resources within the processor 22 to be executed together or concurrently, and the number of such applications far exceeds that which is supported by current processors. Examples of applications that can be simultaneously or concurrently executed by the integrated circuit 20 include, but are not limited to, downloading an application from a wireless device while decoding a movie that has been received; thus, a movie can be downloaded and decoded simultaneously. Due to achieving simultaneous application execution on the integrated circuit 20, which has a small die size or silicon real estate as compared to the number of applications it supports, costs of manufacturing the integrated circuit are significantly lower than that which is required for the multiple devices of FIG. 1 . Additionally, the processor 22 offers a single programmable framework to a user to implement multiple functions, such as complex multimedia applications. Of particular value is the ability of the integrated circuit 20 and namely, the processor 22, to support future standards adopted by the industry, which are expected to be of greater complexity than today's standards. - Each of the blocks 74-80 can execute only one sequence (or stream) of programs at a given time. A sequence of programs refers to a function associated with a particular application. For example, FFT is a type of sequence. However, different sequences may be dependent on one another. For example, an FFT program, once completed, may store its results in the
memory 70, and the next sequence may then use the stored result. Different sequences sharing information in this manner, or being dependent upon each other in this manner, is referred to as “stream flow”. - In
FIG. 2 , the memories 70 and 72 are the shared data memories utilized by the blocks 74-80. - The
instruction memories 82-88 hold the code executed by the control blocks of their respective sub-processors. -
FIG. 3 shows, in conceptual form, an architecture 300 including software architecture 302 in combination with some of the hardware components 304 of the circuit 20 of FIG. 2 . The hardware components 304 include the processor 32, the circuit 26, the circuit 28 and the processor 22, as described and shown with respect to previous figures. - Included within the
software architecture 302, a hardware abstraction layer or low level drivers 306 and an operating systems driver 308 cause interfacing or communication between the hardware components 304 and the software architecture 302. The software architecture 302 is further shown to include a CoolBios (basic input output system) 310 coupled to the hardware components 304 and to a scenario 312, which is for causing multiple applications 314 to be executed, each application 314 including kernels 316 for execution of computationally-intense functions, such as fast Fourier transforms (FFTs), DCTs, finite impulse response (FIR) filtering and others known in the industry. The software architecture 302 is further shown to include a system level software changes scenarios block 318, which is shown to communicate with an operating systems interface (OSI) 322 and an operating system 320. The operating system 320 is further shown to communicate with the scenario 312, the applications 314, and the kernels 316. The kernels 316 are engines for execution of computationally intensive code, generally in assembly, or low level code. - Each of the
applications 314 includes many kernels, such as the kernels 316: conditional encoding (CE), cyclic redundancy coding (CRC), down sampling (DS), variable length coding (VLC), discrete cosine transform (DCT), motion estimation (ME), motion compensation (MC), etc., that consume most of the compute time in an application. The scenario-level software contains hooks to quasi-statically change the execution pattern of applications contained within that scenario. The block 318 causes scenarios to be changed while running on the hardware 304. From a software perspective, each of the kernels 316 is written in assembly code for executing an FFT or other computationally-intensive functions, while the scenario 312 and each of the applications 314 are in a higher level language, such as “C”, for reasons that will become apparent shortly. For now, suffice it to say that the combination of assembly and a higher level language being executed on a sub-processor, CoolW or CoolN, and a control block included therein, as in the hardware architecture of FIG. 2 , causes simultaneous or concurrent execution of applications, in a hierarchical manner and while maintaining low power consumption. - The
CoolBios 310 includes a set of software functions that allow input and output communication with the processor 22 and eliminates the need for a full operating system running on the processor 22. - The
hardware components 304 and software architecture 302 provide an environment to load and execute a multi-application scenario. A “scenario”, as referred to herein, is a set of applications, such as the applications 314, executing concurrently. Some examples of each of the applications 314, as shown in FIG. 3 , include but are not limited to JPEG, MP3, H.264 and 802.11g. A scenario 312 interfaces with the operating system 320 and higher-level software through the OSI 322 and the drivers 308. - The
software architecture 302 and the hardware components 304 of FIG. 3 allow an operating system (OS) to be loaded onto the processor 32 and the drivers 308 to ultimately allow a scenario 312 to be loaded for causing multiple applications to be executed concurrently. - The
scenario 312 includes, in its header, overhead information to cause turning on or off each of the different applications 314. For example, the JPEG application can be turned off while the remaining applications, such as MP3, H.264 and 802.11g, remain on. This effectively aids in reducing power consumption, as the need for power is reduced when an application that is not currently being used is turned off. Remaining processing power, i.e. that which is not currently being used, may be devoted to executing a new application, with some limitations, as are now discussed. - Essentially, there are three modes of operation within the
software architecture 302. One is a real-time mode, an example of which is 802.11g, which has hardware time constraints. In this case, it is not feasible to add another application because a scenario 312 that includes an 802.11g application has compiled the latter, and in the presence of a pre-compiled application a new application cannot be added. Generally, in the presence of applications having a timing constraint, a new application is not readily added, nor are scenarios dynamically changed, because doing so disturbs the processing balance; however, this is not an issue in mobile applications because scenarios are not readily changed in such applications. - The
scenario 312 is pre-compiled and quasi-statically scheduled, which refers to turning applications on or off. The pre-compiled and scheduled scenario 312, which is in binary form, is then stored in one of the sub-processors, such as the sub-processor 74. Turning off an application prevents “choking” of the system; that is, bandwidth is improved. - The system level software changes
scenarios 318 causes changing of the scenario 312, which, as previously stated, may be done dynamically. The code in the latter is in “C” or a high level code. The scenario 312 is written in scenario descriptive language (SDL), which is a unique and proprietary language with all rights reserved by 3Plus1 Technology, Inc. of Saratoga, Calif. - On the right-hand side of
FIG. 3 , the hierarchical software/compiler characteristics of the architecture 300 are shown, in conceptual form, with reference to a software tools hierarchical column 340, a hierarchical level column 342 and a hardware hierarchical levels column 344. Each of the pieces of software of the software architecture 302 is taken through a different tool so as to avoid a flat methodology. - The
drivers 306 and 308 are developed using the GPP IDE 346 of the tools column 340 while, in the next level of the hierarchical tools, a scenario compiler 348 is used, by an application programmer, to allocate resources for execution on one or more particular sub-processors. The kernels 316 are then advantageously partitioned. An application is divided into smaller portions or threads, switching from one kernel to another. - The number of threads is limited to the number of sub-processors. The way in which applications are handed from one kernel to another is by the
kernel 316 that is currently operating finishing a particular function, saving the result of the function in shared memory and signaling completion of its function, and then another kernel 316 utilizing the stored information in shared memory to perform another function. A synchronization code is used for this hand-off, which is done by the scenario 312; the particular tool is the scenario compiler 348 and the hand-off is automated. Thus, synchronization and control code are generated automatically due to the presence of the thread. - In the next level of the tool hierarchy, as shown in the
column 340, a controller/compiler 350 is used to compile the high level language being employed, such as “C”, and includes two parts, an optimizing assembler 352 and a low level assembler 354. The goal is to allow the programmer to write mostly C or high level code, rather than assembly, as the former is easier. This is easily allowed for given the sub-processor and hierarchical architecture of the present invention. The compiler 350 is optimized for each sub-processor, such as CoolW or CoolN. That is, high level code written by a user or programmer is compiled, pursuant to certain rules, for storage and execution by a sub-processor and a control block located therein, as previously shown and discussed.
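The shared-memory hand-off between kernels described above can be sketched as follows. This is an illustrative model only: the dictionary stands in for the shared data memory and the flag for the automatically generated synchronization code; the kernel names and placeholder computations are hypothetical.

```python
# Minimal sketch of the kernel hand-off: the finishing kernel stores its
# result in shared memory and signals completion; the next kernel's
# trigger condition is that signal.
shared = {"fft_out": None, "fft_done": False}   # stands in for memory 70

def fft_kernel(samples):
    shared["fft_out"] = [2 * s for s in samples]  # placeholder computation
    shared["fft_done"] = True                     # signal completion

def next_kernel():
    if not shared["fft_done"]:                    # synchronization check
        raise RuntimeError("trigger condition not satisfied")
    return sum(shared["fft_out"])                 # consume stored result
```

In the actual architecture this check would be emitted by the scenario compiler as synchronization code for the control circuit, rather than written by hand.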
- By way of example, if a manufacturer introduces a product, such as a PDA, this is compiled along with other applications, such as a digital camera or MP3, etc., and a pre-compiled binary code is created using the hierarchical software tools compilation and the sub-processor-based hardware architecture of the present invention. Such a pre-compiled code and multiple applications make up a scenario, now, while another scenario may be pre-compiled, it is a rare occurrence due to the reluctance of the manufacturer to quickly introduce another product. Given time, another product is likely to be introduced warranting another scenario but the time to switch to another scenario is far from urgent.
- In
FIG. 3 , column 342 states which part or component of the hardware is utilized for the corresponding tool of column 340. That is, viewed in a row, each location of the row within column 342 corresponds to a like-location in column 340. Thus, the GPP IDE 346 is handled by the hardware 304, the scenario compiler 348 is handled by the processor 22 and the compiler 350 is handled by a control block of one of the sub-processors; the assemblers 352 and 354 are handled by the hardware levels shown in column 344. - The
low level assembler 354 performs scheduling while all of the hardware components are available, whereas the optimizing assembler 352 operates under more restrictions because it works at a higher level, yet is able to schedule more. Area and power are saved by less scheduling. The hierarchical flow of column 340 and the hardware architecture of the processor of FIG. 2 allow for an efficient, low power and flexible processing tool. In the prior art, while the assembler 354 and the GPP IDE 346 are used, the remainder of the column 340 is not. - With continued reference to
FIG. 3 , a hierarchical compilation involves partitioning the application code into a general purpose processor component that allows interaction between the processor 32 and all other hardware components. This general-purpose processor component is mainly to allow switching between different scenarios. - Scenarios are compiled to run on a combination of multiple sub-processors that communicate through shared memory. The
scenario compiler 348 is the tool that schedules the coarse grain data dependency graph wherein kernels and control code in one or more applications communicate with each other and with the controlling general-purpose processor. Dependencies are resolved to determine trigger conditions, based upon which synchronization code is generated to evaluate these conditions at run-time. The compiler 350 targets the subset of a sub-processor, or the control block located therein (such as the control block 90), that executes application control code and the scenario control and synchronization code. - The optimizing
assembler 352 and the low-level assembler 354 target functions written in sub-processor assembly. They incorporate many scheduling techniques often found in higher level compilers, such as register allocation and software pipelining. The binary software objects generated by these assemblers execute either on a CoolW or a CoolN processor. - The
scenario compiler 348 schedules the correct operation of the applications' functions and allocates data resources. The scenario compiler uses scheduling algorithms from the existing art to create the schedule. The scenario compiler 348 emits the source code (a compilable program, written in C) that implements the scenario scheduler. The scenario scheduler implements, in software, the schedule chosen by the scenario compiler. That is, it manages application resources (data, placed into shared and external memory, and functions) that are partitioned among the multiple processor cores contained within the target device. The scheduler ensures the correct sequencing and synchronization of functions and data that are in use within each of the multiple processors. The scenario compiler 348 also generates code to correctly access peripherals and DMA controllers as referenced by SDL-specific language features, based on information about the target heterogeneous multi-processor(s) provided to the scenario compiler 348. - The SDL provides a collection of functionality used in the present invention. The Scenario Description Language (SDL) is a language created for the purpose of creating high-level, abstract descriptions of scenarios and the applications contained within them. SDL is compact, human-readable, and scalable. SDL provides language syntax and semantics to describe: the flow of data into and out of the sub-processors and between functions executing on the sub-processors; the amount of storage required to stream data through the applications executing on the sub-processor; the priority of each application, to facilitate the creation of a functionally correct schedule that satisfies latency requirements; the amount of data (and its type) produced and consumed by each function; the maximum (worst case) execution time of each function, which is used in the creation of the schedule; and the placement of each function onto W- or N-type sub-processors.
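The kind of information SDL captures, and how a scheduler might consume it, can be pictured with a small sketch. This is an illustrative model, not the actual SDL or scenario compiler: the record fields mirror the attributes listed above (data flow, worst-case execution time, W/N placement), and all names and numbers are hypothetical.

```python
# Illustrative sketch: SDL-like function descriptions, from which a
# dependency-respecting order is derived and checked against a latency
# budget (the "maximum worst case execution time" attribute above).
from dataclasses import dataclass, field

@dataclass
class Function:
    name: str
    placement: str                  # "W" or "N" sub-processor
    wcet_us: float                  # maximum (worst case) execution time
    consumes: list = field(default_factory=list)  # upstream functions

def make_schedule(functions):
    """Order functions so each runs after the functions it consumes."""
    by_name = {f.name: f for f in functions}
    order, seen = [], set()
    def visit(f):
        if f.name in seen:
            return
        seen.add(f.name)
        for dep in f.consumes:      # a function's triggers come first
            visit(by_name[dep])
        order.append(f.name)
    for f in functions:
        visit(f)
    return order

def meets_latency(functions, budget_us):
    """Feasible only if the summed worst cases fit within the budget."""
    return sum(f.wcet_us for f in functions) <= budget_us

scenario = [Function("ME", "W", 40.0),
            Function("DCT", "W", 25.0, consumes=["ME"]),
            Function("VLC", "N", 10.0, consumes=["DCT"])]
```

The real scenario compiler additionally emits C source implementing the chosen schedule and its synchronization code; the sketch only shows the ordering and budget check.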
-
FIG. 4 shows, in conceptual form, the process of hierarchical software compilation, in block flow form, in accordance with a method of the present invention. In FIG. 4 , some of the blocks are shown shaded. - Generally,
FIG. 4 shows the way in which software is compiled for use by the processor 22. In FIG. 4 , there are shown further details of some of the blocks of FIG. 3 . Specifically, the scenario compiler 348 of FIG. 3 is the scenario compiler block 416 of FIG. 4 , and the scenario description block 418, the adjust scenario description and optional optimization block 408, and the processor-specific data block 422 serve as support for the scenario compiler block 416. The controller/compiler block 424 is the same as the compiler 350 of FIG. 3 . The optimizing assembler 432, which operates on the assembly code block 430, is the same as the optimizing assembler 352 of FIG. 3 , and the low level assembler and linker block 434 of FIG. 4 is the same as the low level assembler 354 of FIG. 3 . Thus, the description of these blocks will not be repeated. - The
block 416 allows the programmer to meet his/her design goals without having to optimize either the high level code or the assembly code. Having the SDL allows for allocating a function from one block to another block at a high level. The block 416 serves as a street map. The adjust partitioning and kernels of FIG. 21 of the “PROGRAMMABLE PROCESSOR ARCHITECTURE” patent application incorporated herein by reference is the same as the block 408. - The
scenario description block 416 serves as input to the scenario compiler block 418, as does the block 422. The output of the block 418 serves as input to the block 420, and the block 408 serves as input to the block 416. The block 416 describes inter-dependencies between the kernels 316 and applications 314 of FIG. 3 . The SDL is used by the block 416. The optimizing block 410 is used to optimize high level code and assembly code. Thus, the block 410 provides input to the existing assembly code block 412 and the existing high level code block 414. The scenario compiler block 418 receives two sets of information: one is a bottom-up set of information and the other is a top-down set of information. An example of the former is the kernels 316, i.e. FFTs, DCTs, etc., which is provided by the block 422 to the block 418 and is assembled, optionally, with power information during assembly. An example of the latter is provided by the block 416, which is programmed in SDL to serve as control code and for defining inter-dependencies of the kernels and requirements of the application. An example of the requirements of the application is the length of time that can be used for processing a frame of information, and other time-related requirements. - The
block 418, once provided with the foregoing top-down and bottom-up information, performs a best match process in the form of a schedule. The schedule, for example, provides information regarding the inter-dependencies of the sub-processors, execution of which requires synchronization code for the control circuit of a sub-processor. The schedule information and synchronization information are provided by the block 420, which receives input from the block 418. The output of the block 420 is provided as input to the block 424. Having the block 420 receive its input from the block 418 is generally not performed by prior art techniques due to their design/hardware limitations. That is, the hardware architecture, based on sub-processors, as shown in previous figures and the referenced patent document, allows for scheduling and synchronization after the block 418 performs its operation. This allows for the control circuit of each sub-processor to be the same as the others and for the code to be transportable. - The non-native compilation and
simulation block 428 is for compiling in the absence of a processor; that is, during development, while the hardware is not yet ready, compilation is performed in a “non-native” environment, whereas the native simulation block 440 operates in the native environment. The block 428 allows for both assembly and high level code compilation while a native compiler, or the actual compiler to be ultimately employed, is not yet ready. Thus, an off-the-shelf compiler, i.e. non-native, may be employed and combined with assembly code for simulation. This is sub-processor specific. The kernels 316 and the control code compete for execution time. - In
FIG. 4 , the output of the existing assembly functions block 402 serves as input to the existing assembly code block 412, which also receives input from the block 410. The output of the block 412 serves as input to the block 432, which also receives input from the block 430. The output of the block 432 serves as input to the block 434, and the block 434 serves as input to the block 436, which provides input to the native simulation block 440 and the implementation complete block 438. The block 440 provides input to a decision block 442, which determines whether or not design goals have been met; if so, the block 438 is performed and, if not, either the code is optimized by the block 410 or the scenario description is adjusted by the block 408. The outputs of the blocks 412, 414 and 420 serve as input to the block 428. That is, assembly code, high level code and schedule and synchronization code are all provided to the block 428 for execution or simulation. The output of the block 428 is provided to a decision block 426 for determining whether or not design goals are met; if so, the process is exited, otherwise the scenario may be adjusted by the block 408. The output of the block 414 is provided to the block 424 for compilation thereof, and the output of the latter is provided to the block 430. - Optimization is done on a partition basis. That is, high level code is optimized separately from assembly code and from SDL. For example, assembly code is optimized by the
block 432, high level code is optimized by the block 410 and SDL is optimized by the block 416. This is a divide-and-conquer approach allowing advantageous optimization of each type of code that is not attainable without such a division of code. The block 424 receives high level code and compiles the same, but outputs assembly code to the block 430, which is optimized by the block 432. The output of the block 432 is provided to the block 434 for creation of still further low level code, and the output of the block 434 is provided to the block 436 for generation of binary object code to be used by a sub-processor. The assembly code that is written by the programmer is provided from the block 412 to the block 432 for assembling. -
FIG. 5 shows, in conceptual form, the process of and apparatus for latency verification used in assembly coding and included within the block 434 of FIG. 4 in accordance with a method and apparatus of the present invention. In FIG. 5, instructions 502 are shown to be provided to the block 434, which will also be referred to as the latency verification block. The current instruction is referred to as “instruction n”, the instruction previous to the latter as “instruction n−1”, the instruction previous to that as “instruction n−2”, the instruction previous to that as “instruction n−3”, and so on. That is, an n−1 instruction is delayed by a program cycle from an n instruction, and so on. - The
block 434 performs various functions, shown in FIG. 5, in blocks or diamond shapes. Such functions include identifying all instructions directly contributing to instruction n (508), which is then used to determine the latency rules relevant to those instructions at 506, and the result is provided to a decision block at 510 for identifying the earliest instruction. If the earliest instruction is identifiable, the process continues to 514, checking the latency rule against the register value latencies; if any error results, the process continues to 516, at which time an error is reported. If no error is detected at 514, the next instruction is processed at 518. The process of FIG. 5 is done during compilation and serves as a check for the programmer. - At 506, a database of rules is used to determine what the actual latencies are. At 510, this determination is made because previous instructions are not necessarily known. For example, the first instruction of a sub-routine is one whose previous instruction is not necessarily known. At 512, worst case possibilities are determined. At 514, latency rules are checked against the register value latencies. A latency is basically a delayed or previous instruction. That is, the programmer's annotation is compared to the rules for latency and, if there is a mismatch, an error is reported at 516. An example of the programmer's annotation is discussed hereinbelow.
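The flow through blocks 508, 506, 510/512 and 514/516 can be sketched as follows. This is a minimal illustration, not the patent's implementation: all names are ours, and the rule database of block 506 is collapsed into a single assumed constant.

```python
# Assumed latency rule (stand-in for the block 506 rule database):
# a register write becomes visible 2 program cycles after it issues.
WRITE_LATENCY = 2

def hardware_delivers(program, n, reg):
    """Which previous write of reg (1 = most recent) instruction n obtains."""
    # Block 508: identify earlier instructions contributing to instruction n.
    writers = [k for k in range(n) if program[k]["dest"] == reg]
    if not writers:
        # Blocks 510/512: earliest instruction not identifiable (e.g. at a
        # subroutine entry); the caller must fall back to worst-case checks.
        return None
    for back, k in enumerate(reversed(writers), start=1):
        if n - k >= WRITE_LATENCY:  # latency rule satisfied for this write
            return back
    # No tracked write is visible yet; an older value stands in.
    return len(writers)

def check(program):
    """Blocks 514/516: compare the programmer's annotations to the rules."""
    errors = []
    for n, insn in enumerate(program):
        for reg, expected in insn.get("srcs", {}).items():
            actual = hardware_delivers(program, n, reg)
            if actual is not None and actual != expected:
                errors.append((n, reg, expected, actual))  # report (516)
    return errors

prog = [
    {"dest": "r2"},                     # instruction 0 writes r2
    {"dest": "r2"},                     # instruction 1 writes r2 again
    {"dest": "r1", "srcs": {"r2": 1}},  # instruction 2 expects the newest r2
]
print(check(prog))  # [(2, 'r2', 1, 2)]: latency exposes the older write
```

With a two-cycle write latency, instruction 2 actually obtains the value from instruction 0, not instruction 1, so the default annotation (most recent write) is flagged, mirroring the mismatch report of block 516.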
- A computer processor implements strict read-after-write behavior for a register when an instruction that reads the register always obtains the value written by the most recent previously executed instruction that writes the register. To achieve strict read-after-write behavior for a register, it is sometimes necessary to delay the execution of an instruction that reads the register by one or more clock cycles from when it would otherwise execute. An instruction so delayed is said to be stalled for the one or more clock cycles of delay. The advantages of implementing strict read-after-write behavior for all registers are:
-
- (1) The same sequence of instructions can execute correctly on a wider range of processor implementations, and
- (2) Assembly language programming is made easier.
- For some processors, particularly those designed for high-performance digital signal processing and related embedded systems, the complexity of the processor has been reduced by not implementing strict read-after-write behavior for all registers. For such a processor, when an instruction reads a register it may not obtain the value written by the most recently executed instruction to write the register, but instead an older value of the register. Although the specific behavior is always deterministic and can be documented as a set of latency rules, for some processors these rules are quite complex taken together. For processors of this kind, unfortunately, assembly language programmers have invariably been burdened with observing the latency rules entirely on their own without any automated verification from the programming tools that they are applying the rules correctly. When the latency rules are complex, programmers naturally make avoidable errors that may not be discovered until program testing.
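The relaxed behavior described above can be modeled with a toy register file in which a write only becomes readable some cycles after it issues. This is a hedged sketch: the class, the two-cycle `LATENCY` constant, and the cycle-numbering scheme are our assumptions, not the patent's.

```python
# Hypothetical model: each write takes effect LATENCY cycles after issue,
# so a read issued too soon observes an older value of the register.
LATENCY = 2  # assumed cycles before a written value becomes readable

class RegisterFile:
    def __init__(self):
        self.current = {}   # architecturally visible values
        self.pending = []   # (ready_cycle, reg, value) writes in flight

    def write(self, cycle, reg, value):
        self.pending.append((cycle + LATENCY, reg, value))

    def read(self, cycle, reg):
        # Commit any in-flight writes whose latency has elapsed.
        for ready, r, v in [p for p in self.pending if p[0] <= cycle]:
            self.current[r] = v
            self.pending.remove((ready, r, v))
        return self.current.get(reg, 0)

rf = RegisterFile()
rf.write(cycle=0, reg="r2", value=7)  # write issued at cycle 0
early = rf.read(cycle=1, reg="r2")    # too soon: sees the old value, 0
late = rf.read(cycle=2, reg="r2")     # latency elapsed: sees 7
print(early, late)                    # 0 7
```

A processor with strict read-after-write behavior would instead stall the early read until cycle 2; dropping the stall hardware is exactly what creates the latency rules the programmer must track.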
- Latency Verification:
- In
FIG. 5, a process is described by which an assembler or other programming tool can verify that a processor's latency rules are being applied correctly by the programmer. First, an assembly language program is annotated by the programmer as follows: - For each register read by each instruction, a syntactic annotation is made in the program text to indicate which previous register value the programmer expects the instruction to obtain for the register. The lack of an annotation is either an error or indicates a default assumption. For example, the default assumption could be that the expected value obtained for a register is the value written by the most recent previously executed instruction that writes the register (i.e., the same as strict read-after-write behavior). Whenever the programmer expects a value different from the default assumption, an annotation is required. For example, if the programmer expects the value obtained for a register to be the value written by the n-th earlier instruction that writes the register (n>1), the annotation could be that a distinctive character be repeated n−1 times adjacent to the register denotation in the instruction. If, for example, this distinctive character is the dollar sign ($), then the assembly language instruction
-
- add r1, $$r2, $r3
would indicate that the programmer expects the value obtained for register r3 to be the value written by the second previous instruction to write r3, and the value obtained for register r2 to be the value written by the third previous instruction to write r2. In the above example, the value register r2 held two writes ago and the previous value of register r3 are added, with the result placed in register r1. The assembler or block 434 checks to ensure that all of these values are available by performing the process of FIG. 5. It should be noted that the annotation need not be a dollar sign; rather, it can be any notation.
- Given these annotations, for each instruction, the assembler or other programming tool automatically determines whether the programmer's expectations are correct, by examining the sequences of instructions that can execute previous to the given instruction along all paths leading to the given instruction, and applying the documented latency rules to these sequences.
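As an illustration, the `$` annotation convention described above could be parsed as follows. This is a sketch under stated assumptions: the function name, the operand order `dest, src1, src2`, and the `(register, n)` return format are ours, not the patent's.

```python
import re

def parse_operand(text):
    """Return (register, n) where n is the n-th earlier write expected.

    'r3'   -> ('r3', 1)  default: value from the most recent write
    '$r3'  -> ('r3', 2)  value from the second previous write
    '$$r2' -> ('r2', 3)  value from the third previous write
    """
    m = re.fullmatch(r"(\$*)(\w+)", text.strip())
    if not m:
        raise ValueError(f"bad operand: {text!r}")
    dollars, reg = m.groups()
    # n-1 repeated '$' characters denote the n-th earlier writing instruction.
    return reg, len(dollars) + 1

# The patent's example instruction, split into mnemonic and operand list.
mnemonic, operand_text = "add r1, $$r2, $r3".split(None, 1)
ops = [parse_operand(tok) for tok in operand_text.split(",")]
print(ops)  # [('r1', 1), ('r2', 3), ('r3', 2)]
```

The tool would then compare each parsed expectation against the latency rules along every execution path reaching the instruction, as the process of FIG. 5 describes.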
FIG. 5 shows one possible manifestation of this process as part of a modified assembler. Block 502 has an instruction n together with a sequence of earlier instructions, as previously discussed above. Accordingly, and given the hooks placed into the compiler and assembler of the present invention, programming is made simpler and more programmer-friendly, with or without reducing power.
Block 434 determines whether the latency annotations are correct for instruction n for this path, while block 520 performs the other usual functions of an assembler for instruction n. In block 508, the earlier instructions that contribute to the inputs of instruction n are identified. Block 506 determines, from the complete set of latency rules, those rules that are relevant to the interaction between each earlier instruction that contributes to the inputs of instruction n and instruction n itself. Where the instructions that may precede instruction n are unknown (for example, at the entrance to a subroutine), worst-case assumptions must be made (blocks 510 and 512). Finally, wherever the annotation made by the programmer does not match the behavior of the actual processor, as determined by the latency rules, an error is reported (blocks 514 and 516). - Although the present invention has been described in terms of specific embodiments, it is anticipated that alterations and modifications thereof will no doubt become apparent to those skilled in the art. It is therefore intended that the following claims be interpreted as covering all such alterations and modifications as fall within the true spirit and scope of the invention.
Claims (14)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US11/195,429 US20060026578A1 (en) | 2004-08-02 | 2005-08-02 | Programmable processor architecture hirarchical compilation |
Applications Claiming Priority (3)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US59841704P | 2004-08-02 | 2004-08-02 | |
US11/180,068 US7721069B2 (en) | 2004-07-13 | 2005-07-12 | Low power, high performance, heterogeneous, scalable processor architecture |
US11/195,429 US20060026578A1 (en) | 2004-08-02 | 2005-08-02 | Programmable processor architecture hirarchical compilation |
Related Parent Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US11/180,068 Continuation-In-Part US7721069B2 (en) | 2004-07-13 | 2005-07-12 | Low power, high performance, heterogeneous, scalable processor architecture |
Publications (1)
Publication Number | Publication Date |
---|---|
US20060026578A1 true US20060026578A1 (en) | 2006-02-02 |
Family
ID=35733871
Family Applications (5)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US11/180,068 Active 2025-09-28 US7721069B2 (en) | 2004-07-13 | 2005-07-12 | Low power, high performance, heterogeneous, scalable processor architecture |
US11/195,429 Abandoned US20060026578A1 (en) | 2004-08-02 | 2005-08-02 | Programmable processor architecture hirarchical compilation |
US11/848,023 Abandoned US20070294511A1 (en) | 2004-08-02 | 2007-08-30 | Programmable Processor Architecture |
US12/782,431 Active 2025-10-22 US8386751B2 (en) | 2004-07-13 | 2010-05-18 | Low power, high performance, heterogeneous, scalable processor architecture |
US13/775,402 Active US8880850B2 (en) | 2004-07-13 | 2013-02-25 | Low power, high performance, heterogeneous, scalable processor architecture |
Family Applications Before (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US11/180,068 Active 2025-09-28 US7721069B2 (en) | 2004-07-13 | 2005-07-12 | Low power, high performance, heterogeneous, scalable processor architecture |
Family Applications After (3)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US11/848,023 Abandoned US20070294511A1 (en) | 2004-08-02 | 2007-08-30 | Programmable Processor Architecture |
US12/782,431 Active 2025-10-22 US8386751B2 (en) | 2004-07-13 | 2010-05-18 | Low power, high performance, heterogeneous, scalable processor architecture |
US13/775,402 Active US8880850B2 (en) | 2004-07-13 | 2013-02-25 | Low power, high performance, heterogeneous, scalable processor architecture |
Country Status (2)
Country | Link |
---|---|
US (5) | US7721069B2 (en) |
WO (1) | WO2006017482A2 (en) |
Cited By (13)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20070011199A1 (en) * | 2005-06-20 | 2007-01-11 | Microsoft Corporation | Secure and Stable Hosting of Third-Party Extensions to Web Services |
US20080082786A1 (en) * | 2006-10-02 | 2008-04-03 | William Stuart Lovell | Super-scalable, continuous flow instant logic™ binary circuitry actively structured by code-generated pass transistor interconnects |
US20080215768A1 (en) * | 2006-10-24 | 2008-09-04 | Alastair David Reid | Variable coherency support when mapping a computer program to a data processing apparatus |
US20080244507A1 (en) * | 2007-03-30 | 2008-10-02 | Microsoft Corporation | Homogeneous Programming For Heterogeneous Multiprocessor Systems |
US20100042751A1 (en) * | 2007-11-09 | 2010-02-18 | Kouichi Ishino | Data transfer control device, data transfer device, data transfer control method, and semiconductor integrated circuit using reconfigured circuit |
US20100162185A1 (en) * | 2005-08-12 | 2010-06-24 | University Of Sussex | Electronic circuit design |
US20100211760A1 (en) * | 2009-02-18 | 2010-08-19 | Egger Bernhard | Apparatus and method for providing instruction for heterogeneous processor |
US20130036408A1 (en) * | 2011-08-02 | 2013-02-07 | International Business Machines Corporation | Technique for compiling and running high-level programs on heterogeneous computers |
US20130185704A1 (en) * | 2012-01-18 | 2013-07-18 | International Business Machines Corporation | Providing performance tuned versions of compiled code to a cpu in a system of heterogeneous cores |
US8582654B1 (en) * | 2009-09-30 | 2013-11-12 | Hewlett-Packard Development Company, L.P. | Generating a deblocked version of video frames using motion estimation |
US8789063B2 (en) | 2007-03-30 | 2014-07-22 | Microsoft Corporation | Master and subordinate operating system kernels for heterogeneous multiprocessor systems |
US9652817B2 (en) | 2015-03-12 | 2017-05-16 | Samsung Electronics Co., Ltd. | Automated compute kernel fusion, resizing, and interleave |
US11755382B2 (en) * | 2017-11-03 | 2023-09-12 | Coherent Logix, Incorporated | Programming flow for multi-processor system |
Families Citing this family (71)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US9330060B1 (en) | 2003-04-15 | 2016-05-03 | Nvidia Corporation | Method and device for encoding and decoding video image data |
US8660182B2 (en) * | 2003-06-09 | 2014-02-25 | Nvidia Corporation | MPEG motion estimation based on dual start points |
US7721069B2 (en) * | 2004-07-13 | 2010-05-18 | 3Plus1 Technology, Inc | Low power, high performance, heterogeneous, scalable processor architecture |
US8731071B1 (en) * | 2005-12-15 | 2014-05-20 | Nvidia Corporation | System for performing finite input response (FIR) filtering in motion estimation |
US8724702B1 (en) | 2006-03-29 | 2014-05-13 | Nvidia Corporation | Methods and systems for motion estimation used in video coding |
US8660380B2 (en) | 2006-08-25 | 2014-02-25 | Nvidia Corporation | Method and system for performing two-dimensional transform on data value array with reduced power consumption |
US7463465B2 (en) * | 2006-12-28 | 2008-12-09 | General Electric Company | Series arc fault current interrupters and methods |
US8756482B2 (en) | 2007-05-25 | 2014-06-17 | Nvidia Corporation | Efficient encoding/decoding of a sequence of data frames |
US8726241B1 (en) * | 2007-06-06 | 2014-05-13 | Rockwell Collins, Inc. | Method and system for the development of high-assurance computing elements |
US9118927B2 (en) | 2007-06-13 | 2015-08-25 | Nvidia Corporation | Sub-pixel interpolation and its application in motion compensated encoding of a video signal |
KR101445794B1 (en) * | 2007-06-18 | 2014-11-03 | 삼성전자주식회사 | Programmable device for software defined radio terminal |
US8873625B2 (en) | 2007-07-18 | 2014-10-28 | Nvidia Corporation | Enhanced compression in representing non-frame-edge blocks of image frames |
US8065506B2 (en) * | 2007-08-17 | 2011-11-22 | Texas Instruments Incorporated | Application specific instruction set processor for digital radio processor receiving chain signal processing |
US9081901B2 (en) * | 2007-10-31 | 2015-07-14 | Raytheon Company | Means of control for reconfigurable computers |
US20090171603A1 (en) * | 2007-12-28 | 2009-07-02 | Sriram Changali | Methods of detecting series arcs in electrical signals |
US8054591B2 (en) * | 2008-07-24 | 2011-11-08 | General Electric Company | Arc detection using discrete wavelet transforms |
WO2010016169A1 (en) * | 2008-08-07 | 2010-02-11 | 日本電気株式会社 | Multiprocessor system and method for controlling the same |
US8755515B1 (en) | 2008-09-29 | 2014-06-17 | Wai Wu | Parallel signal processing system and method |
US8666181B2 (en) | 2008-12-10 | 2014-03-04 | Nvidia Corporation | Adaptive multiple engine image motion detection system and method |
US8159793B2 (en) * | 2008-12-22 | 2012-04-17 | General Electric Company | Arc detection using detailed and approximate coefficients from discrete wavelet transforms |
US8170816B2 (en) | 2008-12-29 | 2012-05-01 | General Electric Company | Parallel arc detection using discrete wavelet transforms |
CN102804165A (en) * | 2009-02-11 | 2012-11-28 | 四次方有限公司 | Front end processor with extendable data path |
US8839214B2 (en) * | 2010-06-30 | 2014-09-16 | Microsoft Corporation | Indexable type transformations |
US20120084539A1 (en) * | 2010-09-29 | 2012-04-05 | Nyland Lars S | Method and sytem for predicate-controlled multi-function instructions |
WO2012052774A2 (en) * | 2010-10-21 | 2012-04-26 | Bluwireless Technology Limited | Data processing units |
US8683178B2 (en) * | 2011-01-21 | 2014-03-25 | Apple Inc. | Sharing a fault-status register when processing vector instructions |
US8533512B2 (en) * | 2011-02-10 | 2013-09-10 | International Business Machines Corporation | Dynamic power and performance calibration of data processing systems |
US9569771B2 (en) | 2011-04-29 | 2017-02-14 | Stephen Lesavich | Method and system for storage and retrieval of blockchain blocks using galois fields |
US9361479B2 (en) | 2011-04-29 | 2016-06-07 | Stephen Lesavich | Method and system for electronic content storage and retrieval using Galois fields and geometric shapes on cloud computing networks |
US9137250B2 (en) | 2011-04-29 | 2015-09-15 | Stephen Lesavich | Method and system for electronic content storage and retrieval using galois fields and information entropy on cloud computing networks |
US9037564B2 (en) | 2011-04-29 | 2015-05-19 | Stephen Lesavich | Method and system for electronic content storage and retrieval with galois fields on cloud computing networks |
KR101918464B1 (en) * | 2011-09-14 | 2018-11-15 | 삼성전자 주식회사 | A processor and a swizzle pattern providing apparatus based on a swizzled virtual register |
US8959419B1 (en) * | 2011-12-20 | 2015-02-17 | Marvell International Ltd. | Viterbi architecture for BDR/BLE |
CN107220029B (en) | 2011-12-23 | 2020-10-27 | 英特尔公司 | Apparatus and method for mask permute instruction |
WO2013095637A1 (en) | 2011-12-23 | 2013-06-27 | Intel Corporation | Apparatus and method of improved permute instructions |
WO2013095630A1 (en) | 2011-12-23 | 2013-06-27 | Intel Corporation | Apparatus and method of improved extract instructions background |
CN111831335A (en) | 2011-12-23 | 2020-10-27 | 英特尔公司 | Apparatus and method for improved insertion of instructions |
US9946540B2 (en) | 2011-12-23 | 2018-04-17 | Intel Corporation | Apparatus and method of improved permute instructions with multiple granularities |
US8762808B2 (en) | 2012-02-22 | 2014-06-24 | Lsi Corporation | Multi-processing architecture for an LTE turbo decoder (TD) |
US20130311753A1 (en) * | 2012-05-19 | 2013-11-21 | Venu Kandadai | Method and device (universal multifunction accelerator) for accelerating computations by parallel computations of middle stratum operations |
US9292298B2 (en) | 2013-07-08 | 2016-03-22 | Arm Limited | Data processing apparatus having SIMD processing circuitry |
US9280845B2 (en) * | 2013-12-27 | 2016-03-08 | Qualcomm Incorporated | Optimized multi-pass rendering on tiled base architectures |
US9672043B2 (en) | 2014-05-12 | 2017-06-06 | International Business Machines Corporation | Processing of multiple instruction streams in a parallel slice processor |
US9665372B2 (en) | 2014-05-12 | 2017-05-30 | International Business Machines Corporation | Parallel slice processor with dynamic instruction stream mapping |
US9760375B2 (en) | 2014-09-09 | 2017-09-12 | International Business Machines Corporation | Register files for storing data operated on by instructions of multiple widths |
US9720696B2 (en) | 2014-09-30 | 2017-08-01 | International Business Machines Corporation | Independent mapping of threads |
US9977678B2 (en) | 2015-01-12 | 2018-05-22 | International Business Machines Corporation | Reconfigurable parallel execution and load-store slice processor |
US10133576B2 (en) | 2015-01-13 | 2018-11-20 | International Business Machines Corporation | Parallel slice processor having a recirculating load-store queue for fast deallocation of issue queue entries |
US10133581B2 (en) | 2015-01-13 | 2018-11-20 | International Business Machines Corporation | Linkable issue queue parallel execution slice for a processor |
US9983875B2 (en) | 2016-03-04 | 2018-05-29 | International Business Machines Corporation | Operation of a multi-slice processor preventing early dependent instruction wakeup |
US10037211B2 (en) | 2016-03-22 | 2018-07-31 | International Business Machines Corporation | Operation of a multi-slice processor with an expanded merge fetching queue |
US10346174B2 (en) | 2016-03-24 | 2019-07-09 | International Business Machines Corporation | Operation of a multi-slice processor with dynamic canceling of partial loads |
US10761854B2 (en) | 2016-04-19 | 2020-09-01 | International Business Machines Corporation | Preventing hazard flushes in an instruction sequencing unit of a multi-slice processor |
US10037229B2 (en) | 2016-05-11 | 2018-07-31 | International Business Machines Corporation | Operation of a multi-slice processor implementing a load/store unit maintaining rejected instructions |
US9934033B2 (en) | 2016-06-13 | 2018-04-03 | International Business Machines Corporation | Operation of a multi-slice processor implementing simultaneous two-target loads and stores |
US10042647B2 (en) | 2016-06-27 | 2018-08-07 | International Business Machines Corporation | Managing a divided load reorder queue |
US10318419B2 (en) | 2016-08-08 | 2019-06-11 | International Business Machines Corporation | Flush avoidance in a load store unit |
CN106681767B (en) * | 2016-12-29 | 2020-07-10 | 广州华多网络科技有限公司 | Light application adding method and device |
US10387298B2 (en) | 2017-04-04 | 2019-08-20 | Hailo Technologies Ltd | Artificial neural network incorporating emphasis and focus techniques |
US11544545B2 (en) | 2017-04-04 | 2023-01-03 | Hailo Technologies Ltd. | Structured activation based sparsity in an artificial neural network |
US11238334B2 (en) | 2017-04-04 | 2022-02-01 | Hailo Technologies Ltd. | System and method of input alignment for efficient vector operations in an artificial neural network |
US11615297B2 (en) | 2017-04-04 | 2023-03-28 | Hailo Technologies Ltd. | Structured weight based sparsity in an artificial neural network compiler |
US11551028B2 (en) | 2017-04-04 | 2023-01-10 | Hailo Technologies Ltd. | Structured weight based sparsity in an artificial neural network |
WO2019039113A1 (en) * | 2017-08-24 | 2019-02-28 | ソニーセミコンダクタソリューションズ株式会社 | Processor and information processing system |
US20210173662A1 (en) * | 2019-12-05 | 2021-06-10 | International Business Machines Corporation | Processor unit for multiply and accumulate operations |
US11811421B2 (en) | 2020-09-29 | 2023-11-07 | Hailo Technologies Ltd. | Weights safety mechanism in an artificial neural network processor |
US11263077B1 (en) | 2020-09-29 | 2022-03-01 | Hailo Technologies Ltd. | Neural network intermediate results safety mechanism in an artificial neural network processor |
US11237894B1 (en) | 2020-09-29 | 2022-02-01 | Hailo Technologies Ltd. | Layer control unit instruction addressing safety mechanism in an artificial neural network processor |
US11874900B2 (en) | 2020-09-29 | 2024-01-16 | Hailo Technologies Ltd. | Cluster interlayer safety mechanism in an artificial neural network processor |
US11221929B1 (en) | 2020-09-29 | 2022-01-11 | Hailo Technologies Ltd. | Data stream fault detection mechanism in an artificial neural network processor |
US20230176863A1 (en) * | 2021-12-03 | 2023-06-08 | Taiwan Semiconductor Manufacturing Company, Ltd. | Memory interface |
Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US5878085A (en) * | 1997-08-15 | 1999-03-02 | Sicom, Inc. | Trellis coded modulation communications using pilot bits to resolve phase ambiguities |
US5909559A (en) * | 1997-04-04 | 1999-06-01 | Texas Instruments Incorporated | Bus bridge device including data bus of first width for a first processor, memory controller, arbiter circuit and second processor having a different second data width |
US6166748A (en) * | 1995-11-22 | 2000-12-26 | Nintendo Co., Ltd. | Interface for a high performance low cost video game system with coprocessor providing high speed efficient 3D graphics and digital audio signal processing |
US20020108104A1 (en) * | 2000-09-13 | 2002-08-08 | Xueshu Song | Certification and manual compiling wizard |
US20030014742A1 (en) * | 2001-07-09 | 2003-01-16 | Sasken Communication Technologies Limited | Technique for compiling computer code to reduce energy consumption while executing the code |
US20040078411A1 (en) * | 2002-10-22 | 2004-04-22 | Joshua Porten | Galois field arithmetic unit for use within a processor |
US20040123276A1 (en) * | 2002-12-19 | 2004-06-24 | Paul Knueven | Assembly directives for the support of multi-language programming |
US20070044079A1 (en) * | 2005-06-02 | 2007-02-22 | Tharas Systems Inc. | A system and method for compiling a description of an electronic circuit to instructions adapted to execute on a plurality of processors |
Family Cites Families (16)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
DE69124437T2 (en) * | 1990-08-09 | 1997-07-03 | Silicon Graphics Inc | Method and device for reversing byte order in a computer |
KR100248903B1 (en) | 1992-09-29 | 2000-03-15 | 야스카와 히데아키 | System and method for handling load and/or store operating in a superscalar microprocessor |
US5867726A (en) * | 1995-05-02 | 1999-02-02 | Hitachi, Ltd. | Microcomputer |
US5987561A (en) | 1995-08-31 | 1999-11-16 | Advanced Micro Devices, Inc. | Superscalar microprocessor employing a data cache capable of performing store accesses in a single clock cycle |
US5784602A (en) * | 1996-10-08 | 1998-07-21 | Advanced Risc Machines Limited | Method and apparatus for digital signal processing for integrated circuit architecture |
US5951677A (en) * | 1998-05-29 | 1999-09-14 | Texas Instruments Incorporated | Efficient hardware implementation of euclidean array processing in reed-solomon decoding |
US6247158B1 (en) * | 1998-11-30 | 2001-06-12 | Itt Manufacturing Enterprises, Inc. | Digital broadcasting system and method |
US6282631B1 (en) * | 1998-12-23 | 2001-08-28 | National Semiconductor Corporation | Programmable RISC-DSP architecture |
US6832306B1 (en) * | 1999-10-25 | 2004-12-14 | Intel Corporation | Method and apparatus for a unified RISC/DSP pipeline controller for both reduced instruction set computer (RISC) control instructions and digital signal processing (DSP) instructions |
US6539467B1 (en) | 1999-11-15 | 2003-03-25 | Texas Instruments Incorporated | Microprocessor with non-aligned memory access |
US6732203B2 (en) * | 2000-01-31 | 2004-05-04 | Intel Corporation | Selectively multiplexing memory coupling global bus data bits to narrower functional unit coupling local bus |
JP2001338492A (en) * | 2000-05-26 | 2001-12-07 | Matsushita Electric Ind Co Ltd | Semiconductor device and control method |
US6922716B2 (en) * | 2001-07-13 | 2005-07-26 | Motorola, Inc. | Method and apparatus for vector processing |
US6986092B2 (en) * | 2001-12-04 | 2006-01-10 | Qualcomm Inc. | Erasure-and-single-error correction decoder for linear block codes |
US7360055B2 (en) * | 2003-02-10 | 2008-04-15 | Broadcom Corporation | Two address map for transactions between an X-bit processor and a Y-bit wide memory |
US7721069B2 (en) * | 2004-07-13 | 2010-05-18 | 3Plus1 Technology, Inc | Low power, high performance, heterogeneous, scalable processor architecture |
-
2005
- 2005-07-12 US US11/180,068 patent/US7721069B2/en active Active
- 2005-08-02 US US11/195,429 patent/US20060026578A1/en not_active Abandoned
- 2005-08-02 WO PCT/US2005/027381 patent/WO2006017482A2/en active Application Filing
-
2007
- 2007-08-30 US US11/848,023 patent/US20070294511A1/en not_active Abandoned
-
2010
- 2010-05-18 US US12/782,431 patent/US8386751B2/en active Active
-
2013
- 2013-02-25 US US13/775,402 patent/US8880850B2/en active Active
Cited By (23)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20070011199A1 (en) * | 2005-06-20 | 2007-01-11 | Microsoft Corporation | Secure and Stable Hosting of Third-Party Extensions to Web Services |
US8849968B2 (en) | 2005-06-20 | 2014-09-30 | Microsoft Corporation | Secure and stable hosting of third-party extensions to web services |
US20100162185A1 (en) * | 2005-08-12 | 2010-06-24 | University Of Sussex | Electronic circuit design |
US7895560B2 (en) | 2006-10-02 | 2011-02-22 | William Stuart Lovell | Continuous flow instant logic binary circuitry actively structured by code-generated pass transistor interconnects |
US20080082786A1 (en) * | 2006-10-02 | 2008-04-03 | William Stuart Lovell | Super-scalable, continuous flow instant logic™ binary circuitry actively structured by code-generated pass transistor interconnects |
WO2008042186A2 (en) * | 2006-10-02 | 2008-04-10 | Lovell William S | Information processing using binary gates structured by code-selected pass transistors |
WO2008042186A3 (en) * | 2006-10-02 | 2008-09-25 | William S Lovell | Information processing using binary gates structured by code-selected pass transistors |
US20080215768A1 (en) * | 2006-10-24 | 2008-09-04 | Alastair David Reid | Variable coherency support when mapping a computer program to a data processing apparatus |
US8250549B2 (en) * | 2006-10-24 | 2012-08-21 | Arm Limited | Variable coherency support when mapping a computer program to a data processing apparatus |
US20080244507A1 (en) * | 2007-03-30 | 2008-10-02 | Microsoft Corporation | Homogeneous Programming For Heterogeneous Multiprocessor Systems |
US8789063B2 (en) | 2007-03-30 | 2014-07-22 | Microsoft Corporation | Master and subordinate operating system kernels for heterogeneous multiprocessor systems |
US20100042751A1 (en) * | 2007-11-09 | 2010-02-18 | Kouichi Ishino | Data transfer control device, data transfer device, data transfer control method, and semiconductor integrated circuit using reconfigured circuit |
US9710241B2 (en) * | 2009-02-18 | 2017-07-18 | Samsung Electronics Co., Ltd. | Apparatus and method for providing instruction for heterogeneous processor |
US20100211760A1 (en) * | 2009-02-18 | 2010-08-19 | Egger Bernhard | Apparatus and method for providing instruction for heterogeneous processor |
US8582654B1 (en) * | 2009-09-30 | 2013-11-12 | Hewlett-Packard Development Company, L.P. | Generating a deblocked version of video frames using motion estimation |
US8789026B2 (en) * | 2011-08-02 | 2014-07-22 | International Business Machines Corporation | Technique for compiling and running high-level programs on heterogeneous computers |
US8938725B2 (en) | 2011-08-02 | 2015-01-20 | International Business Machines Corporation | Technique for compiling and running high-level programs on heterogeneous computers |
US20130036408A1 (en) * | 2011-08-02 | 2013-02-07 | International Business Machines Corporation | Technique for compiling and running high-level programs on heterogeneous computers |
US8776035B2 (en) * | 2012-01-18 | 2014-07-08 | International Business Machines Corporation | Providing performance tuned versions of compiled code to a CPU in a system of heterogeneous cores |
US20130185704A1 (en) * | 2012-01-18 | 2013-07-18 | International Business Machines Corporation | Providing performance tuned versions of compiled code to a cpu in a system of heterogeneous cores |
US9195443B2 (en) | 2012-01-18 | 2015-11-24 | International Business Machines Corporation | Providing performance tuned versions of compiled code to a CPU in a system of heterogeneous cores |
US9652817B2 (en) | 2015-03-12 | 2017-05-16 | Samsung Electronics Co., Ltd. | Automated compute kernel fusion, resizing, and interleave |
US11755382B2 (en) * | 2017-11-03 | 2023-09-12 | Coherent Logix, Incorporated | Programming flow for multi-processor system |
Also Published As
Publication number | Publication date |
---|---|
US8386751B2 (en) | 2013-02-26 |
US20110131393A1 (en) | 2011-06-02 |
WO2006017482A2 (en) | 2006-02-16 |
US20070294511A1 (en) | 2007-12-20 |
US20060015703A1 (en) | 2006-01-19 |
WO2006017482A3 (en) | 2007-11-15 |
US20130238878A1 (en) | 2013-09-12 |
US8880850B2 (en) | 2014-11-04 |
US7721069B2 (en) | 2010-05-18 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US20060026578A1 (en) | Programmable processor architecture hirarchical compilation | |
US7278137B1 (en) | Methods and apparatus for compiling instructions for a data processor | |
US20190004878A1 (en) | Processors, methods, and systems for a configurable spatial accelerator with security, power reduction, and performance features | |
KR101713815B1 (en) | A tile-based processor architecture model for high efficiency embedded homogeneous multicore platforms | |
Lodi et al. | A VLIW processor with reconfigurable instruction set for embedded applications | |
US6826674B1 (en) | Program product and data processor | |
US20130290693A1 (en) | Method and Apparatus for the Automatic Generation of RTL from an Untimed C or C++ Description as a Fine-Grained Specialization of a Micro-processor Soft Core | |
Glossner et al. | The sandbridge sb3011 platform | |
David et al. | DART: A Dynamically Reconfigurable Architecture Dealing with Future Mobile Telecommunications Constraints. | |
US20220100680A1 (en) | Apparatuses, methods, and systems for a configurable accelerator having dataflow execution circuits | |
Voitsechov et al. | Inter-thread communication in multithreaded, reconfigurable coarse-grain arrays | |
Lodi et al. | A pipelined configurable gate array for embedded processors | |
US9329872B2 (en) | Method and apparatus for the definition and generation of configurable, high performance low-power embedded microprocessor cores | |
Sias et al. | Enhancing loop buffering of media and telecommunications applications using low-overhead predication | |
Gesper et al. | Evaluation of different processor architecture organizations for on-site electronics in Harsh environments | |
US20120017067A1 (en) | On-demand predicate registers | |
Glossner et al. | The sandbridge sandblaster communications processor | |
Podobas | Accelerating parallel computations with openmp-driven system-on-chip generation for fpgas | |
Balfour | Efficient embedded computing | |
Glossner et al. | A multithreaded processor architecture for SDR | |
Vassiliadis et al. | The arise approach for extending embedded processors with arbitrary hardware accelerators | |
Chen et al. | Flexible heterogeneous multicore architectures for versatile media processing via customized long instruction words | |
Raghavan et al. | Distributed loop controller for multithreading in unithreaded ILP architectures | |
Nguyen et al. | Design and Implementation of a Coarse-grained Dynamically Reconfigurable Multimedia Accelerator | |
Jungeblut et al. | A systematic approach for optimized bypass configurations for application-specific embedded processors |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
AS | Assignment | | Owner name: 3PLUS1 TECHNOLOGY, INC., CALIFORNIA. Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:RAMCHANDRAN, AMIT;HAUSER, JR., JOHN REID;REEL/FRAME:016985/0701. Effective date: 20050802
STCB | Information on status: application discontinuation | | Free format text: ABANDONED -- FAILURE TO RESPOND TO AN OFFICE ACTION
AS | Assignment | | Owner name: 3PLUS1 TECHNOLOGY, INC., CALIFORNIA. Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:HAUSER, JOHN REID, JR.;RAMCHANDRAN, AMIT;SIGNING DATES FROM 20100806 TO 20100816;REEL/FRAME:024878/0807
AS | Assignment | | Owner name: ICELERO LLC, DELAWARE. Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNOR:3PLUS1 TECHNOLOGY INC.;REEL/FRAME:024892/0154. Effective date: 20100818
AS | Assignment | | Owner name: SATELLITE INVESTORS, LLC, DELAWARE. Free format text: SECURITY AGREEMENT;ASSIGNOR:ICELERO, LLC;REEL/FRAME:024900/0800. Effective date: 20100818