US20060026578A1 - Programmable processor architecture hierarchical compilation - Google Patents


Info

Publication number
US20060026578A1
US20060026578A1 (application US11/195,429)
Authority
US
United States
Prior art keywords
block
code
sub
processor
scenario
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Abandoned
Application number
US11/195,429
Inventor
Amit Ramchandran
John Hauser
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Icelero LLC
Original Assignee
3Plus1 Technology Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 3Plus1 Technology Inc filed Critical 3Plus1 Technology Inc
Priority to US11/195,429 priority Critical patent/US20060026578A1/en
Assigned to 3PLUS1 TECHNOLOGY, INC. reassignment 3PLUS1 TECHNOLOGY, INC. ASSIGNMENT OF ASSIGNORS INTEREST (SEE DOCUMENT FOR DETAILS). Assignors: HAUSER, JR., JOHN REID, RAMCHANDRAN, AMIT
Publication of US20060026578A1 publication Critical patent/US20060026578A1/en
Assigned to 3PLUS1 TECHNOLOGY, INC. reassignment 3PLUS1 TECHNOLOGY, INC. ASSIGNMENT OF ASSIGNORS INTEREST (SEE DOCUMENT FOR DETAILS). Assignors: RAMCHANDRAN, AMIT, HAUSER, JOHN REID, JR.
Assigned to ICELERO LLC reassignment ICELERO LLC ASSIGNMENT OF ASSIGNORS INTEREST (SEE DOCUMENT FOR DETAILS). Assignors: 3PLUS1 TECHNOLOGY INC.
Assigned to SATELLITE INVESTORS, LLC reassignment SATELLITE INVESTORS, LLC SECURITY AGREEMENT Assignors: ICELERO, LLC
Abandoned legal-status Critical Current


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30 Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/38 Concurrent instruction execution, e.g. pipeline, look ahead
    • G06F9/3877 Concurrent instruction execution, e.g. pipeline, look ahead using a slave processor, e.g. coprocessor
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F1/00 Details not covered by groups G06F3/00 - G06F13/00 and G06F21/00
    • G06F1/26 Power supply means, e.g. regulation thereof
    • G06F1/32 Means for saving power
    • G06F1/3203 Power management, i.e. event-based initiation of a power-saving mode
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F15/00 Digital computers in general; Data processing equipment in general
    • G06F15/76 Architectures of general purpose stored program computers
    • G06F15/78 Architectures of general purpose stored program computers comprising a single central processing unit
    • G06F15/7807 System on chip, i.e. computer system on a single chip; System in package, i.e. computer system on one or more chips in a single package
    • G06F15/781 On-chip cache; Off-chip memory
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F15/00 Digital computers in general; Data processing equipment in general
    • G06F15/76 Architectures of general purpose stored program computers
    • G06F15/78 Architectures of general purpose stored program computers comprising a single central processing unit
    • G06F15/7839 Architectures of general purpose stored program computers comprising a single central processing unit with memory
    • G06F15/7842 Architectures of general purpose stored program computers comprising a single central processing unit with memory on one IC chip (single chip microcontrollers)
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F15/00 Digital computers in general; Data processing equipment in general
    • G06F15/76 Architectures of general purpose stored program computers
    • G06F15/78 Architectures of general purpose stored program computers comprising a single central processing unit
    • G06F15/7867 Architectures of general purpose stored program computers comprising a single central processing unit with reconfigurable architecture
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F15/00 Digital computers in general; Data processing equipment in general
    • G06F15/76 Architectures of general purpose stored program computers
    • G06F15/80 Architectures of general purpose stored program computers comprising an array of processing units with common control, e.g. single instruction multiple data processors
    • G06F15/8053 Vector processors
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30 Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/30003 Arrangements for executing specific machine instructions
    • G06F9/30007 Arrangements for executing specific machine instructions to perform operations on data operands
    • G06F9/3001 Arithmetic instructions
    • G06F9/30014 Arithmetic instructions with variable precision
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30 Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/30003 Arrangements for executing specific machine instructions
    • G06F9/30007 Arrangements for executing specific machine instructions to perform operations on data operands
    • G06F9/30032 Movement instructions, e.g. MOVE, SHIFT, ROTATE, SHUFFLE
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30 Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/30003 Arrangements for executing specific machine instructions
    • G06F9/30007 Arrangements for executing specific machine instructions to perform operations on data operands
    • G06F9/30036 Instructions to perform operations on packed data, e.g. vector, tile or matrix operations
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30 Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/38 Concurrent instruction execution, e.g. pipeline, look ahead
    • G06F9/3824 Operand accessing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30 Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/38 Concurrent instruction execution, e.g. pipeline, look ahead
    • G06F9/3824 Operand accessing
    • G06F9/3826 Bypassing or forwarding of data results, e.g. locally between pipeline stages or within a pipeline stage
    • G06F9/3828 Bypassing or forwarding of data results, e.g. locally between pipeline stages or within a pipeline stage with global bypass, e.g. between pipelines, between clusters
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30 Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/38 Concurrent instruction execution, e.g. pipeline, look ahead
    • G06F9/3885 Concurrent instruction execution, e.g. pipeline, look ahead using a plurality of independent parallel functional units
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30 Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/38 Concurrent instruction execution, e.g. pipeline, look ahead
    • G06F9/3885 Concurrent instruction execution, e.g. pipeline, look ahead using a plurality of independent parallel functional units
    • G06F9/3893 Concurrent instruction execution, e.g. pipeline, look ahead using a plurality of independent parallel functional units controlled in tandem, e.g. multiplier-accumulator
    • G06F9/3895 Concurrent instruction execution, e.g. pipeline, look ahead using a plurality of independent parallel functional units controlled in tandem, e.g. multiplier-accumulator for complex operations, e.g. multidimensional or interleaved address generators, macros

Definitions

  • This invention relates generally to the field of processors and, more particularly, to processors having low power consumption, high performance, and low die area that can be flexibly and scalably employed in multimedia and communications applications.
  • One of the emerging, if not current, requirements of processors is low power consumption combined with the ability to execute code sufficient to process multiple applications. Current power consumption is on the order of sub-hundreds of milliwatts per application, whereas the goal is to remain under sub-hundreds of milliwatts while executing multiple applications. Another requirement of processors is low cost. Due to the wide utilization of processors in consumer products, the processor must be inexpensive to manufacture; otherwise, its use in most common consumer electronics is not pragmatic.
  • RISC and Super Scalar processors have been the most widely accepted architectural solution for all general purpose computing. They are often enhanced with application specific accelerators for solving certain specialized problems within the context of a general solution.
  • Examples include: ARM series, ARC series, StrongARM series, and MIPS series.
  • VLIW architectures eliminated some of the inefficiencies found in RISC and Super Scalar architectures to create a fairly general solution in the digital signal processing space. Parallelism was significantly increased. The onus of scheduling was transferred from hardware to software to save area.
  • Examples include: TI 64xx, TI 55xx, StarCore SC140, ADI SHARC series.
  • one embodiment of the present invention includes a heterogeneous, high-performance, scalable processor having at least one W-type sub-processor capable of processing W bits or greater in parallel, W being an integer value, and at least one N-type sub-processor capable of processing N bits in parallel, N being an integer value smaller than W.
  • a scenario compiler is included in a hierarchical flow of compilation and used with other compilation and assembler blocks to generate binary code based on different types of codes to allow for efficient processing based on the sub-processors while maintaining low power consumption when the binary code is executed.
  • FIG. 1 shows an application 10 with reference to a digital product 12 including an embodiment of the present invention
  • FIG. 2 shows an exemplary integrated circuit 20 including a heterogeneous, high-performance, scalable processor 22 coupled to a memory controller and direct memory access (DMA) circuit 24 in accordance with an embodiment of the present invention.
  • FIG. 3 shows, in conceptual form, an architecture 300 including software architecture 302 in combination with some of the hardware components 304 of the circuit 20 of FIG. 1 .
  • FIG. 4 shows, in conceptual form, the process of hierarchical software compilation, in block flow form, in accordance with a method of the present invention.
  • FIG. 5 shows, in conceptual form, the process of and apparatus for latency verification used in assembly coding and included within the block 434 of FIG. 4 in accordance with a method and apparatus of the present invention.
  • a sub-processor (“CoolProcessor”) employing logic “macro-functional-units” (function-oriented dedicated logic), replacing the classical fixed datapath, complex execution unit, and register set used in general purpose CPUs and DSP engines, and also replacing the identical processing element used in homogeneous multiprocessors (MSs).
  • the sub-processor employs a heterogeneous interconnect between heterogeneous processors, designed to match multimedia and communications applications.
  • a processor 22 comprises a plurality of sub-processors.
  • the four sub-processors are split into two categories.
  • the letter “W” designates CoolW sub-processors, capable of handling operands requiring wide datapaths.
  • the CoolW sub-processor supports a wider range of data bits.
  • the sub-processor is also capable of executing 64-bit IEEE-standard floating-point instructions. Its performance is greater than 49 MFLOPS at 150 MHz.
  • the floating-point instruction set includes addition, subtraction, and multiplication.
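  • The figures quoted above imply roughly one floating-point result every three clock cycles. The function below is only a unit-conversion sanity check on those stated numbers; its name and parameters are illustrative, not from the patent.

```c
#include <stdio.h>

/* Cycles available per floating-point operation, given a throughput in
 * MFLOPS and a clock in MHz. The example figures (49 MFLOPS, 150 MHz)
 * come from the text above. */
double cycles_per_flop(double mflops, double clock_mhz)
{
    /* (cycles per second) / (FLOPs per second) = cycles per FLOP */
    return clock_mhz / mflops;
}
```

At 49 MFLOPS and a 150 MHz clock this gives about 3.06 cycles per 64-bit floating-point operation.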
  • the letter “N” indicates a CoolN sub-processor serving narrow datapaths, such as required for average-quality imaging and finite-field operations in communications.
  • Each sub-processor comprises a heterogeneous, software-programmable datapath connecting its compute engines, whether of the CoolW sub-processor type or of the CoolN sub-processor type.
  • the internal compute engines are referred to as MFUs (macro-functional units). Multiple instances of the MFUs are nonuniformly distributed between the two types of sub-processors.
  • a control circuit within each sub-processor operates as an engine and is a high-level-language programmable controller for the sub-processor.
  • the control circuit is aided by a core sequencer underscoring the hard-wired nature of the MFUs: each unit is aimed at executing efficiently only a fraction of the overall job.
  • a rather large instruction memory, per sub-processor, holds code for the control circuit, internal interconnects, I/O, and MFUs requiring it.
  • Sufficient shared buffer memory is provided to store operands and results of complex computations that make average demands on operand life.
  • a general purpose processor (referred to as “ARM926” in the provisional application) runs system software and generic applications (applications other than multimedia and communications).
  • the GPP includes its own instruction and data memory or cache.
  • the interconnect is based on the Sonics “smart” SoC bus.
  • An SoC architecture can include any number of sub-processors, but the number of sub-processors defines the number of threads, as will be apparent shortly.
  • In FIG. 1 , an application 10 is shown with reference to a digital product 12 including an embodiment of the present invention.
  • FIG. 1 is intended to provide the reader with a perspective regarding some, but not necessarily all, of the advantages of a product, which includes an embodiment of the present invention relative to those available in the marketplace.
  • the product 12 is a converging product in that it incorporates all of the applications that need to be executed by today's mobile phone device 14 , digital camera device 16 , digital recording or music device 18 and PDA device 20 .
  • the product 12 is capable of executing one or more of the functions of the devices 14 - 20 simultaneously yet utilizing less power.
  • the product 12 is typically battery-operated and therefore consumes little power even when executing multiple of the applications executed by the devices 14 - 20 . It is also capable of executing code to effectuate operations in conformance with a multitude of applications including but not limited to: H.264, MPEG-4, UWB, Bluetooth, 2G/2.5G/3G/4G, GPS, MP3 and security.
  • FIG. 2 shows an exemplary integrated circuit 20 including a heterogeneous, high-performance, scalable processor 22 coupled to a memory controller and direct memory access (DMA) circuit 24 in accordance with an embodiment of the present invention.
  • the processor 22 is coupled to the interface circuit 26 through a general purpose bus 30 , to the interface circuit 28 through a general purpose bus 31 , and further coupled to a general purpose processor 32 through the bus 31 .
  • the circuit 20 is further shown to include a clock, reset and power management circuit 34 for generating a clock utilized by the remaining circuits of the circuit 20 , a reset signal utilized in the same manner, and circuitry for managing power for the same.
  • the interface circuit 26 shown coupled to the bus 30 and interface circuit 28 , shown coupled to the bus 31 , include the blocks 40 - 66 , which are generally known to those of ordinary skill in the art and used by current processors.
  • the processor 22 which is a heterogeneous multi-processor, is shown to include shared data memory 70 , shared data memory 72 , a CoolW sub-processor (or block) 74 , a CoolW sub-processor (or block) 76 , a CoolN sub-processor (or block) 78 and a CoolN sub-processor (or block) 80 .
  • Each of the blocks 74 - 80 has associated therewith an instruction memory, for example, the CoolW block 74 has associated therewith an instruction memory 82 , the CoolW block 76 has associated therewith an instruction memory 84 , CoolN block 78 has associated therewith an instruction memory 86 and the CoolN block 80 has associated therewith an instruction memory 88 .
  • each of the blocks 74 - 80 has associated therewith a control block.
  • the block 74 has associated therewith a control block 90
  • the block 76 has associated therewith a control block 92
  • the block 78 has associated therewith a control block 94
  • the block 80 has associated therewith a control circuit 96 .
  • the blocks 74 and 76 are designed to generally operate efficiently for 16-, 24-, 32- and 64-bit operations or applications.
  • the blocks 78 and 80 are designed to generally operate efficiently for 1, 4, or 8-bit operations or applications.
  • the blocks 74 - 80 are essentially sub-processors; the CoolW blocks 74 and 76 are wide (or W) type blocks, whereas the CoolN blocks 78 and 80 are narrow (or N) type blocks. Wide and narrow refer to the relative number of parallel bits processed or routed within a sub-processor, which gives the processor 22 its heterogeneous characteristic.
  • the circuit 24 is coupled directly to one of the sub-processors, i.e. one of the blocks 74 - 80 resulting in the lowest latency path through the sub-processor to which it is coupled. In FIG. 2 , the circuit 24 is shown directly coupled to the block 76 although it may be coupled to any of the blocks 74 , 78 or 80 . Higher priority agents or tasks may be assigned to the block which is directly coupled to the circuit 24 .
  • Complicated applications requiring great processing power are not scattered in the circuit 20 , rather, they are grouped or confined to a particular sub-processor or block for processing, which substantially improves power consumption by eliminating or at least reducing wire (metal) or routing lengths thereby reducing wire capacitance. Additionally, utilization is increased and activity is reduced contributing to lower power consumption.
  • the circuit 20 is an example of a system on chip (or SoC) offering Quasi-Adiabatic Programmable sub-processors for multimedia and communications applications; two types of sub-processors are included, as previously indicated: W type and N type.
  • The W type or Wide type sub-processor is designed for high Power, Price, Performance efficiency in applications requiring 16-, 24-, 32- and 64-bit processing.
  • The N type or Narrow type sub-processor is designed for high efficiency in applications requiring 8-, 4- and 1-bit processing. While these bit widths are used in the embodiments of the present invention, by way of figures and description, other numbers of bits may be readily employed.
  • Such applications include, but are not limited to, fast Fourier transform (FFT) or inverse FFT (IFFT), adaptive finite impulse response (FIR) filters, discrete cosine transform (DCT) or inverse DCT (IDCT), Real/Complex FIR filter, IIR filter, Root Raised Cosine (RRC) filter, Color Space Converter, 3D Bilinear Texture Mapping, Gouraud Shading, Golay Correlation, Bilinear Interpolation, Median/Row/Column Filter, Alpha Blending, Higher-Order Surface Tessellation, Vertex Shade (Trans/Light), Triangle Setup, Full-Screen Anti-aliasing and Quantization.
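  • As a point of reference for the workload such kernels represent, a real FIR filter reduces to the multiply-accumulate loop below. This plain-C version is only an illustration of the computation; on the processor described it would be hand-coded in assembly for a CoolW sub-processor, and the function and parameter names are assumptions, not taken from the patent.

```c
#include <stddef.h>

/* Real FIR filter, "valid" convolution: y[n] = sum_k h[k] * x[n+taps-1-k].
 * Produces x_len - taps + 1 outputs. The inner multiply-accumulate is
 * the kind of wide-datapath work mapped onto a W-type sub-processor's
 * macro-functional units. */
void fir_filter(const float *x, size_t x_len,
                const float *h, size_t taps,
                float *y)
{
    for (size_t n = 0; n + taps <= x_len; n++) {
        float acc = 0.0f;
        for (size_t k = 0; k < taps; k++)
            acc += h[k] * x[n + taps - 1 - k];
        y[n] = acc;
    }
}
```

With a two-tap filter h = {1, 1} this degenerates to a moving sum of adjacent samples.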
  • Applications suited to N type sub-processors, such as the blocks 78 and 80 , include, but are not limited to, Variable Length Codec, Viterbi Codec, Turbo Codec, Cyclic Redundancy Check, Walsh Code Generator, Interleaver/De-Interleaver, LFSR, Scrambler, De-spreader, Convolution Encoder, Reed-Solomon Codec, Scrambling Code Generator, and Puncturing/De-puncturing.
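  • To illustrate why such kernels suit a narrow datapath, the sketch below implements an LFSR-driven scrambler in plain C; the work is entirely single-bit. The polynomial, seed, and names are illustrative assumptions, not details from the patent.

```c
#include <stddef.h>
#include <stdint.h>

/* One step of a 16-bit Galois LFSR (polynomial mask 0xB400). Each step
 * is 1-bit work: the granularity the N-type (CoolN) sub-processors are
 * designed around. */
uint16_t lfsr_step(uint16_t state)
{
    uint16_t lsb = state & 1u;
    state >>= 1;
    if (lsb)
        state ^= 0xB400u;
    return state;
}

/* Scramble (or, applied twice with the same seed, de-scramble) a buffer
 * by XORing it with the LFSR keystream. Returns the final LFSR state so
 * a stream can be processed in pieces. */
uint16_t scramble(uint8_t *buf, size_t len, uint16_t state)
{
    for (size_t i = 0; i < len; i++) {
        uint8_t keystream = 0;
        for (int b = 0; b < 8; b++) {
            keystream = (uint8_t)((keystream << 1) | (state & 1u));
            state = lfsr_step(state);
        }
        buf[i] ^= keystream;
    }
    return state;
}
```

Because scrambling is an XOR, running `scramble` twice with the same seed restores the original data, which is how a descrambler on the receiving side recovers it.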
  • Both W and N type sub-processors are capable of keeping net activity and the resulting energy per transition low while maintaining high performance with increased utilization in comparison with existing architectural approaches like RISC, Reconfigurable, Superscalar, VLIW and Multi-processor approaches.
  • the sub-processor architecture of the processor 22 reduces die size resulting in an optimal processing solution and includes a novel architecture referred to as “Quasi-Adiabatic” or “COOL” architecture.
  • Programmable processors in accordance therewith are referred to as Quasi-Adiabatic Programmable or COOL Processors.
  • Quasi-Adiabatic Programmable or COOL Processors optimize data path, control, memory and functional unit granularity to match a finite subset of applications, as described previously. The way in which this is accomplished will be clear relative to a discussion and presentation of figures relating to the different units or blocks or circuits and their inter-operations of the processor 22 , as presented below.
  • Adiabatic processes waste no heat and transfer all of the energy used into performing useful work. Due to the non-adiabatic nature of existing standard processes, circuit design, and logic cell library design techniques, one cannot ever make a truly adiabatic processor. However, among the different possible processor architectures, some may be closer to adiabatic than others.
  • the various embodiments of the present invention show a class of processor architectures which are significantly closer to adiabatic than the architectures of the prior art while nevertheless remaining programmable. They are referred to as “Quasi-Adiabatic Programmable Processors”.
  • the integrated circuit 20 allows as many applications as can be supported by the resources within the processor 22 to be executed together or concurrently and the number of such applications far exceeds that which is supported by current processors.
  • Examples of applications that can be simultaneously or concurrently executed by the integrated circuit 20 include, but are not limited to, downloading an application from a wireless device while decoding a movie that has been received; thus, a movie can be downloaded and decoded simultaneously.
  • Due to achieving simultaneous application execution on the integrated circuit 20 which has a small die size or silicon real estate as compared to the number of applications it supports, costs of manufacturing the integrated circuit are significantly lower than that which is required for multiple devices of FIG. 1 .
  • the processor 22 offers a single programmable framework to a user to implement multiple functions, such as multimedia complex applications. Of important value is the ability of the integrated circuit 20 and namely, the processor 22 , to support future standards adopted by the industry, which are expected to be of greater complexity than that of today's standards.
  • Each of the blocks 74 - 80 can execute only one sequence (or stream) of programs at a given time.
  • a sequence of programs refers to a function associated with a particular application.
  • FFT is a type of sequence.
  • different sequences may be dependent on one another.
  • an FFT program, once completed, may store its results in the memory 70 , and the next sequence may then use the stored result.
  • Different sequences sharing information in this manner, or being dependent upon each other in this manner, are referred to as a stream flow.
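  • The dependency just described can be modeled as a producer sequence that stores its result in shared memory and raises a completion flag, and a consumer sequence that runs only once the flag is set. This is a behavioral sketch only; the buffer, flag, and function names are assumptions, and the actual sub-processors would signal completion in hardware rather than through a C variable.

```c
#include <stdbool.h>

#define BLOCK_WORDS 256

int  shared_mem[BLOCK_WORDS];   /* stands in for shared data memory 70 */
bool result_ready = false;      /* completion signal from the producer */

/* First sequence (e.g. an FFT): store results, then signal completion. */
void producer_sequence(const int *result, int n)
{
    for (int i = 0; i < n; i++)
        shared_mem[i] = result[i];
    result_ready = true;
}

/* Dependent sequence: refuses to run until the stored result exists,
 * then consumes it (here, trivially, by summing it). */
int consumer_sequence(int n)
{
    if (!result_ready)
        return -1;              /* nothing to consume yet */
    int sum = 0;
    for (int i = 0; i < n; i++)
        sum += shared_mem[i];
    result_ready = false;       /* result has been consumed */
    return sum;
}
```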
  • the memories 70 and 72 each include 8 blocks of 16 kilobytes of memory, however, in other embodiments, different size memory may be utilized.
  • the instruction memories 82 , 84 , 86 and 88 are used to store instructions for execution by the blocks 74 - 80 , respectively.
  • FIG. 3 shows, in conceptual form, an architecture 300 including software architecture 302 in combination with some of the hardware components 304 of the circuit 20 of FIG. 1 .
  • the hardware components 304 includes the processor 32 , the circuit 26 and circuit 28 and the processor 22 , as described and shown with respect to previous figures.
  • a hardware abstraction layer or low level drivers 306 and an operating systems driver 308 cause interfacing or communication between the hardware components 304 and the software architecture 302 .
  • the software architecture 302 is further shown to include a CoolBios (basic input/output system) 310 coupled to the hardware components 304 and to a scenario 312 , which is for causing multiple applications 314 to be executed, each application 314 including kernels 316 for execution of computationally-intense functions, such as fast Fourier transforms (FFTs), DCTs, finite impulse response (FIR) filtering and others known in the industry.
  • the software architecture 302 is further shown to include a system level software changes scenarios block 318 , which is shown to communicate with an operating systems interface (OSI) 322 and an operating system 320 .
  • the operating system 320 is further shown to communicate with the scenario 312 , applications 314 , and kernels 316 .
  • the kernels 316 are engines for execution of computationally intensive code, generally in assembly, or low level code.
  • Each of the applications 314 includes many kernels, such as the kernels 316 : conditional encoding (CE), cyclic redundancy coding (CRC), down sampling (DS), variable length coding (VLC), discrete cosine transform (DCT), motion estimation (ME), motion compensation (MC), etc., that consume most of the compute time in an application.
  • the scenario-level software 310 contains hooks to quasi-statically change the execution pattern of applications contained within that scenario.
  • the scenario 318 causes scenarios to be changed while running on the hardware 304 .
  • each of the kernels 316 is written in assembly code for executing an FFT or other computationally-intensive functions while the scenario 312 and each of the applications 314 are in a higher level language, such as “C” for reasons that will become apparent shortly.
  • the combination of assembly and a higher level language being executed on a CoolW or CoolN sub-processor and a control block included therein, per the hardware architecture of FIG. 2 , causes simultaneous or concurrent execution of applications, in a hierarchical manner, while maintaining low power consumption.
  • the CoolBios 310 includes a set of software functions that allow input and output communication with the processor 22 and eliminates the need for a full operating system running on the processor 22 .
  • the hardware component 304 and software architecture 302 provide an environment to load and execute a multi-application scenario.
  • a “scenario”, as referred to herein, is a set of applications, such as the applications 314 , executing concurrently. Some examples of each of the applications 314 , as shown in FIG. 3 , include but are not limited to JPEG, MP3, H.264 and 802.11g.
  • a scenario 312 interfaces with the operating system 320 and higher-level software through the OSI 322 and the drivers 308 .
  • the software architecture 302 and the hardware components 304 of FIG. 3 allow an operating system (OS) to be loaded onto the processor 32 along with the drivers 308 , to ultimately allow a scenario 312 to be loaded for causing multiple applications to be executed concurrently.
  • the scenario 312 includes information, in its header, overhead information, to cause turning on or off each of the different applications 314 .
  • the JPEG application can be turned off while the remaining applications, such as MP3, H.264 and 802.11g remain on. This effectively aids in reducing power consumption, as the need for power is reduced when an application that is not currently being used is turned off.
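  • One minimal way to realize the on/off information just described is a bitmask in the scenario header, one bit per application. The patent does not specify the header layout; the struct and names below are purely illustrative.

```c
#include <stdint.h>

/* Hypothetical scenario header: bit i of active_mask is set when
 * application i in the scenario is turned on. */
enum { APP_JPEG = 0, APP_MP3 = 1, APP_H264 = 2, APP_80211G = 3 };

typedef struct {
    uint32_t active_mask;
} scenario_header;

void app_on(scenario_header *h, int app)
{
    h->active_mask |= (uint32_t)1 << app;
}

void app_off(scenario_header *h, int app)
{
    h->active_mask &= ~((uint32_t)1 << app);
}

int app_is_on(const scenario_header *h, int app)
{
    return (int)((h->active_mask >> app) & 1u);
}
```

Turning JPEG off while MP3, H.264 and 802.11g stay on is then a single bit clear, matching the power-saving behavior described above.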
  • Remaining processing power, i.e. that which is not currently being used, may be devoted to executing a new application, with some limitations, as are now discussed.
  • the scenario 312 is pre-compiled and quasi-statically scheduled, which refers to turning applications on or off.
  • the pre-compiled and scheduled scenario 312 , which is in binary form, is then stored in one of the sub-processors, such as the sub-processor 74 . Turning off an application prevents “choking” of the system; that is, bandwidth is improved.
  • the system level software changes scenarios 318 causes changing of the scenario 312 , which, as previously-stated, may be done dynamically.
  • the code in the latter is in “C” or a high level code.
  • the scenario 312 is written in scenario descriptive language (SDL), which is a unique and proprietary language with all rights reserved by 3Plus1 Technology, Inc. of Saratoga, Calif.
  • the hierarchical software/compiler characteristics of the architecture 300 are shown, in conceptual form, in reference to a software tools hierarchical column 340 , a hierarchical level column 342 and a hardware hierarchical levels column 344 .
  • Each of the pieces of software of the software architecture 302 is taken through a different tool so as to avoid a flat methodology.
  • the drivers 306 and 308 are used as tools for the general purpose processor (GPP) 32 on the highest level of the tool column 340 while, in the next level of the hierarchical tools, a scenario compiler 348 is used, by an application programmer, to allocate resources and to cause execution on one or more particular sub-processors.
  • the kernels 316 are then advantageously partitioned. An application is divided into smaller portions, or threads, with execution switching from one kernel to another.
  • the number of threads is limited to the number of sub-processors.
  • applications are handed from one kernel to another by having the currently operating kernel 316 finish a particular function, save the result of the function in shared memory and signal completion of its function; another kernel 316 then utilizes the stored information in shared memory to perform another function.
  • a synchronization code is used for this hand-off, which is done by the scenario 312 ; the particular tool is the scenario compiler 348 and the process is automated. Thus, synchronization and control code are generated automatically due to the presence of the thread.
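The hand-off described above can be sketched in C. This is purely illustrative, not taken from the patent: the names, the shared-memory layout and the flag convention are all invented here. The producing kernel saves its result in shared memory and signals completion, and the consuming kernel proceeds only once that signal is seen.

```c
#include <assert.h>
#include <string.h>

enum { BUF_WORDS = 4 };

typedef struct {
    int data[BUF_WORDS];    /* result produced by the first kernel */
    int done;               /* completion flag written by producer */
} shared_block;

/* First kernel: computes a function and signals completion. */
static void kernel_a(shared_block *shm)
{
    for (int i = 0; i < BUF_WORDS; i++)
        shm->data[i] = i * i;   /* stand-in for the real function */
    shm->done = 1;              /* synchronization: signal the hand-off */
}

/* Second kernel: runs only once its trigger condition holds. */
static int kernel_b(shared_block *shm)
{
    if (!shm->done)             /* hand-off not yet signaled */
        return -1;
    int sum = 0;
    for (int i = 0; i < BUF_WORDS; i++)
        sum += shm->data[i];    /* consume the stored result */
    return sum;
}
```

In the architecture described, such flag checks would be emitted automatically by the scenario compiler rather than written by hand.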
  • a controller/compiler 350 is used to compile a high level language being employed, such as “C”, which includes two parts, an optimizing assembler 352 and a low level assembler 354 .
  • the goal is to allow the programmer to write mostly C or high level code, rather than assembly, as the former is easier. This is easily allowed for given the sub-processor and hierarchical architecture of the present invention.
  • the compiler 350 is optimized for each sub-processor, such as CoolW or CoolN. That is, high level code, written by a user or programmer is compiled, pursuant to certain rules, for storage and execution by a sub-processor and a control block located therein, as previously shown and discussed.
  • multiple applications can be performed; for example, a digital camera application and a PDA application can be performed in a single device simultaneously.
  • the ability to do so results in foregoing the dynamic ability to change or add a scenario, as might be done in a personal computer, but this limitation is completely tolerable, as a device that is to be used with a certain scenario need not normally be quickly reprogrammed to include another scenario in mobile handheld device applications.
  • a manufacturer introduces a product, such as a PDA
  • this is compiled along with other applications, such as a digital camera or MP3, etc.
  • a pre-compiled binary code is created using the hierarchical software tools compilation and the sub-processor-based hardware architecture of the present invention.
  • Such a pre-compiled code and multiple applications make up a scenario. While another scenario may be pre-compiled, it is a rare occurrence due to the reluctance of the manufacturer to quickly introduce another product. Given time, another product is likely to be introduced warranting another scenario, but the need to switch to another scenario is far from urgent.
  • column 342 states which part or component of the hardware is utilized for the corresponding tool of column 340 . That is, viewed in a row, each location of the row within column 342 corresponds to a like-location in column 340 .
  • the GPP IDE 346 is handled by the hardware 304
  • the scenario compiler 348 is handled by the processor 22
  • the compiler 350 is handled by a control block of one of the sub-processors
  • the assemblers 352 and 354 are handled by one or more of the sub-processors, such as the sub-processor 72 .
  • the particular hardware hierarchical levels are correspondingly enumerated in column 344 by reference.
  • at the low level assembler 354 , scheduling is done and all of the hardware components are available, whereas the optimizing assembler 352 includes more restrictions because it operates at a higher level but is able to schedule more. Area and power are saved by less scheduling.
  • the hierarchical flow of column 340 and the hardware architecture of the processor of FIG. 2 allow for an efficient, low power and flexible processing tool. In the prior art, while the assembler 354 and the GPP IDE 346 are used, the remainder of the column 340 is not.
  • a hierarchical compilation involves partitioning the application code into a general purpose processor component that allows interaction between the processor 32 and all other hardware components.
  • This general-purpose processor component is mainly to allow switching between different scenarios.
  • Scenarios are compiled to run on a combination of multiple sub-processors that communicate through shared memory.
  • the scenario compiler 348 is the tool that schedules the coarse grain data dependency graph wherein kernels and control code in one or more applications communicate with each other and with the controlling general-purpose processor. Dependencies are resolved to determine trigger conditions based upon which synchronization code is generated to evaluate these conditions at run-time.
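One minimal way to picture the resolved dependencies and the run-time trigger conditions just described, using an invented bitmask representation that the patent does not specify:

```c
#include <assert.h>

/* Illustrative sketch only, not the patent's actual tool output: each
 * kernel in the coarse grain data dependency graph carries a bitmask of
 * the kernels it depends on.  The generated synchronization code boils
 * down to a run-time check that all producer kernels have completed. */

typedef struct {
    unsigned deps;   /* bit i set => this kernel depends on kernel i */
} kernel_node;

/* Trigger condition: ready when every dependency's "done" bit is set. */
static int kernel_ready(const kernel_node *k, unsigned done_mask)
{
    return (k->deps & done_mask) == k->deps;
}
```

Evaluating such a condition at run-time is cheap, which is why the dependency resolution itself can be left entirely to compile time.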
  • the compiler 350 targets the subset of a sub-processor, or the control block located therein (such as the control block 90 ), that executes application control code and the scenario control and synchronization code.
  • the optimizing assembler 352 and the low-level assembler 354 target functions written in sub-processor assembly. They incorporate many scheduling techniques often found in higher level compilers such as register allocation and software pipelining. The binary software objects generated by these assemblers execute either on a CoolW or CoolN processor.
  • the scenario compiler 348 schedules the correct operation of the applications' functions and allocates data resources.
  • the scenario compiler uses scheduling algorithms from the existing art to create the schedule.
  • the scenario compiler 348 emits the source code (a compilable program, written in C) that implements the scenario scheduler.
  • the scenario scheduler implements, in software, the schedule chosen by the scenario compiler. That is, it manages application resources (data—placed into shared and external memory—and functions) that are partitioned among the multiple processor cores contained within the target device.
  • the scheduler ensures the correct sequencing and synchronization of functions and data that are in use within each of the multiple processors.
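A hedged sketch of what the emitted C scheduler might look like in its simplest form (all identifiers are hypothetical; the real generated program is far richer): the schedule chosen at compile time is frozen into a table, and the run-time code walks it, dispatching each function to the sub-processor core it was assigned to.

```c
#include <assert.h>
#include <stddef.h>

typedef void (*kernel_fn)(void);

typedef struct {
    int       core;   /* which sub-processor runs this step            */
    kernel_fn fn;     /* function placed on that core by the compiler  */
} sched_step;

static int run_count;                      /* demo side effect */
static void demo_kernel(void) { run_count++; }

/* Execute, in schedule order, the steps assigned to a given core. */
static int run_core(const sched_step *tab, size_t n, int core)
{
    int executed = 0;
    for (size_t i = 0; i < n; i++)
        if (tab[i].core == core) { tab[i].fn(); executed++; }
    return executed;
}
```

Because the table is data, repartitioning functions among cores is a recompilation of the scenario, not a rewrite of the scheduler.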
  • the scenario compiler 348 also generates code to correctly access peripherals and DMA controllers as referenced by SDL-specific language features based on information about the target heterogeneous multi-processor(s) provided to the scenario compiler 348 .
  • the SDL allows for describing a collection of functionality used in the present invention.
  • the Scenario Description Language (SDL) is a language created for the purpose of creating high-level, abstract descriptions of scenarios and the applications contained within.
  • SDL is compact, human-readable, and scalable.
  • SDL provides language syntax and semantics to describe:
      • the flow of data into and out of the sub-processors and between functions executing on the sub-processor;
      • the amount of storage required to stream data through the applications executing on the sub-processor;
      • the priority of each application to facilitate the creation of a functionally correct schedule that satisfies latency requirements;
      • the amount of data (and its type) produced and consumed by each function;
      • the maximum (worst case) execution time of each function, which is used in the creation of the schedule; and
      • the placement of each function onto W- or N-type sub-processors.
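Since SDL itself is proprietary and its syntax is not given here, the per-function attributes it describes can at least be mirrored as plain C data. The field names below are invented for illustration, together with a trivial check that worst-case execution times fit a frame budget.

```c
#include <assert.h>

typedef enum { CORE_W, CORE_N } core_type;

/* Invented mirror of the attributes SDL is said to describe. */
typedef struct {
    int       bytes_in;      /* data consumed per invocation          */
    int       bytes_out;     /* data produced per invocation          */
    int       stream_bytes;  /* storage needed to stream data through */
    int       priority;      /* used when building the schedule       */
    int       wcet_cycles;   /* maximum (worst case) execution time   */
    core_type placement;     /* W- or N-type sub-processor            */
} sdl_function_desc;

/* A chain of functions fits a frame only if the sum of their
 * worst-case times stays within the frame's cycle budget. */
static int fits_budget(const sdl_function_desc *fns, int n, int budget)
{
    int total = 0;
    for (int i = 0; i < n; i++)
        total += fns[i].wcet_cycles;
    return total <= budget;
}
```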
  • FIG. 4 shows, in conceptual form, the process of hierarchical software compilation, in block flow form, in accordance with a method of the present invention.
  • the blocks that are shown shaded, such as blocks 402 , 412 , 422 , 420 , 424 , 428 , 440 , 430 - 438 , are based on a sub-processor and need be executed thereby, whereas the remaining blocks, having no shading, are independent of the sub-processors.
  • FIG. 4 shows the way in which software is compiled for use by the processor 22 .
  • the scenario compiler 348 of FIG. 3 is the scenario compiler block 418 of FIG. 4 , and the scenario description block 416 , the adjust scenario description and optional optimization block 408 , and the processor-specific data block 422 serve as support for the scenario compiler block 418 .
  • the controller/compiler block 424 is the same as 350 of FIG. 3 .
  • the optimizing assembler block 432 , which receives input from the assembly code block 430 , is the same as the optimizing assembler 352 of FIG. 3 , and the low level assembler and linker block 434 of FIG. 4 is the same as the low level assembler 354 of FIG. 3 .
  • the description of these blocks will not be repeated.
  • the block 416 allows the programmer to meet his/her design goals without having to optimize either the high level code or the assembly code. Having the SDL allows for allocating a function from one block to another block at a high level.
  • the block 416 serves as a street map.
  • the adjust partitioning and kernels of FIG. 21 of the “PROGRAMMABLE PROCESSOR ARCHITECTURE” patent application incorporated herein by reference is the same as the block 408 .
  • the scenario description block 416 serves as input to the scenario compiler block 418 , as does the block 422 .
  • the output of the block 418 serves as input to the block 420 and the block 408 serves as input to the block 416 .
  • the block 416 describes inter-dependencies between the kernels 316 and applications 314 of FIG. 3 .
  • the SDL is used by the block 416 .
  • the optimizing block 410 is used to optimize high level code and assembly code.
  • the block 410 provides input to the existing assembly code block 412 and the existing high level code block 414 .
  • the scenario compiler block 418 receives two sets of information, one is a bottom-up set of information and another is a top-down set of information.
  • An example of the former is the kernels 316 , i.e., FFTs, DCTs, etc., provided through the processor-specific data block 422 to the block 418 .
  • An example of the latter is provided by the block 416 , which is programmed in SDL to serve as control code and for defining inter-dependencies of the kernels and requirements of the application.
  • An example of the requirements of the application is the length of time that can be used for processing a frame of information, among other time-related requirements.
  • the block 418 once provided with the foregoing top-down and bottom-up information, performs a best match process in the form of a schedule.
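The "best match" step can be illustrated with a standard list-scheduling routine from the existing art (which, per the text, the scenario compiler draws on). The bitmask dependency encoding and the priority tie-breaking below are illustrative choices, not the patent's actual algorithm.

```c
#include <assert.h>

/* deps[i] is a bitmask of kernels that must run before kernel i.
 * Writes the chosen execution order into out[]; returns how many
 * kernels were scheduled (fewer than n indicates a dependency cycle). */
static int list_schedule(const unsigned deps[], const int prio[],
                         int n, int out[])
{
    unsigned done = 0;
    int scheduled = 0;
    while (scheduled < n) {
        int best = -1;
        for (int i = 0; i < n; i++) {
            if (done & (1u << i)) continue;             /* already placed */
            if ((deps[i] & done) != deps[i]) continue;  /* not ready yet  */
            if (best < 0 || prio[i] > prio[best]) best = i;
        }
        if (best < 0) return scheduled;  /* cycle: cannot make progress */
        out[scheduled++] = best;
        done |= 1u << best;
    }
    return scheduled;
}
```

Priorities here play the role the text assigns to application priorities: among ready kernels, the higher-priority one is placed first.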
  • the schedule, for example, provides information regarding the inter-dependencies of the sub-processors, the execution of which requires synchronization code for the control circuit of a sub-processor.
  • the schedule information and synchronization information are provided by the block 420 , which receives input from the block 418 .
  • the output of the block 420 is provided as input to the block 424 . Having the block 420 receive its input from the block 418 is generally not done by prior art techniques due to their design/hardware limitations.
  • the hardware architecture based on sub-processors, as shown in previous figures and the referenced patent document, allows for scheduling and synchronization after the block 418 performs its operation. This allows for the control circuit of each sub-processor to be the same as the others and for the code to be transportable.
  • the non-native compilation and simulation block 428 is for compiling in the absence of a processor; that is, during development, while the hardware is not yet ready, compilation is performed in a “non-native” environment, whereas the native simulation block 440 operates in the native environment.
  • the block 428 allows for both assembly and high level code compilation while a native compiler or the actual compiler to be ultimately employed is not yet ready.
  • an off-the-shelf compiler, i.e., non-native, may be employed and combined with assembly code for simulation. This is sub-processor specific.
  • the time consumed executing the kernels 316 and the time consumed executing control code compete for processing resources.
  • the output of the existing assembly functions block 402 serves as input to the existing assembly code 412 , which also receives input from the block 410 .
  • the output of the block 412 serves as input to the block 432 , which also receives input from the block 430 .
  • the output of the block 432 serves as input to the block 434 and the block 434 serves as input to the block 436 , which provides input to the native simulation block 440 and the implementation complete block 438 .
  • the block 440 provides input to a decision block 442 , which determines whether or not design goals have been met; if so, the block 438 is performed and, if not, either the code is optimized by the block 410 or the scenario description is adjusted by the block 408 .
  • the output of the blocks 412 , 414 and 420 are all provided to the block 428 . That is, assembly code, high level code and schedule and synchronization code are all provided to the block 428 for execution or simulation.
  • the output of the block 428 is provided to a decision block 426 for determining whether or not design goals are met; if so, the process is exited; otherwise, the scenario may be adjusted by the block 408 .
  • the output of the block 414 is provided to the block 424 for compilation thereof and the output of the latter is provided to the block 430 .
  • optimization is done on a partition-basis. That is, high level code is optimized separately from assembly code and from SDL. For example, assembly code is optimized by the block 432 , high level code is optimized by the block 410 and SDL is optimized by the block 416 . This is a divide and conquer approach allowing advantageous optimization of each type of code that is not attainable without such a division of code.
  • the block 424 receives high level code and compiles the same but outputs assembly code to the block 430 , which is optimized by the block 432 .
  • the output of the block 432 is provided to the block 434 for creation of still further low level code and the output of the block 434 is provided to the block 436 for generation of binary object code to be used by a sub-processor.
  • the assembly code that is written by the programmer is provided from the block 412 to the block 432 for assembling.
  • FIG. 5 shows, in conceptual form, the process of and apparatus for latency verification used in assembly coding and included within the block 434 of FIG. 4 in accordance with a method and apparatus of the present invention.
  • instructions 502 are shown to be provided to the block 434 , which will also be referred to as the latency verification block. That is, the current instruction is referred to as “instruction n”, while the instruction previous to the latter is referred to as “instruction n−1”, the instruction previous to that is referred to as “instruction n−2”, the instruction previous to that is referred to as “instruction n−3” and so on. That is, an n−1 instruction is delayed by a program cycle from an n instruction and so on.
  • the block 434 performs various functions, shown in FIG. 5 in blocks or diamond shapes. Such functions include identifying, at 508 , all instructions directly contributing to instruction n, which is then used to determine latency rules relevant to the instructions at 506 , and the result is provided to a decision block at 510 for identifying the earliest instruction. If the earliest instruction is identifiable, the process continues to 514 , checking the latency rule against the register value latencies, and if any error results, the process continues to 516 , at which time an error is reported. If no error is detected at 514 , the next instruction is processed at 518 . The process of FIG. 5 is done during compilation and serves as a check for the programmer.
  • a database of rules is used to determine what the actual latencies are.
  • this determination is made because previous instructions are not necessarily known. For example, the first instruction of a sub-routine is one where its previous instruction is not necessarily known.
  • worst case possibilities are determined.
  • latency rules are checked against the register value latencies. A latency basically involves a delayed or previous instruction. That is, the programmer's annotation is compared to the rules for latency and, if there is a mismatch, an error is reported at 516 . An example of the programmer's annotation is discussed hereinbelow.
  • a computer processor implements strict read-after-write behavior for a register when an instruction that reads the register always obtains the value written by the most recent previously executed instruction that writes the register.
  • To implement strict read-after-write behavior for a register, it is sometimes necessary to delay the execution of an instruction that reads the register by one or more clock cycles from when it would otherwise execute. An instruction so delayed is said to be stalled for the one or more clock cycles of delay.
  • For some processors, particularly those designed for high-performance digital signal processing and related embedded systems, the complexity of the processor has been reduced by not implementing strict read-after-write behavior for all registers. For such a processor, when an instruction reads a register it may not obtain the value written by the most recently executed instruction to write the register, but instead an older value of the register.
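A toy model, not based on any specific processor, of this non-strict read-after-write behavior: a write becomes visible only a fixed number of cycles later, so a read issued too soon observes the older register value.

```c
#include <assert.h>

enum { LATENCY = 2 };   /* illustrative write-to-read latency, in cycles */

typedef struct {
    int value;        /* currently visible register value     */
    int pending;      /* value still in flight                */
    int ready_cycle;  /* cycle at which the pending value lands */
    int has_pending;
} reg_model;

static void reg_write(reg_model *r, int v, int cycle)
{
    /* commit an older in-flight write first, if it has matured */
    if (r->has_pending && cycle >= r->ready_cycle)
        r->value = r->pending;
    r->pending = v;
    r->ready_cycle = cycle + LATENCY;
    r->has_pending = 1;
}

static int reg_read(reg_model *r, int cycle)
{
    if (r->has_pending && cycle >= r->ready_cycle) {
        r->value = r->pending;
        r->has_pending = 0;
    }
    return r->value;   /* the older value, if read too early */
}
```

This deterministic-but-surprising behavior is exactly what the latency rules document, and what the annotation scheme below lets the assembler verify.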
  • the specific behavior is always deterministic and can be documented as a set of latency rules; for some processors these rules, taken together, are quite complex.
  • assembly language programmers have invariably been burdened with observing the latency rules entirely on their own without any automated verification from the programming tools that they are applying the rules correctly. When the latency rules are complex, programmers naturally make avoidable errors that may not be discovered until program testing.
  • an assembly language program is annotated by the programmer as follows:
  • a syntactic annotation is made in the program text to indicate which previous register value the programmer expects the instruction to obtain for the register.
  • the lack of an annotation is either an error or indicates a default assumption.
  • the default assumption could be that the expected value obtained for a register is the value written by the most recent previously executed instruction that writes the register (i.e., the same as strict read-after-write behavior).
  • an annotation is required. For example, if the programmer expects the value obtained for a register to be the value written by the n-th earlier instruction that writes the register (n>1), the annotation could be that a distinctive character be repeated n−1 times adjacent to the register denotation in the instruction. If, for example, this distinctive character is the dollar sign ($), then the assembly language instruction
  • Block 502 has an instruction n together with a sequence of earlier instructions, as previously discussed above. Accordingly, and given the hooks placed into the compiler and assembler of the present invention, programming is made simpler and programmer-friendly, with or without reducing power.
  • Block 434 determines whether the latency annotations are correct for instruction n for this path, while block 520 performs the other usual functions of an assembler for instruction n.
  • In block 508 , the earlier instructions that contribute to the inputs of instruction n are identified.
  • Block 506 determines, from the complete set of latency rules, those rules that are relevant to the interaction between each earlier instruction that contributes to the inputs of instruction n and instruction n itself. Where the instructions that may precede instruction n are unknown (for example, at the entrance to a subroutine), worst-case assumptions must be made (blocks 510 and 512 ). Finally, wherever the annotation made by the programmer does not match the behavior of the actual processor, as determined by the latency rules, an error is reported (blocks 514 and 516 ).
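The check in blocks 514 and 516 can be sketched as a comparison between the programmer's "$" annotation depth and what the latency rules say the hardware delivers. The rule lookup is reduced to a single integer here purely for illustration; the text above describes real rule sets as far more complex.

```c
#include <assert.h>
#include <string.h>

/* Count trailing '$' characters on an operand like "r4$$":
 * n-1 dollar signs means the n-th earlier write is expected. */
static int annotation_depth(const char *operand)
{
    size_t len = strlen(operand);
    int dollars = 0;
    while (len > 0 && operand[len - 1] == '$') { dollars++; len--; }
    return dollars + 1;   /* n: which earlier write is expected */
}

/* Returns 0 if the annotation matches the rule, nonzero (error)
 * otherwise.  hw_depth is which earlier write the latency rules
 * say the processor actually obtains for this read. */
static int check_annotation(const char *operand, int hw_depth)
{
    return annotation_depth(operand) != hw_depth;
}
```

An unannotated operand ("r4") encodes the default assumption of strict read-after-write behavior, i.e. n = 1.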

Abstract

One embodiment of the present invention includes a heterogeneous, high-performance, scalable processor having at least one W-type sub-processor capable of processing W bits or greater in parallel, W being an integer value, and at least one N-type sub-processor capable of processing N bits in parallel, N being an integer value smaller than W. A scenario compiler is included in a hierarchical flow of compilation and used with other compilation and assembler blocks to generate binary code based on different types of code to allow for efficient processing based on the sub-processors while maintaining low power consumption when the binary code is executed.

Description

    CROSS REFERENCE TO RELATED APPLICATION
  • This application claims the benefit of U.S. Provisional Patent Application No. 60/598,417, entitled “Quasi-Adiabatic Programmable Processor Architecture” and filed on Aug. 2, 2004 and is a continuation-in-part of U.S. patent application Ser. No. 11/180,068, filed on Jul. 12, 2005 and entitled “PROGRAMMABLE PROCESSOR ARCHITECTURE”, the disclosures of both of which are incorporated herein by reference as though set forth in full.
  • BACKGROUND OF THE INVENTION
  • 1. Field of the Invention
  • This invention relates generally to the field of processors and, more particularly, to processors having low power consumption, high performance and low die area, and capable of being flexibly and scalably employed in multimedia and communications applications.
  • 2. Description of the Prior Art
  • With the advent of the popularity of consumer gadgets, such as cell or mobile phones, digital cameras, iPods and personal digital assistants (PDAs), many new standards for communication with these gadgets have been adopted by the industry at large. Some of these standards include H264, MPEG4, UWB, Bluetooth, 2G/2.5G/3G/4G, GPS, MP3 and Security. However, an emerging problem is the use of different standards dictating communications of and between different gadgets, requiring tremendous development effort. One of the reasons for the foregoing problem is that no processor or sub-processor, currently available in the marketplace, is easily programmable for use by all digital devices and conforming to the various mandated standards. It is only a matter of time before this problem grows, as new trends in consumer electronics warrant even more standards adopted by the industry in the future.
  • One of the emerging, if not current, requirements of processors is low power consumption yet the ability to cause execution of code sufficient to process multiple applications. Current power consumption is on the order of sub-hundreds of milliwatts per application, whereas the goal is to be under sub-hundreds of milliwatts for executing multiple applications. Another requirement of processors is low cost. Due to the wide utilization of processors in consumer products, the processor must be inexpensive to manufacture; otherwise, its use in most common consumer electronics is not pragmatic.
  • To provide specific examples of current processor problems, the problems associated with RISCs, which are used in some consumer products, microprocessors, which are used in other consumer products, digital signal processors (DSPs), which are used in yet other consumer products, application specific integrated circuits (ASICs), which are used in still other consumer products, and some of the other well-known processors, each exhibiting a unique problem, are briefly described below. These problems, along with advantages of using each, are outlined below in a “Cons” section discussing the disadvantages thereof and a “Pros” section discussing the benefits thereof.
  • A. RISC/Super Scalar Processors
  • RISC and Super Scalar processors have been the most widely accepted architectural solution for all general purpose computing. They are often enhanced with application specific accelerators for solving certain specialized problems within the context of a general solution.
  • Examples include: ARM series, ARC series, StrongARM series, and MIPS series.
  • Pros:
      • Industry wide acceptance has led to a more matured tool chain and wide software choices.
      • A robust programming model has resulted from a very efficient automatic code generator used to generate binaries from high level languages like C.
      • Processors in the category are very good general purpose solutions.
      • Moore's Law can be effectively used for increasing performance.
  • Cons:
      • The general purpose nature of the architecture does not leverage common/specific characteristics of a set or sub-set of applications for better price, power and performance.
      • They consume moderate to high amounts of power with respect to the amount of computation provided.
      • Performance increase is mostly achieved at the expense of pipeline latency which adversely affects several multimedia and communication algorithms.
      • Complicated hardware scheduler, sophisticated control mechanisms and significantly reduced restrictions for more efficient automatic code generation for general algorithms have made this category of solutions less area efficient.
  • B. Very Long Instruction Word (VLIW) and DSPs
  • VLIW architectures eliminated some of the inefficiencies found in RISC and Super Scalar architectures to create a fairly general solution in the digital signal processing space. Parallelism was significantly increased. The onus of scheduling was transferred from hardware to software to save area.
  • Examples include: TI 64xx, TI 55xx, StarCore SC140, ADI SHARC series.
  • Pros:
      • Restricting the solution to the signal processing space improved 3P in comparison with RISC and Super Scalar architectures.
      • VLIW architectures provide higher level of parallelism relative to RISC and superscalar architectures.
      • An efficient tool chain and industry wide acceptance was generated fairly rapidly.
      • Automatic code generation and programmability are showing significant improvements as more processors designed for signal processing fall into this category.
  • Cons:
      • Although problem solving capability is reduced to the digital signal processing space, it is too broad for a general solution like a VLIW machine to have efficient 3P.
      • Control is both expensive and power consuming especially for primitive control code in many multimedia and communication applications.
      • Several power and area inefficient techniques were used to make automatic code generation easy. Strong reliance on these techniques by the software community is carrying forward this inefficiency from generation to generation.
      • VLIW architectures are not well suited for processing serial code.
  • C. Reconfigurable Computing
  • Several efforts in industry and academia over the last 10 years were focused towards making a flexible solution with ASIC like price, power and performance characteristics. Many have challenged existing and matured laws and design paradigms with little industry success. Most of the attempts have been in the direction of creating solutions based on coarser grain FPGA like architectures.
  • Pros:
      • Some designs, restricted to a specific application while providing needed flexibility within that application, proved to be price, power and performance competitive.
      • Research showed that such restricted yet flexible solutions can be created to address many application hotspots.
  • Cons:
      • Several designs in this space did not provide an efficient and easy programming solution and therefore were not widely accepted by a community adept in programming DSPs.
      • Automatic code generation from higher level languages like C was either virtually impossible or highly inefficient for many of the designs.
      • 3P advantage was lost when an attempt was made to combine heterogeneous applications using one type of interconnect and one level of granularity. Degree of utilization of the provided parallelism suffered heavily.
      • Reconfiguration overhead was significant in 3P for most designs.
      • In many cases, the external interface was complicated because the proprietary reconfigurable fabric did not match industry standard system design methodologies.
      • Reconfigurable machines are uni-processors and rely heavily on a tightly integrated RISC even for processing primitive control.
  • D. Array of Processors
  • Some recent approaches are focused on making reconfigurable systems better suited to process heterogeneous applications. Solutions in this direction connect multiple processors optimized for either one or a set of applications to create a processor array fabric.
  • Pros:
      • Different processors optimized for different sets of applications when connected together using an efficient fabric can help solve a wide range of problems.
      • A uniform scaling model allows a number of processors to be connected together as performance requirements increase.
      • Complex algorithms can be efficiently partitioned.
  • Cons:
      • Although performance requirements may be adequately answered, power and price inefficiencies are too high.
      • The programming model varies from processor to processor. This makes the job of the application developer much harder.
      • Uniform scaling of multiple processors is a very expensive and power consuming resource. This has been shown to display some non-determinism that may be detrimental to the performance of the entire system.
      • The programming model at the system level suffers from complexity of communicating data, code and control information without any shared memory resources—since shared memory is not uniformly scalable.
      • Extensive and repetitive glue logic required to connect different types of processors to a homogeneous network adds to the area inefficiencies, increases power and adds to the latency.
  • In light of the foregoing, there is a need for a low-power, inexpensive, efficient, high-performance, flexibly programmable, heterogeneous processor for allowing execution of one or more multimedia applications simultaneously.
  • SUMMARY OF THE INVENTION
  • Briefly, one embodiment of the present invention includes a heterogeneous, high-performance, scalable processor having at least one W-type sub-processor capable of processing W bits or greater in parallel, W being an integer value, and at least one N-type sub-processor capable of processing N bits in parallel, N being an integer value smaller than W. A scenario compiler is included in a hierarchical flow of compilation and used with other compilation and assembler blocks to generate binary code based on different types of code to allow for efficient processing based on the sub-processors while maintaining low power consumption when the binary code is executed.
  • IN THE DRAWINGS
  • FIG. 1 shows an application 10 with reference to a digital product 12 including an embodiment of the present invention.
  • FIG. 2 shows an exemplary integrated circuit 20 including a heterogenous, high-performance, scalable processor 22 coupled to a memory controller and direct memory access (DMA) circuit 24 in accordance with an embodiment of the present invention.
  • FIG. 3 shows, in conceptual form, an architecture 300 including software architecture 302 in combination with some of the hardware components 304 of the circuit 20 of FIG. 2.
  • FIG. 4 shows, in conceptual form, the process of hierarchical software compilation, in block flow form, in accordance with a method of the present invention.
  • FIG. 5 shows, in conceptual form, the process of and apparatus for latency verification used in assembly coding and included within the block 434 of FIG. 4 in accordance with a method and apparatus of the present invention.
  • DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENT
  • A sub-processor (“CoolProcessor”) is provided employing logic “macro-functional-units” (function-oriented dedicated logic), replacing the classical fixed datapath, complex execution unit, and register set used in general purpose CPUs and DSP engines, and also replacing the identical processing elements used in homogeneous multiprocessors.
  • As shown and described below with reference to FIG. 2, the sub-processor employs a heterogeneous interconnect between heterogeneous processors, designed to match multimedia and communications applications.
  • One embodiment of the present invention employs four sub-processors (referred to as “black boxes” or “processors” in the provisional application No. 60/598,417, entitled “Quasi-Adiabatic Programmable Processor Architecture”). In this patent document, a processor 22 comprises a plurality of sub-processors. The four sub-processors are split into two categories. The letter “W” designates CoolW sub-processors, capable of handling operands requiring wide datapaths. The CoolW sub-processor also supports a wider range of data bits. The sub-processor is further capable of executing 64-bit IEEE-standard floating-point instructions. Its performance is greater than 49 MFLOPS at 150 MHz. The floating-point instruction set includes addition, subtraction, and multiplication.
  • The letter “N” indicates a CoolN sub-processor serving narrow datapaths, such as those required for average-quality imaging and finite-field operations in communications. Each sub-processor comprises a heterogeneous, software-programmable datapath connecting compute engines (of one kind in the CoolW sub-processor type and of another kind in the CoolN sub-processor type). The internal compute engines are referred to as macro-functional units (MFUs). Multiple instances of the MFUs are nonuniformly distributed between the two types of sub-processors.
  • A control circuit within each sub-processor operates as an engine and is a high-level-language-programmable controller for the sub-processor. The control circuit is aided by a core sequencer, underscoring the hard-wired nature of the MFUs: each unit is aimed at executing efficiently only a fraction of the overall job. A rather large instruction memory, per sub-processor, holds code for the control circuit, the internal interconnects, I/O, and those MFUs requiring it. Sufficient shared buffer memory is provided to store operands and results of complex computations that make average demands on operand lifetime.
  • A general purpose processor (GPP, referred to as “ARM926” in the provisional application) runs system software and generic applications (applications other than multimedia and communications). The GPP includes its own instruction and data memory or cache.
  • The interconnect is based on the Sonics “smart” SoC bus. An SoC architecture can include any number of sub-processors, but the number of sub-processors defines the number of threads, as will become apparent shortly.
  • Referring now to FIG. 1, an application 10 is shown with reference to a digital product 12 including an embodiment of the present invention. FIG. 1 is intended to provide the reader with a perspective regarding some, but not necessarily all, of the advantages of a product, which includes an embodiment of the present invention relative to those available in the marketplace.
  • Accordingly, the product 12 is a converging product in that it incorporates all of the applications that need to be executed by today's mobile phone device 14, digital camera device 16, digital recording or music device 18 and PDA device 20. The product 12 is capable of executing one or more of the functions of the devices 14-20 simultaneously yet utilizing less power.
  • The product 12 is typically battery-operated and therefore consumes little power even when executing multiple of the applications executed by the devices 14-20. It is also capable of executing code to effectuate operations in conformance with a multitude of applications including but not limited to: H.264, MPEG-4, UWB, Bluetooth, 2G/2.5G/3G/4G, GPS, MP3 and security.
  • FIG. 2 shows an exemplary integrated circuit 20 including a heterogeneous, high-performance, scalable processor 22 coupled to a memory controller and direct memory access (DMA) circuit 24 in accordance with an embodiment of the present invention. As further shown in FIG. 2, the processor 22 is coupled to the interface circuit 26 through a general purpose bus 30, to the interface circuit 28 through a general purpose bus 31, and to a general purpose processor 32 through the bus 31. The circuit 20 further includes a clock, reset and power management circuit 34 for generating a clock utilized by the remaining circuits of the circuit 20, a reset signal utilized in the same manner, and circuitry for managing power for the same. There is further included in the circuit 20 a Joint Test Action Group (JTAG) circuit 36. JTAG is used as a standard for testing chips.
  • The interface circuit 26 shown coupled to the bus 30 and interface circuit 28, shown coupled to the bus 31, include the blocks 40-66, which are generally known to those of ordinary skill in the art and used by current processors.
  • The processor 22, which is a heterogeneous multi-processor, is shown to include shared data memory 70, shared data memory 72, a CoolW sub-processor (or block) 74, a CoolW sub-processor (or block) 76, a CoolN sub-processor (or block) 78 and a CoolN sub-processor (or block) 80. Each of the blocks 74-80 has associated therewith an instruction memory; for example, the CoolW block 74 has associated therewith an instruction memory 82, the CoolW block 76 has associated therewith an instruction memory 84, the CoolN block 78 has associated therewith an instruction memory 86 and the CoolN block 80 has associated therewith an instruction memory 88. Similarly, each of the blocks 74-80 has associated therewith a control block. The block 74 has associated therewith a control block 90, the block 76 has associated therewith a control block 92, the block 78 has associated therewith a control block 94 and the block 80 has associated therewith a control block 96. The blocks 74 and 76 are designed to generally operate efficiently for 16, 24, 32 and 64-bit operations or applications, whereas the blocks 78 and 80 are designed to generally operate efficiently for 1, 4, or 8-bit operations or applications.
  • The blocks 74-80 are essentially sub-processors and the CoolW blocks 74 and 76 are wide (or W) type of blocks, whereas, the CoolN blocks 78 and 80 are narrow (or N) type of blocks. Wide and narrow refers to the relative number of parallel bits processed or routed within a sub-processor and that gives the heterogeneous characteristic of the processor 22. Furthermore, the circuit 24 is coupled directly to one of the sub-processors, i.e. one of the blocks 74-80 resulting in the lowest latency path through the sub-processor to which it is coupled. In FIG. 2, the circuit 24 is shown directly coupled to the block 76 although it may be coupled to any of the blocks 74, 78 or 80. Higher priority agents or tasks may be assigned to the block which is directly coupled to the circuit 24.
  • It should be noted that while four blocks 74-80 are shown, other numbers of blocks may be utilized; however, utilizing additional blocks clearly results in additional die space and higher manufacturing costs.
  • Complicated applications requiring great processing power are not scattered in the circuit 20, rather, they are grouped or confined to a particular sub-processor or block for processing, which substantially improves power consumption by eliminating or at least reducing wire (metal) or routing lengths thereby reducing wire capacitance. Additionally, utilization is increased and activity is reduced contributing to lower power consumption.
  • The circuit 20 is an example of a system on chip (or SoC) offering Quasi-Adiabatic Programmable sub-processors for multimedia and communications applications. Two types of sub-processors are included, as previously indicated: W type and N type. The W type or wide type sub-processor is designed for high power, price and performance efficiency in applications requiring 16, 24, 32 and 64 bits of processing. The N type or narrow type sub-processor is designed for high efficiency in applications requiring 8, 4 and 1 bit of processing. While these bit widths are used in the embodiments of the present invention, by way of figures and description, other numbers of bits may be readily employed.
  • Different applications require different performance or processing capabilities and are thus executed by different types of blocks or sub-processors. Take, for instance, applications that are typically executed by DSPs; they would generally be processed by W type sub-processors, such as the blocks 74 or 76 of FIG. 2, because they characteristically include commonly occurring DSP kernels. Such applications include, but are not limited to, fast Fourier transform (FFT) or inverse FFT (IFFT), adaptive finite impulse response (FIR) filters, discrete cosine transform (DCT) or inverse DCT (IDCT), real/complex FIR filter, IIR filter, Root Raised Cosine (RRC) filter, color space converter, 3D bilinear texture mapping, Gouraud shading, Golay correlation, bilinear interpolation, median/row/column filter, alpha blending, higher-order surface tessellation, vertex shading (transform/light), triangle setup, full-screen anti-aliasing and quantization.
  • Other commonly occurring DSP kernels can be executed by N type sub-processors, such as blocks 78 and 80 and include, but are not limited to, Variable Length Codec, Viterbi Codec, Turbo Codec, Cyclic Redundancy Check, Walsh Code Generator, Interleaver/De-Interleaver, LFSR, Scrambler, De-spreader, Convolution Encoder, Reed-Solomon Codec, Scrambling Code Generator, and Puncturing/De-puncturing.
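  • As a non-limiting illustration of one of the commonly occurring DSP kernels listed above, a plain scalar C reference model of a real FIR filter is sketched below. The function name and loop structure are illustrative assumptions; on a W-type sub-processor this inner loop would instead be mapped onto the wide MFUs.

```c
#include <stddef.h>

/* Scalar reference model of a real FIR filter: y[i] = sum_k h[k]*x[i+k].
 * Computes n - taps + 1 output samples (no edge handling). */
static void fir_real(const float *x, size_t n,
                     const float *h, size_t taps, float *y)
{
    for (size_t i = 0; i + taps <= n; i++) {
        float acc = 0.0f;
        for (size_t k = 0; k < taps; k++)
            acc += h[k] * x[i + k];     /* multiply-accumulate step */
        y[i] = acc;
    }
}
```

A two-tap filter with coefficients {0.5, 0.5} reduces to a moving average, which makes the model easy to check against hand computation.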
  • Both W and N type sub-processors are capable of keeping net activity and the resulting energy per transition low while maintaining high performance with increased utilization, in comparison with existing architectural approaches such as RISC, reconfigurable, superscalar, VLIW and multi-processor architectures. The sub-processor architecture of the processor 22 reduces die size, resulting in an optimal processing solution, and includes a novel architecture referred to as the “Quasi-Adiabatic” or “COOL” architecture. Programmable processors in accordance therewith are referred to as Quasi-Adiabatic Programmable or COOL Processors.
  • Quasi-Adiabatic Programmable or COOL Processors optimize data path, control, memory and functional unit granularity to match a finite subset of applications, as described previously. The way in which this is accomplished will be clear relative to a discussion and presentation of figures relating to the different units or blocks or circuits and their inter-operations of the processor 22, as presented below.
  • “Quasi-Adiabatic Programmable” or Concurrent applications of heterOgeneous intercOnnect and functionaL units (COOL) Processors. In terms of thermodynamics, adiabatic processes do not waste heat and transfer all of the used energy to performing useful work. Due to the non-adiabatic nature of existing standard processes, circuit design, and logic cell library design techniques, one cannot ever make a truly adiabatic processor. However, among the different possible processor architectures, some may be closer to adiabatic than others. The various embodiments of the present invention show a class of processor architectures which are significantly closer to adiabatic as compared to the architectures of prior art, while they are, nevertheless, programmable. They are referred to as “Quasi-Adiabatic Programmable Processors”.
  • The integrated circuit 20 allows as many applications as can be supported by the resources within the processor 22 to be executed together or concurrently and the number of such applications far exceeds that which is supported by current processors. Examples of applications that can be simultaneously or concurrently executed by the integrated circuit 20 include but are not limited to downloading an application from a wireless device while decoding a movie that has been received, thus, a movie can be downloaded and decoded simultaneously. Due to achieving simultaneous application execution on the integrated circuit 20, which has a small die size or silicon real estate as compared to the number of applications it supports, costs of manufacturing the integrated circuit are significantly lower than that which is required for multiple devices of FIG. 1. Additionally, the processor 22 offers a single programmable framework to a user to implement multiple functions, such as multimedia complex applications. Of important value is the ability of the integrated circuit 20 and namely, the processor 22, to support future standards adopted by the industry, which are expected to be of greater complexity than that of today's standards.
  • Each of the blocks 74-80 can execute only one sequence (or stream) of programs at a given time. A program sequence refers to a function associated with a particular application. For example, an FFT is a type of sequence. However, different sequences may be dependent on one another. For example, an FFT program, once completed, may store its results in the memory 70 and the next sequence may then use the stored result. Different sequences sharing information in this manner, or being dependent upon each other in this manner, is referred to as “stream flow”.
  • In FIG. 2, the memories 70 and 72 each include 8 blocks of 16 kilobytes of memory; however, in other embodiments, different memory sizes may be utilized.
  • The instruction memories 82, 84, 86 and 88 are used to store instructions for execution by the blocks 74-80, respectively.
  • FIG. 3 shows, in conceptual form, an architecture 300 including software architecture 302 in combination with some of the hardware components 304 of the circuit 20 of FIG. 2. The hardware components 304 include the processor 32, the circuits 26 and 28, and the processor 22, as described and shown with respect to previous figures.
  • Included within the software architecture 302, a hardware abstraction layer or low level drivers 306 and an operating system driver 308 cause interfacing or communication between the hardware components 304 and the software architecture 302. The software architecture 302 is further shown to include a CoolBios (basic input output system) 310 coupled to the hardware components 304 and to a scenario 312, which is for causing multiple applications 314 to be executed, each application 314 including kernels 316 for execution of computationally intense functions, such as fast Fourier transforms (FFTs), DCTs, finite impulse response (FIR) filtering and others known in the industry. The software architecture 302 is further shown to include a system level software changes scenarios block 318, which is shown to communicate with an operating system interface (OSI) 322 and an operating system 320. The operating system 320 is further shown to communicate with the scenario 312, the applications 314, and the kernels 316. The kernels 316 are engines for execution of computationally intensive code, generally in assembly or low level code.
  • Each of the applications 314 includes many kernels, such as the kernels 316: conditional encoding (CE), cyclic redundancy coding (CRC), down sampling (DS), variable length coding (VLC), discrete cosine transform (DCT), motion estimation (ME), motion compensation (MC), etc., that consume most of the compute time in an application. The scenario-level software 312 contains hooks to quasi-statically change the execution pattern of applications contained within that scenario. The block 318 causes scenarios to be changed while running on the hardware 304. From a software perspective, each of the kernels 316 is written in assembly code for executing an FFT or other computationally intensive function, while the scenario 312 and each of the applications 314 are in a higher level language, such as “C”, for reasons that will become apparent shortly. For now, suffice it to say that the combination of assembly and a higher level language being executed on a CoolW or CoolN sub-processor and a control block included therein, as in the hardware architecture of FIG. 2, causes simultaneous or concurrent execution of applications, in a hierarchical manner and while maintaining low power consumption.
  • The CoolBios 310 includes a set of software functions that allow input and output communication with the processor 22 and eliminates the need for a full operating system running on the processor 22.
  • The hardware component 304 and software architecture 302 provide an environment to load and execute a multi-application scenario. A “scenario”, as referred to herein, is a set of applications, such as the applications 314, executing concurrently. Some examples of each of the applications 314, as shown in FIG. 3, include but are not limited to JPEG, MP3, H.264 and 802.11g. A scenario 312 interfaces with the operating system 320 and higher-level software through the OSI 322 and the drivers 308.
  • The software architecture 302 and the hardware components 304 of FIG. 3 allow an operating system (OS) to be loaded onto the processor 32 and the drivers 308 to ultimately allow a scenario 312 to be loaded for causing multiple applications to be executed concurrently.
  • The scenario 312 includes, in its header, overhead information to cause turning on or off each of the different applications 314. For example, the JPEG application can be turned off while the remaining applications, such as MP3, H.264 and 802.11g, remain on. This effectively aids in reducing power consumption, as the need for power is reduced when an application that is not currently being used is turned off. Remaining processing power, i.e. that which is not currently being used, may be devoted to executing a new application, with some limitations, as are now discussed.
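  • The on/off information carried in the scenario header can be pictured as one enable bit per application. The following C sketch is a hypothetical model only; the bit positions, names and structure are illustrative assumptions, not the actual scenario header format.

```c
#include <stdint.h>

/* Hypothetical scenario-header model: one enable bit per application.
 * All names and bit assignments are illustrative assumptions. */
enum {
    APP_JPEG   = 1u << 0,
    APP_MP3    = 1u << 1,
    APP_H264   = 1u << 2,
    APP_80211G = 1u << 3
};

typedef struct {
    uint32_t enabled;   /* bitmask of currently enabled applications */
} scenario_header;

/* Turn an application on or off without disturbing the others. */
static void app_set(scenario_header *s, uint32_t app, int on)
{
    if (on) s->enabled |= app;
    else    s->enabled &= ~app;
}

static int app_is_on(const scenario_header *s, uint32_t app)
{
    return (s->enabled & app) != 0;
}
```

Under this model, turning JPEG off while MP3, H.264 and 802.11g remain on is a single mask update, which matches the quasi-static (rather than fully dynamic) character of scenario changes described above.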
  • Essentially, there are three modes of operation within the software architecture 302. One is real-time mode, an example of which is 802.11g, which has hard time constraints. In this case, it is not feasible to add another application because a scenario 312 that includes an 802.11g application has compiled the latter, and in the presence of a pre-compiled application, a new application cannot be added. Generally, in the presence of applications having a timing constraint, a new application is not readily added, nor are scenarios readily changed dynamically, because doing so disturbs the processing balance; however, this is not an issue in mobile applications because scenarios are not readily changed in such applications.
  • The scenario 312 is pre-compiled and quasi-statically scheduled, which refers to turning applications on or off. The pre-compiled and scheduled scenario 312, which is in binary form, is then stored in one of the sub-processors, such as the sub-processor 74. Turning off an application prevents “choking” of the system; that is, bandwidth is improved.
  • The system level software changes scenarios block 318 causes changing of the scenario 312 which, as previously stated, may be done dynamically. The code in the block 318 is in “C” or another high level language. The scenario 312 is written in scenario descriptive language (SDL), which is a unique and proprietary language with all rights reserved by 3Plus1 Technology, Inc. of Saratoga, Calif.
  • On the right-hand side of FIG. 3, the hierarchical software/compiler characteristics of the architecture 300 are shown, in conceptual form, with reference to a software tools hierarchy column 340, a hierarchical level column 342 and a hardware hierarchical levels column 344. Each of the pieces of software of the software architecture 302 is taken through a different tool so as to avoid a flat methodology.
  • The drivers 306 and 308 are used as tools for the general purpose processor (GPP) 32 on the highest level of the tool column 340, while in the next level of the hierarchical tools, a scenario compiler 348 is used by an application programmer to allocate resources and to schedule execution on one or more particular sub-processors. The kernels 316 are then advantageously partitioned. An application is divided into smaller portions or threads, switching from one kernel to another.
  • The number of threads is limited to the number of sub-processors. Applications are handed from one kernel to another as follows: the kernel 316 that is currently operating finishes a particular function, saves the result of the function in shared memory and signals completion of its function, and then another kernel 316 utilizes the stored information in shared memory to perform another function. A synchronization code is used for this hand-off, which is done by the scenario 312; the particular tool is the scenario compiler 348 and the process is automated. Thus, synchronization and control code are generated automatically due to the presence of the threads.
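  • The hand-off pattern just described can be sketched in C as follows. This is a minimal illustrative model, not the generated synchronization code itself; the flag, buffer and function names are assumptions, and the shared array stands in for the shared data memories 70 and 72 of FIG. 2.

```c
/* Sketch of the kernel hand-off: a producer kernel finishes a function,
 * stores its result in shared memory, and signals completion; the next
 * kernel checks the trigger condition before consuming the result. */
static int shared_buf[4];      /* stands in for shared data memory  */
static volatile int done = 0;  /* completion flag set by producer   */

/* Producer kernel (e.g. an FFT sequence): compute, store, signal. */
static void kernel_a(void)
{
    for (int i = 0; i < 4; i++)
        shared_buf[i] = i * i;
    done = 1;                   /* signal completion of the function */
}

/* Consumer kernel: runs only when its trigger condition is met. */
static int kernel_b(void)
{
    if (!done)
        return -1;              /* trigger condition not yet satisfied */
    int sum = 0;
    for (int i = 0; i < 4; i++)
        sum += shared_buf[i];   /* consume the stored result */
    return sum;
}
```

In the actual system the trigger-condition check and the completion signaling are emitted automatically by the scenario compiler 348 rather than written by hand.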
  • In the next level of the tool hierarchy, as shown in the column 340, a controller/compiler 350 is used to compile the high level language being employed, such as “C”, and is complemented by two further parts, an optimizing assembler 352 and a low level assembler 354. The goal is to allow the programmer to write mostly C or high level code rather than assembly, as the former is easier. This is readily allowed for given the sub-processor and hierarchical architecture of the present invention. The compiler 350 is optimized for each sub-processor, such as CoolW or CoolN. That is, high level code written by a user or programmer is compiled, pursuant to certain rules, for storage and execution by a sub-processor and a control block located therein, as previously shown and discussed.
  • By changing scenarios, multiple applications can be performed; for example, digital camera and PDA functions can be performed in a single device simultaneously. The ability to do so comes at the cost of foregoing the dynamic ability to change or add a scenario, as might be done in a personal computer, but this limitation is completely tolerable, as a device that is to be used with a certain scenario need not normally be quickly reprogrammed to include another scenario in mobile handheld device applications.
  • By way of example, if a manufacturer introduces a product, such as a PDA, it is compiled along with other applications, such as a digital camera or MP3, etc., and a pre-compiled binary code is created using the hierarchical software tools compilation and the sub-processor-based hardware architecture of the present invention. Such pre-compiled code and multiple applications make up a scenario. While another scenario may be pre-compiled, it is a rare occurrence due to the reluctance of the manufacturer to quickly introduce another product. Given time, another product is likely to be introduced warranting another scenario, but the time to switch to another scenario is far from urgent.
  • In FIG. 3, column 342 states which part or component of the hardware is utilized for the corresponding tool of column 340. That is, viewed in a row, each location of the row within column 342 corresponds to a like location in column 340. Thus, the GPP IDE 346 is handled by the hardware 304, the scenario compiler 348 is handled by the processor 22, the compiler 350 is handled by a control block of one of the sub-processors, and the assemblers 352 and 354 are handled by one or more of the sub-processors, such as the sub-processor 74. The particular hardware hierarchical levels are correspondingly enumerated in column 344 by reference.
  • At the level of the low level assembler 354, scheduling is done and, additionally, all of the hardware components are available, whereas the optimizing assembler 352 includes more restrictions because it operates at a higher level, but it is able to schedule more. Area and power are saved by less scheduling. The hierarchical flow of column 340 and the hardware architecture of the processor of FIG. 2 allow for an efficient, low power and flexible processing tool. In the prior art, while the assembler 354 and the GPP IDE 346 are used, the remainder of the column 340 is not.
  • With continued reference to FIG. 3, hierarchical compilation involves partitioning the application code into a general purpose processor component that allows interaction between the processor 32 and all other hardware components. This general purpose processor component mainly allows switching between different scenarios.
  • Scenarios are compiled to run on a combination of multiple sub-processors that communicate through shared memory. The scenario compiler 348 is the tool that schedules the coarse grain data dependency graph wherein kernels and control code in one or more applications communicate with each other and with the controlling general-purpose processor. Dependencies are resolved to determine trigger conditions, based upon which synchronization code is generated to evaluate these conditions at run-time. The compiler 350 targets the subset of a sub-processor, or the control block located therein (such as the control block 90), that executes application control code and the scenario control and synchronization code.
  • The optimizing assembler 352 and the low-level assembler 354 target functions written in sub-processor assembly. They incorporate many scheduling techniques often found in higher level compilers such as register allocation and software pipelining. The binary software objects generated by these assemblers execute either on a CoolW or CoolN processor.
  • The scenario compiler 348 schedules the correct operation of the applications' functions and allocates data resources. The scenario compiler uses scheduling algorithms from the existing art to create the schedule. The scenario compiler 348 emits the source code (a compilable program, written in C) that implements the scenario scheduler. The scenario scheduler implements, in software, the schedule chosen by the scenario compiler. That is, it manages application resources (data—placed into shared and external memory—and functions) that are partitioned among the multiple processor cores contained within the target device. The scheduler ensures the correct sequencing and synchronization of functions and data that are in use within each of the multiple processors. The scenario compiler 348 also generates code to correctly access peripherals and DMA controllers as referenced by SDL-specific language features based on information about the target heterogeneous multi-processor(s) provided to the scenario compiler 348.
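  • To make the scenario scheduler concrete, the following sketch shows the kind of compilable C source the scenario compiler 348 might emit: a static table of scheduled functions, each guarded by a trigger condition evaluated at run time, visited in the order chosen by the compiler. All names here are illustrative assumptions, not the actual emitted code.

```c
#include <stddef.h>

/* One scheduled entry: a trigger condition and the function it guards. */
typedef struct {
    int  (*ready)(void);   /* trigger condition evaluated at run time */
    void (*run)(void);     /* scheduled kernel or control code        */
} sched_entry;

/* Illustrative entries for testing the scheduler loop. */
static int n_ran = 0;
static int always_ready(void) { return 1; }
static void count_run(void)   { n_ran++; }

/* One pass of the scheduler: visit entries in the statically chosen
 * order and run each whose trigger condition is satisfied. Returns
 * the number of functions that ran this pass. */
static int scheduler_step(const sched_entry *tab, size_t n)
{
    int ran = 0;
    for (size_t i = 0; i < n; i++) {
        if (tab[i].ready()) {
            tab[i].run();
            ran++;
        }
    }
    return ran;
}
```

The real emitted scheduler additionally manages data placement in shared and external memory and access to peripherals and DMA controllers, per the SDL description; the loop above only illustrates the sequencing-by-trigger-condition idea.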
  • The SDL provides a collection of functionality used in the present invention. The Scenario Description Language (SDL) is a language created for producing high-level, abstract descriptions of scenarios and the applications contained within them. SDL is compact, human-readable, and scalable. SDL provides language syntax and semantics to describe: the flow of data into and out of the sub-processors and between functions executing on the sub-processor; the amount of storage required to stream data through the applications executing on the sub-processor; the priority of each application, to facilitate the creation of a functionally correct schedule that satisfies latency requirements; the amount of data (and its type) produced and consumed by each function; the maximum (worst case) execution time of each function, which is used in the creation of the schedule; and the placement of each function onto W- or N-type sub-processors.
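  • The SDL syntax itself is proprietary and not reproduced here, but the per-function information it carries (enumerated above) can be modeled as plain data. The following C struct is an illustrative assumption only; its field names and the sample values are hypothetical.

```c
/* Hypothetical model of the per-function information an SDL
 * description carries. Field names and values are illustrative. */
typedef enum { SUB_W, SUB_N } sub_type;   /* W- or N-type placement */

typedef struct {
    const char *name;       /* function (kernel) name                 */
    int bytes_in;           /* amount of data consumed per invocation */
    int bytes_out;          /* amount of data produced per invocation */
    int wcet_cycles;        /* maximum (worst case) execution time    */
    int priority;           /* used when building the schedule        */
    sub_type placement;     /* target sub-processor type              */
} sdl_function;

/* Hypothetical entry: a 1024-point FFT placed on a W-type
 * sub-processor, with assumed data sizes and worst-case timing. */
static const sdl_function fft = {
    "fft1024", 4096, 4096, 25000, 1, SUB_W
};
```

The scenario compiler would consume a set of such records, together with the inter-function data-flow description, when constructing a schedule that satisfies the stated latency requirements.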
  • FIG. 4 shows, in conceptual form, the process of hierarchical software compilation, in block flow form, in accordance with a method of the present invention. In FIG. 4, the blocks that are shown shaded, such as blocks 402, 412, 420, 422, 424, 428, 430-438 and 440, are sub-processor based and need to be executed thereby, whereas the remaining blocks, having no shading, are independent of the sub-processors.
  • Generally, FIG. 4 shows the way in which software is compiled for use by the processor 22, and shows further details of some of the blocks of FIG. 3. Specifically, the scenario compiler 348 of FIG. 3 is the scenario compiler block 418 of FIG. 4, and the scenario description block 416, the adjust scenario description and optional optimization block 408, and the processor-specific data block 422 serve as support for the scenario compiler block 418. The controller/compiler block 424 is the same as the compiler 350 of FIG. 3. The optimizing assembler 432, which receives the assembly code block 430, is the same as the optimizing assembler 352 of FIG. 3, and the low level assembler and linker block 434 of FIG. 4 is the same as the low level assembler 354 of FIG. 3. Thus, the description of these blocks will not be repeated.
  • The block 418 allows the programmer to meet his/her design goals without having to optimize either the high level code or the assembly code. Having the SDL allows for allocating a function from one block to another block at a high level. The block 416 serves as a street map. The adjust partitioning and kernels block of FIG. 21 of the “PROGRAMMABLE PROCESSOR ARCHITECTURE” patent application, incorporated herein by reference, is the same as the block 408.
  • The scenario description block 416 serves as input to the scenario compiler block 418, as does the block 422. The output of the block 418 serves as input to the block 420, and the block 408 serves as input to the block 416. The block 416 describes inter-dependencies between the kernels 316 and applications 314 of FIG. 3. The SDL is used by the block 416. The optimizing block 410 is used to optimize high level code and assembly code. Thus, the block 410 provides input to the existing assembly code block 412 and the existing high level code block 414. The scenario compiler block 418 receives two sets of information, one being a bottom-up set of information and the other a top-down set of information. An example of the former is the kernels 316, i.e. FFTs, DCTs, etc., which are provided by the block 422 to the block 418, optionally assembled with power information during assembly. An example of the latter is provided by the block 416, which is programmed in SDL to serve as control code and to define the inter-dependencies of the kernels and the requirements of the application. An example of the requirements of the application is the length of time that can be used for processing a frame of information, or other time-related requirements.
  • The block 418, once provided with the foregoing top-down and bottom-up information, performs a best match process in the form of a schedule. The schedule, for example, provides information regarding the inter-dependencies of the sub-processors, execution of which requires synchronization code for the control circuit of a sub-processor. The schedule information and synchronization information are provided by the block 420, which receives input from the block 418. The output of the block 420 is provided as input to the block 424. Having the block 420 receive its input from the block 418 is generally not done by prior art techniques, due to their design/hardware limitations. That is, the hardware architecture based on sub-processors, as shown in the previous figures and the referenced patent document, allows for scheduling and synchronization after the block 418 performs its operation. This allows for the control circuit of each sub-processor to be the same as the others and for the code to be transportable.
  • The non-native compilation and simulation block 428 is for compiling in the absence of a processor; that is, during development, while the hardware is not yet ready, compilation is performed in a “non-native” environment, whereas the native simulation block 440 operates in the native environment. The block 428 allows for both assembly and high level code compilation while the native compiler, i.e. the actual compiler to be ultimately employed, is not yet ready. Thus, an off-the-shelf, i.e. non-native, compiler may be employed and combined with assembly code for simulation. This is sub-processor specific. The time consumed for executing the kernels 316 and the time consumed for executing control code compete with each other.
  • In FIG. 4, the output of the existing assembly functions block 402 serves as input to the existing assembly code block 412, which also receives input from the block 410. The output of the block 412 serves as input to the block 432, which also receives input from the block 430. The output of the block 432 serves as input to the block 434, and the block 434 serves as input to the block 436, which provides input to the native simulation block 440 and the implementation complete block 438. The block 440 provides input to a decision block 442, which determines whether or not the design goals have been met; if so, the block 438 is performed and, if not, either the code is optimized by the block 410 or the scenario description is adjusted by the block 408. The outputs of the blocks 412, 414 and 420 are all provided to the block 428. That is, assembly code, high level code and schedule and synchronization code are all provided to the block 428 for execution or simulation. The output of the block 428 is provided to a decision block 426 for determining whether or not the design goals are met; if so, the process is exited, otherwise, the scenario may be adjusted by the block 408. The output of the block 414 is provided to the block 424 for compilation thereof and the output of the latter is provided to the block 430.
  • Optimization is done on a partition basis. That is, high level code is optimized separately from assembly code and from SDL. For example, assembly code is optimized by the block 432, high level code is optimized by the block 410 and SDL is optimized by the block 416. This is a divide-and-conquer approach allowing advantageous optimization of each type of code that is not attainable without such a division of code. The block 424 receives high level code and compiles the same but outputs assembly code to the block 430, which is optimized by the block 432. The output of the block 432 is provided to the block 434 for creation of still further low level code and the output of the block 434 is provided to the block 436 for generation of binary object code to be used by a sub-processor. The assembly code that is written by the programmer is provided from the block 412 to the block 432 for assembling.
  • FIG. 5 shows, in conceptual form, the process of and apparatus for latency verification used in assembly coding and included within the block 434 of FIG. 4 in accordance with a method and apparatus of the present invention. In FIG. 5, instructions 502 are shown to be provided to the block 434, which will also be referred to as the latency verification block. The current instruction is referred to as “instruction n”, the instruction previous to the latter is referred to as “instruction n−1”, the instruction previous to that is referred to as “instruction n−2”, the one previous to that as “instruction n−3” and so on. That is, an instruction n−1 is delayed by a program cycle from an instruction n and so on.
  • The block 434 performs various functions, shown in FIG. 5 as blocks or diamond shapes. Such functions include identifying, at 508, all instructions directly contributing to instruction n, which result is then used at 506 to determine the latency rules relevant to the instructions, after which the process reaches a decision block at 510 for identifying the earliest instruction. If the earliest instruction is identifiable, the process continues to 514, where the latency rules are checked against the register value latencies, and if any error results, the process continues to 516, at which time an error is reported. If no error is detected at 514, the next instruction is processed at 518. The process of FIG. 5 is done during compilation and serves as a check for the programmer.
  • At 506, a database of rules is used to determine what the actual latencies are. At 510, a determination is made as to whether the earliest instruction is identifiable, because previous instructions are not necessarily known. For example, the first instruction of a sub-routine is one whose previous instruction is not necessarily known. At 512, worst case possibilities are determined. At 514, the latency rules are checked against the register value latencies. A latency basically refers to a delayed or previous instruction. That is, the programmer's annotation is compared to the latency rules and, if there is a mismatch, an error is reported at 516. An example of the programmer's annotation is discussed hereinbelow.
  • A computer processor implements strict read-after-write behavior for a register when an instruction that reads the register always obtains the value written by the most recent previously executed instruction that writes the register. To achieve strict read-after-write behavior for a register, it is sometimes necessary to delay the execution of an instruction that reads the register by one or more clock cycles from when it would otherwise execute. An instruction so delayed is said to be stalled for the one or more clock cycles of delay. The advantages of implementing strict read-after-write behavior for all registers are:
      • (1) The same sequence of instructions can execute correctly on a wider range of processor implementations, and
      • (2) Assembly language programming is made easier.
  • For some processors, particularly those designed for high-performance digital signal processing and related embedded systems, the complexity of the processor has been reduced by not implementing strict read-after-write behavior for all registers. For such a processor, when an instruction reads a register it may not obtain the value written by the most recently executed instruction to write the register, but instead an older value of the register. Although the specific behavior is always deterministic and can be documented as a set of latency rules, for some processors these rules are quite complex taken together. For processors of this kind, unfortunately, assembly language programmers have invariably been burdened with observing the latency rules entirely on their own without any automated verification from the programming tools that they are applying the rules correctly. When the latency rules are complex, programmers naturally make avoidable errors that may not be discovered until program testing.
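The consequence of dropping strict read-after-write behavior can be made concrete with a toy model (this model, including the two-cycle `LATENCY` constant, is an illustrative assumption, not a description of any particular processor in the text):

```python
# Illustrative model of a register file WITHOUT strict read-after-write
# behavior: a write becomes architecturally visible only LATENCY cycles
# after it issues, so a read in between observes an older value.

LATENCY = 2  # hypothetical write-to-read latency in cycles

class LatentRegisterFile:
    def __init__(self):
        self.committed = {}   # values visible to reads
        self.in_flight = []   # (cycle_due, reg, value) writes not yet visible
        self.cycle = 0

    def tick(self):
        """Advance one clock cycle and commit any writes that are now due."""
        self.cycle += 1
        pending = []
        for due, reg, val in self.in_flight:
            if due <= self.cycle:
                self.committed[reg] = val
            else:
                pending.append((due, reg, val))
        self.in_flight = pending

    def write(self, reg, val):
        self.in_flight.append((self.cycle + LATENCY, reg, val))

    def read(self, reg):
        return self.committed.get(reg, 0)

rf = LatentRegisterFile()
rf.write("r2", 7)       # issue a write of 7 to r2
rf.tick()
early = rf.read("r2")   # too soon: the write is not yet visible, old value read
rf.tick()
late = rf.read("r2")    # after LATENCY cycles the new value is visible
```

The deterministic-but-surprising `early` value is exactly the kind of behavior the latency rules document, and the kind of thing a programmer without tool support must track by hand.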
  • Latency Verification:
  • In FIG. 5, a process is described by which an assembler or other programming tool can verify that a processor's latency rules are being applied correctly by the programmer. First, an assembly language program is annotated by the programmer as follows:
  • For each register read by each instruction, a syntactic annotation is made in the program text to indicate which previous register value the programmer expects the instruction to obtain for the register. The lack of an annotation is either an error or indicates a default assumption. For example, the default assumption could be that the expected value obtained for a register is the value written by the most recent previously executed instruction that writes the register (i.e., the same as strict read-after-write behavior). Whenever the programmer expects a value different from the default assumption, an annotation is required. For example, if the programmer expects the value obtained for a register to be the value written by the n-th earlier instruction that writes the register (n>1), the annotation could be that a distinctive character be repeated n−1 times adjacent to the register denotation in the instruction. If, for example, this distinctive character is the dollar sign ($), then the assembly language instruction
      • add r1, $$r2, $r3
        would indicate that the programmer expects the value obtained for register r3 to be the value written by the second previous instruction to write r3, and the value obtained for register r2 to be the value written by the third previous instruction to write r2. In the above example, the current value of register r1, the value of register r2 from two values ago and the previous value of register r3 are being added. The assembler or block 434 checks to ensure that all of these values are available by performing the process of FIG. 5. It should be noted that the annotation need not be a dollar sign; rather, it can be any notation.
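Parsing this annotation is straightforward; the following sketch mirrors the dollar-sign convention described above (the function name and the assumption that registers are written `r` followed by digits are illustrative):

```python
import re

# Sketch of parsing the dollar-sign annotation: n-1 repeated '$' characters
# before a register name mean the programmer expects the value written by the
# n-th earlier writing instruction; no '$' is the default n=1 (most recent).

def parse_operand(token):
    """Return (register, n), where n=1 denotes the most recent write."""
    m = re.fullmatch(r"(\$*)(r\d+)", token)
    if m is None:
        raise ValueError(f"bad operand: {token}")
    dollars, reg = m.groups()
    return reg, len(dollars) + 1   # '$' -> n=2, '$$' -> n=3, ...

# The instruction "add r1, $$r2, $r3" from the text:
operands = [parse_operand(t) for t in ["r1", "$$r2", "$r3"]]
# -> [('r1', 1), ('r2', 3), ('r3', 2)]
```

The parsed `n` for each operand is what gets compared, in the process of FIG. 5, against the value the latency rules say the hardware will actually deliver.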
  • Given these annotations, for each instruction, the assembler or other programming tool automatically determines whether the programmer's expectations are correct, by examining the sequences of instructions that can execute previous to the given instruction along all paths leading to the given instruction, and applying the documented latency rules to these sequences. FIG. 5 shows one possible manifestation of this process as part of a modified assembler. Block 502 has an instruction n together with a sequence of earlier instructions, as previously discussed above. Accordingly, and given the hooks placed into the compiler and assembler of the present invention, programming is made simpler and more programmer-friendly, with or without reducing power.
  • Block 434 determines whether the latency annotations are correct for instruction n for this path, while block 520 performs the other usual functions of an assembler for instruction n. In block 508, the earlier instructions that contribute to the inputs of instruction n are identified. Block 506 determines, from the complete set of latency rules, those rules that are relevant to the interaction between each earlier instruction that contributes to the inputs of instruction n and instruction n itself. Where the instructions that may precede instruction n are unknown (for example, at the entrance to a subroutine), worst-case assumptions must be made (blocks 510 and 512). Finally, wherever the annotation made by the programmer does not match the behavior of the actual processor, as determined by the latency rules, an error is reported (blocks 514 and 516).
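A compressed sketch of the check in blocks 506 through 516 follows. It assumes a single hypothetical latency rule (a write is invisible to reads for `WRITE_LATENCY` cycles) standing in for the processor's full rule set, and an instruction encoding invented for the example:

```python
# Sketch of the annotation check of FIG. 5 under one assumed latency rule:
# for each annotated read, compute which earlier write the hardware actually
# delivers, then compare against the programmer's annotation n.

WRITE_LATENCY = 2  # assumed rule: a write is not visible on the next cycle

def verify(program):
    """program: list of (dest_reg_or_None, [(src_reg, expected_n), ...]),
    one entry per cycle. Returns error strings (empty if all annotations hold)."""
    writes = {}   # reg -> cycles at which it was written
    errors = []
    for cycle, (dest, sources) in enumerate(program):
        for reg, expected_n in sources:
            history = writes.get(reg, [])
            # Writes too recent to be visible are skipped over by the hardware.
            visible = [c for c in history if cycle - c >= WRITE_LATENCY]
            hidden = len(history) - len(visible)
            actual_n = hidden + 1          # which earlier value is obtained
            if history and actual_n != expected_n:
                errors.append(
                    f"cycle {cycle}: {reg} expects value n={expected_n}, "
                    f"hardware delivers n={actual_n}")
        if dest:
            writes.setdefault(dest, []).append(cycle)
    return errors

# r2 written at cycle 0, read at cycle 1 with the default annotation (n=1):
# the write is not yet visible, so the hardware delivers an older value (n=2).
errs = verify([("r2", []), (None, [("r2", 1)])])
```

A real implementation would additionally walk all control-flow paths into each instruction and fall back to worst-case assumptions at subroutine entries, as blocks 510 and 512 require; this sketch handles only a single straight-line sequence.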
  • Although the present invention has been described in terms of specific embodiments, it is anticipated that alterations and modifications thereof will no doubt become apparent to those skilled in the art. It is therefore intended that the following claims be interpreted as covering all such alterations and modifications as fall within the true spirit and scope of the invention.

Claims (14)

1. A software architecture for execution on a heterogeneous, high-performance, scalable processor having at least one W-type sub-processor capable of processing W bits, or more, in parallel, W being an integer value, and having at least one N-type sub-processor capable of processing N bits in parallel, N being an integer value and smaller than W, the software architecture comprising:
a scenario compiler for pre-compiling a scenario to create a binary code based on assembly code and high level language and scenario description language code, the scenario compiler including a plurality of applications, each application including one or more kernels, the scenario compiler pre-compiling the scenario for efficient execution thereof by a plurality of sub-processors, each sub-processor including a control circuit including high level code for execution thereof, the control circuit being a high level language programmable controller for the sub-processor,
wherein a hierarchical compilation of different types of programming codes allows for efficient binary code creation while reducing power consumption when the binary code is executed by the sub-processors.
2. A software architecture, as recited in claim 1, further including a schedule and synchronization block communicating with the scenario compiler and for generating code, based on scenario description language (SDL) to operate with one or more of the sub-processors.
3. A software architecture, as recited in claim 2, further including a high level language compiler block receiving input from the synchronization block for compiling high level code.
4. A software architecture, as recited in claim 3, further including an assembler block coupled to receive information from the high level language compiler block and from an assembly code block, which provides assembly code written by a user, the assembler block for assembling the assembly code and the information received from the high level language compiler block.
5. A software architecture, as recited in claim 4, further including a binary code block for generating binary code based on assembly code, high level code and SDL.
6. A software architecture, as recited in claim 5, further including a scenario description and optional optimization block coupled to the scenario description block and upon the generation of binary code, a user's design goals are verified and if the design goals are not met, the scenario description and optional optimization block modifies the scenario.
7. A software architecture, as recited in claim 6, wherein the sub-processors each include applications having kernels, the kernels being engines for execution of computationally intensive code.
8. A software architecture, as recited in claim 7, further including a scenario description block coupled to the scenario compiler block for generating SDL for describing inter-dependencies between the kernels.
9. A software architecture, as recited in claim 8, further including a low-level assembler and linker block coupled to the optimizing assembler block for assembling the lowest-level code.
10. A software architecture, as recited in claim 9, wherein the low-level assembler and linker block further includes a latency verification block responsive to an N number of previous instructions and a current instruction for verifying the presence of N number of previous instructions used by a user for instructions requiring previous instructions.
11. A software architecture, as recited in claim 10, wherein the latency verification block for verifying the user's instruction, which includes use of previous instructions, against latency rules.
12. A software architecture, as recited in claim 11, further including shared memory coupled to the sub-processors wherein the kernel of one of the sub-processors hands off to another sub-processor by placing, in the shared memory, information to be used by the another sub-processor.
13. A method of generating and executing code on a heterogeneous, high-performance, scalable processor having at least one W-type sub-processor capable of processing W bits, or more, in parallel, W being an integer value, and having at least one N-type sub-processor capable of processing N bits in parallel, N being an integer value and smaller than W, the method comprising:
pre-compiling a scenario to create a binary code based on assembly code and high level language and scenario description language code;
generating efficient binary code to be executed by the sub-processors based on applications including kernels, the kernels for executing computationally intensive code, the execution of the binary code by the sub-processors causing reduction of power consumption and flexible coding options to a user.
14. A method of generating and executing code, as recited in claim 13, further including performing latency verification to prevent a user from using erroneous previous instructions.
US11/195,429 2004-08-02 2005-08-02 Programmable processor architecture hirarchical compilation Abandoned US20060026578A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US11/195,429 US20060026578A1 (en) 2004-08-02 2005-08-02 Programmable processor architecture hirarchical compilation

Applications Claiming Priority (3)

Application Number Priority Date Filing Date Title
US59841704P 2004-08-02 2004-08-02
US11/180,068 US7721069B2 (en) 2004-07-13 2005-07-12 Low power, high performance, heterogeneous, scalable processor architecture
US11/195,429 US20060026578A1 (en) 2004-08-02 2005-08-02 Programmable processor architecture hirarchical compilation

Related Parent Applications (1)

Application Number Title Priority Date Filing Date
US11/180,068 Continuation-In-Part US7721069B2 (en) 2004-07-13 2005-07-12 Low power, high performance, heterogeneous, scalable processor architecture

Publications (1)

Publication Number Publication Date
US20060026578A1 true US20060026578A1 (en) 2006-02-02

Family

ID=35733871

Family Applications (5)

Application Number Title Priority Date Filing Date
US11/180,068 Active 2025-09-28 US7721069B2 (en) 2004-07-13 2005-07-12 Low power, high performance, heterogeneous, scalable processor architecture
US11/195,429 Abandoned US20060026578A1 (en) 2004-08-02 2005-08-02 Programmable processor architecture hirarchical compilation
US11/848,023 Abandoned US20070294511A1 (en) 2004-08-02 2007-08-30 Programmable Processor Architecture
US12/782,431 Active 2025-10-22 US8386751B2 (en) 2004-07-13 2010-05-18 Low power, high performance, heterogeneous, scalable processor architecture
US13/775,402 Active US8880850B2 (en) 2004-07-13 2013-02-25 Low power, high performance, heterogeneous, scalable processor architecture

Family Applications Before (1)

Application Number Title Priority Date Filing Date
US11/180,068 Active 2025-09-28 US7721069B2 (en) 2004-07-13 2005-07-12 Low power, high performance, heterogeneous, scalable processor architecture

Family Applications After (3)

Application Number Title Priority Date Filing Date
US11/848,023 Abandoned US20070294511A1 (en) 2004-08-02 2007-08-30 Programmable Processor Architecture
US12/782,431 Active 2025-10-22 US8386751B2 (en) 2004-07-13 2010-05-18 Low power, high performance, heterogeneous, scalable processor architecture
US13/775,402 Active US8880850B2 (en) 2004-07-13 2013-02-25 Low power, high performance, heterogeneous, scalable processor architecture

Country Status (2)

Country Link
US (5) US7721069B2 (en)
WO (1) WO2006017482A2 (en)

Cited By (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20070011199A1 (en) * 2005-06-20 2007-01-11 Microsoft Corporation Secure and Stable Hosting of Third-Party Extensions to Web Services
US20080082786A1 (en) * 2006-10-02 2008-04-03 William Stuart Lovell Super-scalable, continuous flow instant logic™ binary circuitry actively structured by code-generated pass transistor interconnects
US20080215768A1 (en) * 2006-10-24 2008-09-04 Alastair David Reid Variable coherency support when mapping a computer program to a data processing apparatus
US20080244507A1 (en) * 2007-03-30 2008-10-02 Microsoft Corporation Homogeneous Programming For Heterogeneous Multiprocessor Systems
US20100042751A1 (en) * 2007-11-09 2010-02-18 Kouichi Ishino Data transfer control device, data transfer device, data transfer control method, and semiconductor integrated circuit using reconfigured circuit
US20100162185A1 (en) * 2005-08-12 2010-06-24 University Of Sussex Electronic circuit design
US20100211760A1 (en) * 2009-02-18 2010-08-19 Egger Bernhard Apparatus and method for providing instruction for heterogeneous processor
US20130036408A1 (en) * 2011-08-02 2013-02-07 International Business Machines Corporation Technique for compiling and running high-level programs on heterogeneous computers
US20130185704A1 (en) * 2012-01-18 2013-07-18 International Business Machines Corporation Providing performance tuned versions of compiled code to a cpu in a system of heterogeneous cores
US8582654B1 (en) * 2009-09-30 2013-11-12 Hewlett-Packard Development Company, L.P. Generating a deblocked version of video frames using motion estimation
US8789063B2 (en) 2007-03-30 2014-07-22 Microsoft Corporation Master and subordinate operating system kernels for heterogeneous multiprocessor systems
US9652817B2 (en) 2015-03-12 2017-05-16 Samsung Electronics Co., Ltd. Automated compute kernel fusion, resizing, and interleave
US11755382B2 (en) * 2017-11-03 2023-09-12 Coherent Logix, Incorporated Programming flow for multi-processor system

Families Citing this family (71)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9330060B1 (en) 2003-04-15 2016-05-03 Nvidia Corporation Method and device for encoding and decoding video image data
US8660182B2 (en) * 2003-06-09 2014-02-25 Nvidia Corporation MPEG motion estimation based on dual start points
US7721069B2 (en) * 2004-07-13 2010-05-18 3Plus1 Technology, Inc Low power, high performance, heterogeneous, scalable processor architecture
US8731071B1 (en) * 2005-12-15 2014-05-20 Nvidia Corporation System for performing finite input response (FIR) filtering in motion estimation
US8724702B1 (en) 2006-03-29 2014-05-13 Nvidia Corporation Methods and systems for motion estimation used in video coding
US8660380B2 (en) 2006-08-25 2014-02-25 Nvidia Corporation Method and system for performing two-dimensional transform on data value array with reduced power consumption
US7463465B2 (en) * 2006-12-28 2008-12-09 General Electric Company Series arc fault current interrupters and methods
US8756482B2 (en) 2007-05-25 2014-06-17 Nvidia Corporation Efficient encoding/decoding of a sequence of data frames
US8726241B1 (en) * 2007-06-06 2014-05-13 Rockwell Collins, Inc. Method and system for the development of high-assurance computing elements
US9118927B2 (en) 2007-06-13 2015-08-25 Nvidia Corporation Sub-pixel interpolation and its application in motion compensated encoding of a video signal
KR101445794B1 (en) * 2007-06-18 2014-11-03 삼성전자주식회사 Programmable device for software defined radio terminal
US8873625B2 (en) 2007-07-18 2014-10-28 Nvidia Corporation Enhanced compression in representing non-frame-edge blocks of image frames
US8065506B2 (en) * 2007-08-17 2011-11-22 Texas Instruments Incorporated Application specific instruction set processor for digital radio processor receiving chain signal processing
US9081901B2 (en) * 2007-10-31 2015-07-14 Raytheon Company Means of control for reconfigurable computers
US20090171603A1 (en) * 2007-12-28 2009-07-02 Sriram Changali Methods of detecting series arcs in electrical signals
US8054591B2 (en) * 2008-07-24 2011-11-08 General Electric Company Arc detection using discrete wavelet transforms
WO2010016169A1 (en) * 2008-08-07 2010-02-11 日本電気株式会社 Multiprocessor system and method for controlling the same
US8755515B1 (en) 2008-09-29 2014-06-17 Wai Wu Parallel signal processing system and method
US8666181B2 (en) 2008-12-10 2014-03-04 Nvidia Corporation Adaptive multiple engine image motion detection system and method
US8159793B2 (en) * 2008-12-22 2012-04-17 General Electric Company Arc detection using detailed and approximate coefficients from discrete wavelet transforms
US8170816B2 (en) 2008-12-29 2012-05-01 General Electric Company Parallel arc detection using discrete wavelet transforms
CN102804165A (en) * 2009-02-11 2012-11-28 四次方有限公司 Front end processor with extendable data path
US8839214B2 (en) * 2010-06-30 2014-09-16 Microsoft Corporation Indexable type transformations
US20120084539A1 (en) * 2010-09-29 2012-04-05 Nyland Lars S Method and sytem for predicate-controlled multi-function instructions
WO2012052774A2 (en) * 2010-10-21 2012-04-26 Bluwireless Technology Limited Data processing units
US8683178B2 (en) * 2011-01-21 2014-03-25 Apple Inc. Sharing a fault-status register when processing vector instructions
US8533512B2 (en) * 2011-02-10 2013-09-10 International Business Machines Corporation Dynamic power and performance calibration of data processing systems
US9569771B2 (en) 2011-04-29 2017-02-14 Stephen Lesavich Method and system for storage and retrieval of blockchain blocks using galois fields
US9361479B2 (en) 2011-04-29 2016-06-07 Stephen Lesavich Method and system for electronic content storage and retrieval using Galois fields and geometric shapes on cloud computing networks
US9137250B2 (en) 2011-04-29 2015-09-15 Stephen Lesavich Method and system for electronic content storage and retrieval using galois fields and information entropy on cloud computing networks
US9037564B2 (en) 2011-04-29 2015-05-19 Stephen Lesavich Method and system for electronic content storage and retrieval with galois fields on cloud computing networks
KR101918464B1 (en) * 2011-09-14 2018-11-15 삼성전자 주식회사 A processor and a swizzle pattern providing apparatus based on a swizzled virtual register
US8959419B1 (en) * 2011-12-20 2015-02-17 Marvell International Ltd. Viterbi architecture for BDR/BLE
CN107220029B (en) 2011-12-23 2020-10-27 英特尔公司 Apparatus and method for mask permute instruction
WO2013095637A1 (en) 2011-12-23 2013-06-27 Intel Corporation Apparatus and method of improved permute instructions
WO2013095630A1 (en) 2011-12-23 2013-06-27 Intel Corporation Apparatus and method of improved extract instructions background
CN111831335A (en) 2011-12-23 2020-10-27 英特尔公司 Apparatus and method for improved insertion of instructions
US9946540B2 (en) 2011-12-23 2018-04-17 Intel Corporation Apparatus and method of improved permute instructions with multiple granularities
US8762808B2 (en) 2012-02-22 2014-06-24 Lsi Corporation Multi-processing architecture for an LTE turbo decoder (TD)
US20130311753A1 (en) * 2012-05-19 2013-11-21 Venu Kandadai Method and device (universal multifunction accelerator) for accelerating computations by parallel computations of middle stratum operations
US9292298B2 (en) 2013-07-08 2016-03-22 Arm Limited Data processing apparatus having SIMD processing circuitry
US9280845B2 (en) * 2013-12-27 2016-03-08 Qualcomm Incorporated Optimized multi-pass rendering on tiled base architectures
US9672043B2 (en) 2014-05-12 2017-06-06 International Business Machines Corporation Processing of multiple instruction streams in a parallel slice processor
US9665372B2 (en) 2014-05-12 2017-05-30 International Business Machines Corporation Parallel slice processor with dynamic instruction stream mapping
US9760375B2 (en) 2014-09-09 2017-09-12 International Business Machines Corporation Register files for storing data operated on by instructions of multiple widths
US9720696B2 (en) 2014-09-30 2017-08-01 International Business Machines Corporation Independent mapping of threads
US9977678B2 (en) 2015-01-12 2018-05-22 International Business Machines Corporation Reconfigurable parallel execution and load-store slice processor
US10133576B2 (en) 2015-01-13 2018-11-20 International Business Machines Corporation Parallel slice processor having a recirculating load-store queue for fast deallocation of issue queue entries
US10133581B2 (en) 2015-01-13 2018-11-20 International Business Machines Corporation Linkable issue queue parallel execution slice for a processor
US9983875B2 (en) 2016-03-04 2018-05-29 International Business Machines Corporation Operation of a multi-slice processor preventing early dependent instruction wakeup
US10037211B2 (en) 2016-03-22 2018-07-31 International Business Machines Corporation Operation of a multi-slice processor with an expanded merge fetching queue
US10346174B2 (en) 2016-03-24 2019-07-09 International Business Machines Corporation Operation of a multi-slice processor with dynamic canceling of partial loads
US10761854B2 (en) 2016-04-19 2020-09-01 International Business Machines Corporation Preventing hazard flushes in an instruction sequencing unit of a multi-slice processor
US10037229B2 (en) 2016-05-11 2018-07-31 International Business Machines Corporation Operation of a multi-slice processor implementing a load/store unit maintaining rejected instructions
US9934033B2 (en) 2016-06-13 2018-04-03 International Business Machines Corporation Operation of a multi-slice processor implementing simultaneous two-target loads and stores
US10042647B2 (en) 2016-06-27 2018-08-07 International Business Machines Corporation Managing a divided load reorder queue
US10318419B2 (en) 2016-08-08 2019-06-11 International Business Machines Corporation Flush avoidance in a load store unit
CN106681767B (en) * 2016-12-29 2020-07-10 广州华多网络科技有限公司 Light application adding method and device
US10387298B2 (en) 2017-04-04 2019-08-20 Hailo Technologies Ltd Artificial neural network incorporating emphasis and focus techniques
US11544545B2 (en) 2017-04-04 2023-01-03 Hailo Technologies Ltd. Structured activation based sparsity in an artificial neural network
US11238334B2 (en) 2017-04-04 2022-02-01 Hailo Technologies Ltd. System and method of input alignment for efficient vector operations in an artificial neural network
US11615297B2 (en) 2017-04-04 2023-03-28 Hailo Technologies Ltd. Structured weight based sparsity in an artificial neural network compiler
US11551028B2 (en) 2017-04-04 2023-01-10 Hailo Technologies Ltd. Structured weight based sparsity in an artificial neural network
WO2019039113A1 (en) * 2017-08-24 2019-02-28 ソニーセミコンダクタソリューションズ株式会社 Processor and information processing system
US20210173662A1 (en) * 2019-12-05 2021-06-10 International Business Machines Corporation Processor unit for multiply and accumulate operations
US11811421B2 (en) 2020-09-29 2023-11-07 Hailo Technologies Ltd. Weights safety mechanism in an artificial neural network processor
US11263077B1 (en) 2020-09-29 2022-03-01 Hailo Technologies Ltd. Neural network intermediate results safety mechanism in an artificial neural network processor
US11237894B1 (en) 2020-09-29 2022-02-01 Hailo Technologies Ltd. Layer control unit instruction addressing safety mechanism in an artificial neural network processor
US11874900B2 (en) 2020-09-29 2024-01-16 Hailo Technologies Ltd. Cluster interlayer safety mechanism in an artificial neural network processor
US11221929B1 (en) 2020-09-29 2022-01-11 Hailo Technologies Ltd. Data stream fault detection mechanism in an artificial neural network processor
US20230176863A1 (en) * 2021-12-03 2023-06-08 Taiwan Semiconductor Manufacturing Company, Ltd. Memory interface

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5878085A (en) * 1997-08-15 1999-03-02 Sicom, Inc. Trellis coded modulation communications using pilot bits to resolve phase ambiguities
US5909559A (en) * 1997-04-04 1999-06-01 Texas Instruments Incorporated Bus bridge device including data bus of first width for a first processor, memory controller, arbiter circuit and second processor having a different second data width
US6166748A (en) * 1995-11-22 2000-12-26 Nintendo Co., Ltd. Interface for a high performance low cost video game system with coprocessor providing high speed efficient 3D graphics and digital audio signal processing
US20020108104A1 (en) * 2000-09-13 2002-08-08 Xueshu Song Certification and manual compiling wizard
US20030014742A1 (en) * 2001-07-09 2003-01-16 Sasken Communication Technologies Limited Technique for compiling computer code to reduce energy consumption while executing the code
US20040078411A1 (en) * 2002-10-22 2004-04-22 Joshua Porten Galois field arithmetic unit for use within a processor
US20040123276A1 (en) * 2002-12-19 2004-06-24 Paul Knueven Assembly directives for the support of multi-language programming
US20070044079A1 (en) * 2005-06-02 2007-02-22 Tharas Systems Inc. A system and method for compiling a description of an electronic circuit to instructions adapted to execute on a plurality of processors

Family Cites Families (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
DE69124437T2 (en) * 1990-08-09 1997-07-03 Silicon Graphics Inc Method and device for reversing byte order in a computer
KR100248903B1 (en) 1992-09-29 2000-03-15 야스카와 히데아키 System and method for handling load and/or store operating in a superscalar microprocessor
US5867726A (en) * 1995-05-02 1999-02-02 Hitachi, Ltd. Microcomputer
US5987561A (en) 1995-08-31 1999-11-16 Advanced Micro Devices, Inc. Superscalar microprocessor employing a data cache capable of performing store accesses in a single clock cycle
US5784602A (en) * 1996-10-08 1998-07-21 Advanced Risc Machines Limited Method and apparatus for digital signal processing for integrated circuit architecture
US5951677A (en) * 1998-05-29 1999-09-14 Texas Instruments Incorporated Efficient hardware implementation of euclidean array processing in reed-solomon decoding
US6247158B1 (en) * 1998-11-30 2001-06-12 Itt Manufacturing Enterprises, Inc. Digital broadcasting system and method
US6282631B1 (en) * 1998-12-23 2001-08-28 National Semiconductor Corporation Programmable RISC-DSP architecture
US6832306B1 (en) * 1999-10-25 2004-12-14 Intel Corporation Method and apparatus for a unified RISC/DSP pipeline controller for both reduced instruction set computer (RISC) control instructions and digital signal processing (DSP) instructions
US6539467B1 (en) 1999-11-15 2003-03-25 Texas Instruments Incorporated Microprocessor with non-aligned memory access
US6732203B2 (en) * 2000-01-31 2004-05-04 Intel Corporation Selectively multiplexing memory coupling global bus data bits to narrower functional unit coupling local bus
JP2001338492A (en) * 2000-05-26 2001-12-07 Matsushita Electric Ind Co Ltd Semiconductor device and control method
US6922716B2 (en) * 2001-07-13 2005-07-26 Motorola, Inc. Method and apparatus for vector processing
US6986092B2 (en) * 2001-12-04 2006-01-10 Qualcomm Inc. Erasure-and-single-error correction decoder for linear block codes
US7360055B2 (en) * 2003-02-10 2008-04-15 Broadcom Corporation Two address map for transactions between an X-bit processor and a Y-bit wide memory
US7721069B2 (en) * 2004-07-13 2010-05-18 3Plus1 Technology, Inc Low power, high performance, heterogeneous, scalable processor architecture


Cited By (23)

Publication number Priority date Publication date Assignee Title
US20070011199A1 (en) * 2005-06-20 2007-01-11 Microsoft Corporation Secure and Stable Hosting of Third-Party Extensions to Web Services
US8849968B2 (en) 2005-06-20 2014-09-30 Microsoft Corporation Secure and stable hosting of third-party extensions to web services
US20100162185A1 (en) * 2005-08-12 2010-06-24 University Of Sussex Electronic circuit design
US7895560B2 (en) 2006-10-02 2011-02-22 William Stuart Lovell Continuous flow instant logic binary circuitry actively structured by code-generated pass transistor interconnects
US20080082786A1 (en) * 2006-10-02 2008-04-03 William Stuart Lovell Super-scalable, continuous flow instant logic™ binary circuitry actively structured by code-generated pass transistor interconnects
WO2008042186A2 (en) * 2006-10-02 2008-04-10 Lovell William S Information processing using binary gates structured by code-selected pass transistors
WO2008042186A3 (en) * 2006-10-02 2008-09-25 William S Lovell Information processing using binary gates structured by code-selected pass transistors
US20080215768A1 (en) * 2006-10-24 2008-09-04 Alastair David Reid Variable coherency support when mapping a computer program to a data processing apparatus
US8250549B2 (en) * 2006-10-24 2012-08-21 Arm Limited Variable coherency support when mapping a computer program to a data processing apparatus
US20080244507A1 (en) * 2007-03-30 2008-10-02 Microsoft Corporation Homogeneous Programming For Heterogeneous Multiprocessor Systems
US8789063B2 (en) 2007-03-30 2014-07-22 Microsoft Corporation Master and subordinate operating system kernels for heterogeneous multiprocessor systems
US20100042751A1 (en) * 2007-11-09 2010-02-18 Kouichi Ishino Data transfer control device, data transfer device, data transfer control method, and semiconductor integrated circuit using reconfigured circuit
US9710241B2 (en) * 2009-02-18 2017-07-18 Samsung Electronics Co., Ltd. Apparatus and method for providing instruction for heterogeneous processor
US20100211760A1 (en) * 2009-02-18 2010-08-19 Egger Bernhard Apparatus and method for providing instruction for heterogeneous processor
US8582654B1 (en) * 2009-09-30 2013-11-12 Hewlett-Packard Development Company, L.P. Generating a deblocked version of video frames using motion estimation
US8789026B2 (en) * 2011-08-02 2014-07-22 International Business Machines Corporation Technique for compiling and running high-level programs on heterogeneous computers
US8938725B2 (en) 2011-08-02 2015-01-20 International Business Machines Corporation Technique for compiling and running high-level programs on heterogeneous computers
US20130036408A1 (en) * 2011-08-02 2013-02-07 International Business Machines Corporation Technique for compiling and running high-level programs on heterogeneous computers
US8776035B2 (en) * 2012-01-18 2014-07-08 International Business Machines Corporation Providing performance tuned versions of compiled code to a CPU in a system of heterogeneous cores
US20130185704A1 (en) * 2012-01-18 2013-07-18 International Business Machines Corporation Providing performance tuned versions of compiled code to a cpu in a system of heterogeneous cores
US9195443B2 (en) 2012-01-18 2015-11-24 International Business Machines Corporation Providing performance tuned versions of compiled code to a CPU in a system of heterogeneous cores
US9652817B2 (en) 2015-03-12 2017-05-16 Samsung Electronics Co., Ltd. Automated compute kernel fusion, resizing, and interleave
US11755382B2 (en) * 2017-11-03 2023-09-12 Coherent Logix, Incorporated Programming flow for multi-processor system

Also Published As

Publication number Publication date
US8386751B2 (en) 2013-02-26
US20110131393A1 (en) 2011-06-02
WO2006017482A2 (en) 2006-02-16
US20070294511A1 (en) 2007-12-20
US20060015703A1 (en) 2006-01-19
WO2006017482A3 (en) 2007-11-15
US20130238878A1 (en) 2013-09-12
US8880850B2 (en) 2014-11-04
US7721069B2 (en) 2010-05-18

Similar Documents

Publication Publication Date Title
US20060026578A1 (en) Programmable processor architecture hirarchical compilation
US7278137B1 (en) Methods and apparatus for compiling instructions for a data processor
US20190004878A1 (en) Processors, methods, and systems for a configurable spatial accelerator with security, power reduction, and performance features
KR101713815B1 (en) A tile-based processor architecture model for high efficiency embedded homogeneous multicore platforms
Lodi et al. A VLIW processor with reconfigurable instruction set for embedded applications
US6826674B1 (en) Program product and data processor
US20130290693A1 (en) Method and Apparatus for the Automatic Generation of RTL from an Untimed C or C++ Description as a Fine-Grained Specialization of a Micro-processor Soft Core
Glossner et al. The sandbridge sb3011 platform
David et al. DART: A Dynamically Reconfigurable Architecture Dealing with Future Mobile Telecommunications Constraints.
US20220100680A1 (en) Apparatuses, methods, and systems for a configurable accelerator having dataflow execution circuits
Voitsechov et al. Inter-thread communication in multithreaded, reconfigurable coarse-grain arrays
Lodi et al. A pipelined configurable gate array for embedded processors
US9329872B2 (en) Method and apparatus for the definition and generation of configurable, high performance low-power embedded microprocessor cores
Sias et al. Enhancing loop buffering of media and telecommunications applications using low-overhead predication
Gesper et al. Evaluation of different processor architecture organizations for on-site electronics in Harsh environments
US20120017067A1 (en) On-demand predicate registers
Glossner et al. The sandbridge sandblaster communications processor
Podobas Accelerating parallel computations with openmp-driven system-on-chip generation for fpgas
Balfour Efficient embedded computing
Glossner et al. A multithreaded processor architecture for SDR
Vassiliadis et al. The arise approach for extending embedded processors with arbitrary hardware accelerators
Chen et al. Flexible heterogeneous multicore architectures for versatile media processing via customized long instruction words
Raghavan et al. Distributed loop controller for multithreading in unithreaded ILP architectures
Nguyen et al. Design and Implementation of a Coarse-grained Dynamically Reconfigurable Multimedia Accelerator
Jungeblut et al. A systematic approach for optimized bypass configurations for application-specific embedded processors

Legal Events

Date Code Title Description
AS Assignment

Owner name: 3PLUS1 TECHNOLOGY, INC., CALIFORNIA

Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:RAMCHANDRAN, AMIT;HAUSER, JR., JOHN REID;REEL/FRAME:016985/0701

Effective date: 20050802

STCB Information on status: application discontinuation

Free format text: ABANDONED -- FAILURE TO RESPOND TO AN OFFICE ACTION

AS Assignment

Owner name: 3PLUS1 TECHNOLOGY, INC., CALIFORNIA

Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:HAUSER, JOHN REID, JR.;RAMCHANDRAN, AMIT;SIGNING DATES FROM 20100806 TO 20100816;REEL/FRAME:024878/0807

AS Assignment

Owner name: ICELERO LLC, DELAWARE

Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNOR:3PLUS1 TECHNOLOGY INC.;REEL/FRAME:024892/0154

Effective date: 20100818

AS Assignment

Owner name: SATELLITE INVESTORS, LLC, DELAWARE

Free format text: SECURITY AGREEMENT;ASSIGNOR:ICELERO, LLC;REEL/FRAME:024900/0800

Effective date: 20100818