US20060026578A1 - Programmable processor architecture hierarchical compilation - Google Patents


Info

Publication number
US20060026578A1
US20060026578A1 (application US11/195,429)
Authority
US
United States
Prior art keywords
block
code
sub
processor
scenario
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Abandoned
Application number
US11/195,429
Inventor
Amit Ramchandran
John Hauser
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Icelero LLC
Original Assignee
3Plus1 Technology Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 3Plus1 Technology Inc filed Critical 3Plus1 Technology Inc
Priority to US11/195,429 priority Critical patent/US20060026578A1/en
Assigned to 3PLUS1 TECHNOLOGY, INC. reassignment 3PLUS1 TECHNOLOGY, INC. ASSIGNMENT OF ASSIGNORS INTEREST (SEE DOCUMENT FOR DETAILS). Assignors: HAUSER, JR., JOHN REID, RAMCHANDRAN, AMIT
Publication of US20060026578A1 publication Critical patent/US20060026578A1/en
Assigned to 3PLUS1 TECHNOLOGY, INC. reassignment 3PLUS1 TECHNOLOGY, INC. ASSIGNMENT OF ASSIGNORS INTEREST (SEE DOCUMENT FOR DETAILS). Assignors: RAMCHANDRAN, AMIT, HAUSER, JOHN REID, JR.
Assigned to ICELERO LLC reassignment ICELERO LLC ASSIGNMENT OF ASSIGNORS INTEREST (SEE DOCUMENT FOR DETAILS). Assignors: 3PLUS1 TECHNOLOGY INC.
Assigned to SATELLITE INVESTORS, LLC reassignment SATELLITE INVESTORS, LLC SECURITY AGREEMENT Assignors: ICELERO, LLC
Abandoned legal-status Critical Current


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30 Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/38 Concurrent instruction execution, e.g. pipeline, look ahead
    • G06F9/3877 Concurrent instruction execution, e.g. pipeline, look ahead using a slave processor, e.g. coprocessor
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F1/00 Details not covered by groups G06F3/00 - G06F13/00 and G06F21/00
    • G06F1/26 Power supply means, e.g. regulation thereof
    • G06F1/32 Means for saving power
    • G06F1/3203 Power management, i.e. event-based initiation of a power-saving mode
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F15/00 Digital computers in general; Data processing equipment in general
    • G06F15/76 Architectures of general purpose stored program computers
    • G06F15/78 Architectures of general purpose stored program computers comprising a single central processing unit
    • G06F15/7807 System on chip, i.e. computer system on a single chip; System in package, i.e. computer system on one or more chips in a single package
    • G06F15/781 On-chip cache; Off-chip memory
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F15/00 Digital computers in general; Data processing equipment in general
    • G06F15/76 Architectures of general purpose stored program computers
    • G06F15/78 Architectures of general purpose stored program computers comprising a single central processing unit
    • G06F15/7839 Architectures of general purpose stored program computers comprising a single central processing unit with memory
    • G06F15/7842 Architectures of general purpose stored program computers comprising a single central processing unit with memory on one IC chip (single chip microcontrollers)
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F15/00 Digital computers in general; Data processing equipment in general
    • G06F15/76 Architectures of general purpose stored program computers
    • G06F15/78 Architectures of general purpose stored program computers comprising a single central processing unit
    • G06F15/7867 Architectures of general purpose stored program computers comprising a single central processing unit with reconfigurable architecture
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F15/00 Digital computers in general; Data processing equipment in general
    • G06F15/76 Architectures of general purpose stored program computers
    • G06F15/80 Architectures of general purpose stored program computers comprising an array of processing units with common control, e.g. single instruction multiple data processors
    • G06F15/8053 Vector processors
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30 Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/30003 Arrangements for executing specific machine instructions
    • G06F9/30007 Arrangements for executing specific machine instructions to perform operations on data operands
    • G06F9/3001 Arithmetic instructions
    • G06F9/30014 Arithmetic instructions with variable precision
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30 Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/30003 Arrangements for executing specific machine instructions
    • G06F9/30007 Arrangements for executing specific machine instructions to perform operations on data operands
    • G06F9/30032 Movement instructions, e.g. MOVE, SHIFT, ROTATE, SHUFFLE
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30 Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/30003 Arrangements for executing specific machine instructions
    • G06F9/30007 Arrangements for executing specific machine instructions to perform operations on data operands
    • G06F9/30036 Instructions to perform operations on packed data, e.g. vector, tile or matrix operations
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30 Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/38 Concurrent instruction execution, e.g. pipeline, look ahead
    • G06F9/3824 Operand accessing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30 Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/38 Concurrent instruction execution, e.g. pipeline, look ahead
    • G06F9/3824 Operand accessing
    • G06F9/3826 Bypassing or forwarding of data results, e.g. locally between pipeline stages or within a pipeline stage
    • G06F9/3828 Bypassing or forwarding of data results, e.g. locally between pipeline stages or within a pipeline stage with global bypass, e.g. between pipelines, between clusters
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30 Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/38 Concurrent instruction execution, e.g. pipeline, look ahead
    • G06F9/3885 Concurrent instruction execution, e.g. pipeline, look ahead using a plurality of independent parallel functional units
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30 Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/38 Concurrent instruction execution, e.g. pipeline, look ahead
    • G06F9/3885 Concurrent instruction execution, e.g. pipeline, look ahead using a plurality of independent parallel functional units
    • G06F9/3893 Concurrent instruction execution, e.g. pipeline, look ahead using a plurality of independent parallel functional units controlled in tandem, e.g. multiplier-accumulator
    • G06F9/3895 Concurrent instruction execution, e.g. pipeline, look ahead using a plurality of independent parallel functional units controlled in tandem, e.g. multiplier-accumulator for complex operations, e.g. multidimensional or interleaved address generators, macros

Definitions

  • This invention relates generally to the field of processors and, more particularly, to processors having low power consumption, high performance, and low die area that can be flexibly and scalably employed in multimedia and communications applications.
  • One of the emerging, if not current, requirements of processors is low power consumption combined with the ability to execute code sufficient to process multiple applications. Current power consumption is on the order of sub-hundreds of milliwatts per application, whereas the goal is to remain under sub-hundreds of milliwatts while executing multiple applications. Another requirement of processors is low cost. Due to the wide utilization of processors in consumer products, the processor must be inexpensive to manufacture; otherwise, its use in most common consumer electronics is not pragmatic.
  • RISC and Super Scalar processors have been the most widely accepted architectural solution for all general purpose computing. They are often enhanced with application specific accelerators for solving certain specialized problems within the context of a general solution.
  • Examples include: ARM series, ARC series, StrongARM series, and MIPS series.
  • VLIW architectures eliminated some of the inefficiencies found in RISC and Super Scalar architectures to create a fairly general solution in the digital signal processing space. Parallelism was significantly increased. The onus of scheduling was transferred from hardware to software to save area.
  • Examples include: TI 64xx, TI 55xx, StarCore SC140, ADI SHARC series.
  • one embodiment of the present invention includes a heterogeneous, high-performance, scalable processor having at least one W-type sub-processor capable of processing W bits or greater in parallel, W being an integer value, and at least one N-type sub-processor capable of processing N bits in parallel, N being an integer value smaller than W.
  • a scenario compiler is included in a hierarchical flow of compilation and used with other compilation and assembler blocks to generate binary code based on different types of codes to allow for efficient processing based on the sub-processors while maintaining low power consumption when the binary code is executed.
  • FIG. 1 shows an application 10 with reference to a digital product 12 including an embodiment of the present invention
  • FIG. 2 shows an exemplary integrated circuit 20 including a heterogeneous, high-performance, scalable processor 22 coupled to a memory controller and direct memory access (DMA) circuit 24 in accordance with an embodiment of the present invention.
  • FIG. 3 shows, in conceptual form, an architecture 300 including software architecture 302 in combination with some of the hardware components 304 of the circuit 20 of FIG. 1 .
  • FIG. 4 shows, in conceptual form, the process of hierarchical software compilation, in block flow form, in accordance with a method of the present invention.
  • FIG. 5 shows, in conceptual form, the process of and apparatus for latency verification used in assembly coding and included within the block 434 of FIG. 4 in accordance with a method and apparatus of the present invention.
  • a sub-processor (“CoolProcessor”) employing logic “macro-functional-units” (function-oriented dedicated logic), replacing the classical fixed datapath, complex execution unit, and register set used in general purpose CPUs and DSP engines, and also replacing the identical processing element used in homogeneous multiprocessors (MSs).
  • the sub-processor employs a heterogeneous interconnect between heterogeneous processors, designed to match multimedia and communications applications.
  • a processor 22 comprises a plurality of sub-processors.
  • the four sub-processors are split into two categories.
  • the letter “W” designates CoolW sub-processors, capable of handling operands requiring wide datapaths.
  • the CoolW sub-processor supports a wider range of data bits.
  • the sub-processor is also capable of executing 64-bit IEEE-standard floating-point instructions. Its performance is greater than 49 MFLOPS at 150 MHz.
  • the floating-point instruction set includes addition, subtraction, and multiplication.
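  • The figures quoted above imply roughly one floating-point result every three clock cycles. The function below is only a unit-conversion sanity check on those stated numbers; its name and parameters are illustrative, not from the patent.

```c
#include <stdio.h>

/* Cycles available per floating-point operation, given a throughput in
 * MFLOPS and a clock in MHz. The example figures (49 MFLOPS, 150 MHz)
 * come from the text above. */
double cycles_per_flop(double mflops, double clock_mhz)
{
    /* (cycles per second) / (FLOPs per second) = cycles per FLOP */
    return clock_mhz / mflops;
}
```

At 49 MFLOPS and a 150 MHz clock this gives about 3.06 cycles per 64-bit floating-point operation.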
  • the letter “N” indicates a CoolN sub-processor serving narrow datapaths, such as required for average-quality imaging and finite-field operations in communications.
  • Each sub-processor comprises a heterogeneous, software-programmable datapath connecting its compute engines, whether of the CoolW sub-processor type or of the CoolN sub-processor type.
  • the internal compute engines are referred to as MFUs (macro-functional units). Multiple instances of the MFUs are nonuniformly distributed between the two types of sub-processors.
  • a control circuit within each sub-processor operates as an engine and is a high-level-language programmable controller for the sub-processor.
  • the control circuit is aided by a core sequencer underscoring the hard-wired nature of the MFUs: each unit is aimed at executing efficiently only a fraction of the overall job.
  • a rather large instruction memory, per sub-processor, holds code for the control circuit, internal interconnects, I/O, and MFUs requiring it.
  • Sufficient shared buffer memory is provided to store operands and results of complex computations that make average demands on operand life.
  • a general purpose processor (referred to as “ARM926” in the provisional application) runs system software and generic applications (applications other than multimedia and communications).
  • the GPP includes its own instruction and data memory or cache.
  • the interconnect is based on the Sonics “smart” SoC bus.
  • An SoC architecture can include any number of sub-processors, but the number of sub-processors defines the number of threads, as will be apparent shortly.
  • In FIG. 1 , an application 10 is shown with reference to a digital product 12 including an embodiment of the present invention.
  • FIG. 1 is intended to provide the reader with a perspective regarding some, but not necessarily all, of the advantages of a product, which includes an embodiment of the present invention relative to those available in the marketplace.
  • the product 12 is a converging product in that it incorporates all of the applications that need to be executed by today's mobile phone device 14 , digital camera device 16 , digital recording or music device 18 and PDA device 20 .
  • the product 12 is capable of executing one or more of the functions of the devices 14 - 20 simultaneously yet utilizing less power.
  • the product 12 is typically battery-operated and therefore consumes little power even when executing multiple of the applications executed by the devices 14 - 20 . It is also capable of executing code to effectuate operations in conformance with a multitude of applications including but not limited to: H.264, MPEG-4, UWB, Bluetooth, 2G/2.5G/3G/4G, GPS, MP3 and security.
  • FIG. 2 shows an exemplary integrated circuit 20 including a heterogeneous, high-performance, scalable processor 22 coupled to a memory controller and direct memory access (DMA) circuit 24 in accordance with an embodiment of the present invention.
  • the processor 22 is coupled to the interface circuit 26 through a general purpose bus 30 , to the interface circuit 28 through a general purpose bus 31 , and further coupled to a general purpose processor 32 through the bus 31 .
  • the circuit 20 is further shown to include a clock, reset and power management circuit 34 for generating a clock utilized by the remaining circuits of the circuit 20 , a reset signal utilized in the same manner, and circuitry for managing power for the same.
  • the interface circuit 26 shown coupled to the bus 30 and interface circuit 28 , shown coupled to the bus 31 , include the blocks 40 - 66 , which are generally known to those of ordinary skill in the art and used by current processors.
  • the processor 22 which is a heterogeneous multi-processor, is shown to include shared data memory 70 , shared data memory 72 , a CoolW sub-processor (or block) 74 , a CoolW sub-processor (or block) 76 , a CoolN sub-processor (or block) 78 and a CoolN sub-processor (or block) 80 .
  • Each of the blocks 74 - 80 has associated therewith an instruction memory, for example, the CoolW block 74 has associated therewith an instruction memory 82 , the CoolW block 76 has associated therewith an instruction memory 84 , CoolN block 78 has associated therewith an instruction memory 86 and the CoolN block 80 has associated therewith an instruction memory 88 .
  • each of the blocks 74 - 80 has associated therewith a control block.
  • the block 74 has associated therewith a control block 90
  • the block 76 has associated therewith a control block 92
  • the block 78 has associated therewith a control block 94
  • the block 80 has associated therewith a control circuit 96 .
  • the blocks 74 and 76 are designed to generally operate efficiently for 16-, 24-, 32- and 64-bit operations or applications.
  • the blocks 78 and 80 are designed to generally operate efficiently for 1, 4, or 8-bit operations or applications.
  • the blocks 74 - 80 are essentially sub-processors; the CoolW blocks 74 and 76 are wide (or W) type blocks, whereas the CoolN blocks 78 and 80 are narrow (or N) type blocks. Wide and narrow refer to the relative number of parallel bits processed or routed within a sub-processor, which gives the processor 22 its heterogeneous characteristic.
  • the circuit 24 is coupled directly to one of the sub-processors, i.e. one of the blocks 74 - 80 resulting in the lowest latency path through the sub-processor to which it is coupled. In FIG. 2 , the circuit 24 is shown directly coupled to the block 76 although it may be coupled to any of the blocks 74 , 78 or 80 . Higher priority agents or tasks may be assigned to the block which is directly coupled to the circuit 24 .
  • Complicated applications requiring great processing power are not scattered in the circuit 20 , rather, they are grouped or confined to a particular sub-processor or block for processing, which substantially improves power consumption by eliminating or at least reducing wire (metal) or routing lengths thereby reducing wire capacitance. Additionally, utilization is increased and activity is reduced contributing to lower power consumption.
  • the circuit 20 is an example of a system on chip (or SoC) offering Quasi-Adiabatic Programmable sub-processors for multimedia and communications applications; two types of sub-processors are included, as previously indicated: W type and N type.
  • The W type or Wide type sub-processor is designed for high Power, Price, Performance efficiency in applications requiring 16-, 24-, 32- and 64-bit processing.
  • The N type or Narrow type sub-processor is designed for high efficiency in applications requiring 8-, 4- and 1-bit processing. While these bit widths are used in the embodiments of the present invention, by way of figures and description, other numbers of bits may be readily employed.
  • Such applications include, but are not limited to, fast Fourier transform (FFT) or inverse FFT (IFFT), adaptive finite impulse response (FIR) filters, discrete cosine transform (DCT) or inverse DCT (IDCT), Real/Complex FIR filter, IIR filter, Root Raised Cosine (RRC) filter, Color Space Converter, 3D Bilinear Texture Mapping, Gouraud Shading, Golay Correlation, Bilinear Interpolation, Median/Row/Column Filter, Alpha Blending, Higher-Order Surface Tessellation, Vertex Shade (Trans/Light), Triangle Setup, Full-Screen Anti-aliasing and Quantization.
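  • As a point of reference for the workload such kernels represent, a real FIR filter reduces to the multiply-accumulate loop below. This plain-C version is only an illustration of the computation; on the processor described it would be hand-coded in assembly for a CoolW sub-processor, and the function and parameter names are assumptions, not taken from the patent.

```c
#include <stddef.h>

/* Real FIR filter, "valid" convolution: y[n] = sum_k h[k] * x[n+taps-1-k].
 * Produces x_len - taps + 1 outputs. The inner multiply-accumulate is
 * the kind of wide-datapath work mapped onto a W-type sub-processor's
 * macro-functional units. */
void fir_filter(const float *x, size_t x_len,
                const float *h, size_t taps,
                float *y)
{
    for (size_t n = 0; n + taps <= x_len; n++) {
        float acc = 0.0f;
        for (size_t k = 0; k < taps; k++)
            acc += h[k] * x[n + taps - 1 - k];
        y[n] = acc;
    }
}
```

With a two-tap filter h = {1, 1} this degenerates to a moving sum of adjacent samples.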
  • Applications suited to N type sub-processors, such as the blocks 78 and 80 , include, but are not limited to, Variable Length Codec, Viterbi Codec, Turbo Codec, Cyclic Redundancy Check, Walsh Code Generator, Interleaver/De-Interleaver, LFSR, Scrambler, De-spreader, Convolution Encoder, Reed-Solomon Codec, Scrambling Code Generator, and Puncturing/De-puncturing.
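  • To illustrate why such kernels suit a narrow datapath, the sketch below implements an LFSR-driven scrambler in plain C; the work is entirely single-bit. The polynomial, seed, and names are illustrative assumptions, not details from the patent.

```c
#include <stddef.h>
#include <stdint.h>

/* One step of a 16-bit Galois LFSR (polynomial mask 0xB400). Each step
 * is 1-bit work: the granularity the N-type (CoolN) sub-processors are
 * designed around. */
uint16_t lfsr_step(uint16_t state)
{
    uint16_t lsb = state & 1u;
    state >>= 1;
    if (lsb)
        state ^= 0xB400u;
    return state;
}

/* Scramble (or, applied twice with the same seed, de-scramble) a buffer
 * by XORing it with the LFSR keystream. Returns the final LFSR state so
 * a stream can be processed in pieces. */
uint16_t scramble(uint8_t *buf, size_t len, uint16_t state)
{
    for (size_t i = 0; i < len; i++) {
        uint8_t keystream = 0;
        for (int b = 0; b < 8; b++) {
            keystream = (uint8_t)((keystream << 1) | (state & 1u));
            state = lfsr_step(state);
        }
        buf[i] ^= keystream;
    }
    return state;
}
```

Because scrambling is an XOR, running `scramble` twice with the same seed restores the original data, which is how a descrambler on the receiving side recovers it.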
  • Both W and N type sub-processors are capable of keeping net activity and the resulting energy per transition low while maintaining high performance with increased utilization in comparison with existing architectural approaches like RISC, Reconfigurable, Superscalar, VLIW and Multi-processor approaches.
  • the sub-processor architecture of the processor 22 reduces die size resulting in an optimal processing solution and includes a novel architecture referred to as “Quasi-Adiabatic” or “COOL” architecture.
  • Programmable processors in accordance therewith are referred to as Quasi-Adiabatic Programmable or COOL Processors.
  • Quasi-Adiabatic Programmable or COOL Processors optimize data path, control, memory and functional unit granularity to match a finite subset of applications, as described previously. The way in which this is accomplished will be clear relative to a discussion and presentation of figures relating to the different units or blocks or circuits and their inter-operations of the processor 22 , as presented below.
  • Adiabatic processes waste no heat and transfer all of the energy used into performing useful work. Due to the non-adiabatic nature of existing standard processes, circuit design, and logic cell library design techniques, one cannot ever make a truly adiabatic processor. However, among the different possible processor architectures, some may be closer to adiabatic than others.
  • the various embodiments of the present invention show a class of processor architectures which are significantly closer to adiabatic than the architectures of the prior art while nevertheless remaining programmable. They are referred to as “Quasi-Adiabatic Programmable Processors”.
  • the integrated circuit 20 allows as many applications as can be supported by the resources within the processor 22 to be executed together or concurrently and the number of such applications far exceeds that which is supported by current processors.
  • Examples of applications that can be simultaneously or concurrently executed by the integrated circuit 20 include, but are not limited to, downloading an application from a wireless device while decoding a movie that has been received; thus, a movie can be downloaded and decoded simultaneously.
  • Due to achieving simultaneous application execution on the integrated circuit 20 which has a small die size or silicon real estate as compared to the number of applications it supports, costs of manufacturing the integrated circuit are significantly lower than that which is required for multiple devices of FIG. 1 .
  • the processor 22 offers a single programmable framework to a user to implement multiple functions, such as multimedia complex applications. Of important value is the ability of the integrated circuit 20 and namely, the processor 22 , to support future standards adopted by the industry, which are expected to be of greater complexity than that of today's standards.
  • Each of the blocks 74 - 80 can execute only one sequence (or stream) of programs at a given time.
  • a sequence of programs refers to a function associated with a particular application.
  • FFT is a type of sequence.
  • different sequences may be dependent on one another.
  • an FFT program, once completed, may store its results in the memory 70 , and the next sequence may then use the stored result.
  • Different sequences sharing information in this manner, or being dependent upon each other in this manner, are referred to as a stream flow.
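  • The dependency just described can be modeled as a producer sequence that stores its result in shared memory and raises a completion flag, and a consumer sequence that runs only once the flag is set. This is a behavioral sketch only; the buffer, flag, and function names are assumptions, and the actual sub-processors would signal completion in hardware rather than through a C variable.

```c
#include <stdbool.h>

#define BLOCK_WORDS 256

int  shared_mem[BLOCK_WORDS];   /* stands in for shared data memory 70 */
bool result_ready = false;      /* completion signal from the producer */

/* First sequence (e.g. an FFT): store results, then signal completion. */
void producer_sequence(const int *result, int n)
{
    for (int i = 0; i < n; i++)
        shared_mem[i] = result[i];
    result_ready = true;
}

/* Dependent sequence: refuses to run until the stored result exists,
 * then consumes it (here, trivially, by summing it). */
int consumer_sequence(int n)
{
    if (!result_ready)
        return -1;              /* nothing to consume yet */
    int sum = 0;
    for (int i = 0; i < n; i++)
        sum += shared_mem[i];
    result_ready = false;       /* result has been consumed */
    return sum;
}
```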
  • the memories 70 and 72 each include 8 blocks of 16 kilobytes of memory, however, in other embodiments, different size memory may be utilized.
  • the instruction memories 82 , 84 , 86 and 88 are used to store instructions for execution by the blocks 74 - 80 , respectively.
  • FIG. 3 shows, in conceptual form, an architecture 300 including software architecture 302 in combination with some of the hardware components 304 of the circuit 20 of FIG. 1 .
  • the hardware components 304 includes the processor 32 , the circuit 26 and circuit 28 and the processor 22 , as described and shown with respect to previous figures.
  • a hardware abstraction layer or low level drivers 306 and an operating systems driver 308 cause interfacing or communication between the hardware components 304 and the software architecture 302 .
  • the software architecture 302 is further shown to include a CoolBios (basic input/output system) 310 coupled to the hardware components 304 and to a scenario 312 , which is for causing multiple applications 314 to be executed, each application 314 including kernels 316 for execution of computationally-intense functions, such as fast Fourier transforms (FFTs), DCTs, finite impulse response (FIR) filtering and others known in the industry.
  • the software architecture 302 is further shown to include a system level software changes scenarios block 318 , which is shown to communicate with an operating systems interface (OSI) 322 and an operating system 320 .
  • the operating system 320 is further shown to communicate with the scenario 312 , applications 314 , and kernels 316 .
  • the kernels 316 are engines for execution of computationally intensive code, generally in assembly, or low level code.
  • Each of the applications 314 includes many kernels, such as the kernels 316 : conditional encoding (CE), cyclic redundancy coding (CRC), down sampling (DS), variable length coding (VLC), discrete cosine transform (DCT), motion estimation (ME), motion compensation (MC), etc., that consume most of the compute time in an application.
  • the scenario-level software 310 contains hooks to quasi-statically change the execution pattern of applications contained within that scenario.
  • the scenario 318 causes scenarios to be changed while running on the hardware 304 .
  • each of the kernels 316 is written in assembly code for executing an FFT or other computationally-intensive functions while the scenario 312 and each of the applications 314 are in a higher level language, such as “C” for reasons that will become apparent shortly.
  • the combination of assembly and a higher level language being executed on a CoolW or CoolN sub-processor and a control block included therein, per the hardware architecture of FIG. 2 , causes simultaneous or concurrent execution of applications, in a hierarchical manner, while maintaining low power consumption.
  • the CoolBios 310 includes a set of software functions that allow input and output communication with the processor 22 and eliminates the need for a full operating system running on the processor 22 .
  • the hardware component 304 and software architecture 302 provide an environment to load and execute a multi-application scenario.
  • a “scenario”, as referred to herein, is a set of applications, such as the applications 314 , executing concurrently. Some examples of each of the applications 314 , as shown in FIG. 3 , include but are not limited to JPEG, MP3, H.264 and 802.11g.
  • a scenario 312 interfaces with the operating system 320 and higher-level software through the OSI 322 and the drivers 308 .
  • the software architecture 302 and the hardware components 304 of FIG. 3 allow an operating system (OS) to be loaded onto the processor 32 along with the drivers 308 , to ultimately allow a scenario 312 to be loaded for causing multiple applications to be executed concurrently.
  • the scenario 312 includes information, in its header, overhead information, to cause turning on or off each of the different applications 314 .
  • the JPEG application can be turned off while the remaining applications, such as MP3, H.264 and 802.11g remain on. This effectively aids in reducing power consumption, as the need for power is reduced when an application that is not currently being used is turned off.
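  • One minimal way to realize the on/off information just described is a bitmask in the scenario header, one bit per application. The patent does not specify the header layout; the struct and names below are purely illustrative.

```c
#include <stdint.h>

/* Hypothetical scenario header: bit i of active_mask is set when
 * application i in the scenario is turned on. */
enum { APP_JPEG = 0, APP_MP3 = 1, APP_H264 = 2, APP_80211G = 3 };

typedef struct {
    uint32_t active_mask;
} scenario_header;

void app_on(scenario_header *h, int app)
{
    h->active_mask |= (uint32_t)1 << app;
}

void app_off(scenario_header *h, int app)
{
    h->active_mask &= ~((uint32_t)1 << app);
}

int app_is_on(const scenario_header *h, int app)
{
    return (int)((h->active_mask >> app) & 1u);
}
```

Turning JPEG off while MP3, H.264 and 802.11g stay on is then a single bit clear, matching the power-saving behavior described above.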
  • Remaining processing power, i.e. that which is not currently being used, may be devoted to executing a new application, with some limitations, as are now discussed.
  • the scenario 312 is pre-compiled and quasi-statically scheduled, which refers to turning applications on or off.
  • the pre-compiled and scheduled scenario 312 , which is in binary form, is then stored in one of the sub-processors, such as the sub-processor 74 . Turning off an application prevents “choking” of the system; that is, bandwidth is improved.
  • the system level software changes scenarios 318 causes changing of the scenario 312 , which, as previously-stated, may be done dynamically.
  • the code in the latter is in “C” or a high level code.
  • the scenario 312 is written in scenario descriptive language (SDL), which is a unique and proprietary language with all rights reserved by 3Plus1 Technology, Inc. of Saratoga, Calif.
  • the hierarchical software/compiler characteristics of the architecture 300 are shown, in conceptual form, in reference to a software tools hierarchical column 340 , a hierarchical level column 342 and a hardware hierarchical levels column 344 .
  • Each of the pieces of software of the software architecture 302 is taken through a different tool so as to avoid a flat methodology.
  • the drivers 306 and 308 are used as tools for the general purpose processor (GPP) 32 on the highest level of the tool column 340 while, in the next level of the hierarchical tools, a scenario compiler 348 is used, by an application programmer, to allocate resources and to cause execution on one or more particular sub-processors.
  • the kernels 316 are then advantageously partitioned. An application is divided into smaller portions, or threads, with execution switching from one kernel to another.
  • the number of threads is limited to the number of sub-processors.
  • applications are handed from one kernel to another by having the currently operating kernel 316 finish a particular function, save the result of the function in shared memory and signal completion of its function; another kernel 316 then utilizes the stored information in shared memory to perform another function.
  • a synchronization code is used for this hand-off, which is done by the scenario 312 ; the particular tool is the scenario compiler 348 and the process is automated. Thus, synchronization and control code are generated automatically due to the presence of the thread.
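The hand-off described above can be sketched in C. This is purely illustrative, not taken from the patent: the names, the shared-memory layout and the flag convention are all invented here. The producing kernel saves its result in shared memory and signals completion, and the consuming kernel proceeds only once that signal is seen.

```c
#include <assert.h>
#include <string.h>

enum { BUF_WORDS = 4 };

typedef struct {
    int data[BUF_WORDS];    /* result produced by the first kernel */
    int done;               /* completion flag written by producer */
} shared_block;

/* First kernel: computes a function and signals completion. */
static void kernel_a(shared_block *shm)
{
    for (int i = 0; i < BUF_WORDS; i++)
        shm->data[i] = i * i;   /* stand-in for the real function */
    shm->done = 1;              /* synchronization: signal the hand-off */
}

/* Second kernel: runs only once its trigger condition holds. */
static int kernel_b(shared_block *shm)
{
    if (!shm->done)             /* hand-off not yet signaled */
        return -1;
    int sum = 0;
    for (int i = 0; i < BUF_WORDS; i++)
        sum += shm->data[i];    /* consume the stored result */
    return sum;
}
```

In the architecture described, such flag checks would be emitted automatically by the scenario compiler rather than written by hand.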
  • a controller/compiler 350 is used to compile a high level language being employed, such as “C”, which includes two parts, an optimizing assembler 352 and a low level assembler 354 .
  • the goal is to allow the programmer to write mostly C or high level code, rather than assembly, as the former is easier. This is easily allowed for given the sub-processor and hierarchical architecture of the present invention.
  • the compiler 350 is optimized for each sub-processor, such as CoolW or CoolN. That is, high level code, written by a user or programmer is compiled, pursuant to certain rules, for storage and execution by a sub-processor and a control block located therein, as previously shown and discussed.
  • multiple applications can be performed; for example, a digital camera application and a PDA application can be performed in a single device simultaneously.
  • the ability to do so results in foregoing the dynamic ability to change or add a scenario, as might be done in a personal computer, but this limitation is completely tolerable, as a device that is to be used with a certain scenario need not normally be quickly reprogrammed to include another scenario in mobile handheld device applications.
  • a manufacturer introduces a product, such as a PDA
  • this is compiled along with other applications, such as a digital camera or MP3, etc.
  • a pre-compiled binary code is created using the hierarchical software tools compilation and the sub-processor-based hardware architecture of the present invention.
  • Such a pre-compiled code and multiple applications make up a scenario. While another scenario may be pre-compiled, it is a rare occurrence due to the reluctance of the manufacturer to quickly introduce another product. Given time, another product is likely to be introduced warranting another scenario, but the need to switch to another scenario is far from urgent.
  • column 342 states which part or component of the hardware is utilized for the corresponding tool of column 340 . That is, viewed in a row, each location of the row within column 342 corresponds to a like-location in column 340 .
  • the GPP IDE 346 is handled by the hardware 304
  • the scenario compiler 348 is handled by the processor 22
  • the compiler 350 is handled by a control block of one of the sub-processors
  • the assemblers 352 and 354 are handled by one or more of the sub-processors, such as the sub-processor 72 .
  • the particular hardware hierarchical levels are correspondingly enumerated in column 344 by reference.
  • at the low level assembler 354 , scheduling is done and all of the hardware components are available, whereas the optimizing assembler 352 includes more restrictions because it operates at a higher level but is able to schedule more. Area and power are saved by less scheduling.
  • the hierarchical flow of column 340 and the hardware architecture of the processor of FIG. 2 allow for an efficient, low power and flexible processing tool. In the prior art, while the assembler 354 and the GPP IDE 346 are used, the remainder of the column 340 is not.
  • a hierarchical compilation involves partitioning the application code into a general purpose processor component that allows interaction between the processor 32 and all other hardware components.
  • This general-purpose processor component is mainly to allow switching between different scenarios.
  • Scenarios are compiled to run on a combination of multiple sub-processors that communicate through shared memory.
  • the scenario compiler 348 is the tool that schedules the coarse grain data dependency graph wherein kernels and control code in one or more applications communicate with each other and with the controlling general-purpose processor. Dependencies are resolved to determine trigger conditions based upon which synchronization code is generated to evaluate these conditions at run-time.
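One minimal way to picture the resolved dependencies and the run-time trigger conditions just described, using an invented bitmask representation that the patent does not specify:

```c
#include <assert.h>

/* Illustrative sketch only, not the patent's actual tool output: each
 * kernel in the coarse grain data dependency graph carries a bitmask of
 * the kernels it depends on.  The generated synchronization code boils
 * down to a run-time check that all producer kernels have completed. */

typedef struct {
    unsigned deps;   /* bit i set => this kernel depends on kernel i */
} kernel_node;

/* Trigger condition: ready when every dependency's "done" bit is set. */
static int kernel_ready(const kernel_node *k, unsigned done_mask)
{
    return (k->deps & done_mask) == k->deps;
}
```

Evaluating such a condition at run-time is cheap, which is why the dependency resolution itself can be left entirely to compile time.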
  • the compiler 350 targets the subset of a sub-processor, or the control block located therein (such as the control block 90 ), that executes application control code and the scenario control and synchronization code.
  • the optimizing assembler 352 and the low-level assembler 354 target functions written in sub-processor assembly. They incorporate many scheduling techniques often found in higher level compilers such as register allocation and software pipelining. The binary software objects generated by these assemblers execute either on a CoolW or CoolN processor.
  • the scenario compiler 348 schedules the correct operation of the applications' functions and allocates data resources.
  • the scenario compiler uses scheduling algorithms from the existing art to create the schedule.
  • the scenario compiler 348 emits the source code (a compilable program, written in C) that implements the scenario scheduler.
  • the scenario scheduler implements, in software, the schedule chosen by the scenario compiler. That is, it manages application resources (data—placed into shared and external memory—and functions) that are partitioned among the multiple processor cores contained within the target device.
  • the scheduler ensures the correct sequencing and synchronization of functions and data that are in use within each of the multiple processors.
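A hedged sketch of what the emitted C scheduler might look like in its simplest form (all identifiers are hypothetical; the real generated program is far richer): the schedule chosen at compile time is frozen into a table, and the run-time code walks it, dispatching each function to the sub-processor core it was assigned to.

```c
#include <assert.h>
#include <stddef.h>

typedef void (*kernel_fn)(void);

typedef struct {
    int       core;   /* which sub-processor runs this step            */
    kernel_fn fn;     /* function placed on that core by the compiler  */
} sched_step;

static int run_count;                      /* demo side effect */
static void demo_kernel(void) { run_count++; }

/* Execute, in schedule order, the steps assigned to a given core. */
static int run_core(const sched_step *tab, size_t n, int core)
{
    int executed = 0;
    for (size_t i = 0; i < n; i++)
        if (tab[i].core == core) { tab[i].fn(); executed++; }
    return executed;
}
```

Because the table is data, repartitioning functions among cores is a recompilation of the scenario, not a rewrite of the scheduler.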
  • the scenario compiler 348 also generates code to correctly access peripherals and DMA controllers as referenced by SDL-specific language features based on information about the target heterogeneous multi-processor(s) provided to the scenario compiler 348 .
  • the SDL allows for describing a collection of functionality used in the present invention.
  • the Scenario Description Language (SDL) is a language created for the purpose of creating high-level, abstract descriptions of scenarios and the applications contained within.
  • SDL is compact, human-readable, and scalable.
  • SDL provides language syntax and semantics to describe:
      • the flow of data into and out of the sub-processors and between functions executing on the sub-processor;
      • the amount of storage required to stream data through the applications executing on the sub-processor;
      • the priority of each application to facilitate the creation of a functionally correct schedule that satisfies latency requirements;
      • the amount of data (and its type) produced and consumed by each function;
      • the maximum (worst case) execution time of each function, which is used in the creation of the schedule; and
      • the placement of each function onto W- or N-type sub-processors.
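Since SDL itself is proprietary and its syntax is not given here, the per-function attributes it describes can at least be mirrored as plain C data. The field names below are invented for illustration, together with a trivial check that worst-case execution times fit a frame budget.

```c
#include <assert.h>

typedef enum { CORE_W, CORE_N } core_type;

/* Invented mirror of the attributes SDL is said to describe. */
typedef struct {
    int       bytes_in;      /* data consumed per invocation          */
    int       bytes_out;     /* data produced per invocation          */
    int       stream_bytes;  /* storage needed to stream data through */
    int       priority;      /* used when building the schedule       */
    int       wcet_cycles;   /* maximum (worst case) execution time   */
    core_type placement;     /* W- or N-type sub-processor            */
} sdl_function_desc;

/* A chain of functions fits a frame only if the sum of their
 * worst-case times stays within the frame's cycle budget. */
static int fits_budget(const sdl_function_desc *fns, int n, int budget)
{
    int total = 0;
    for (int i = 0; i < n; i++)
        total += fns[i].wcet_cycles;
    return total <= budget;
}
```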
  • FIG. 4 shows, in conceptual form, the process of hierarchical software compilation, in block flow form, in accordance with a method of the present invention.
  • the blocks that are shown shaded, such as blocks 402 , 412 , 422 , 420 , 424 , 428 , 440 , 430 - 438 , are based on a sub-processor and need be executed thereby, whereas the remaining blocks, having no shading, are independent of the sub-processors.
  • FIG. 4 shows the way in which software is compiled for use by the processor 22 .
  • the scenario compiler 348 of FIG. 3 is the scenario compiler block 418 of FIG. 4 , and the scenario description block 416 , the adjust scenario description and optional optimization block 408 , and the processor-specific data block 422 serve as support for the scenario compiler block 418 .
  • the controller/compiler block 424 is the same as 350 of FIG. 3 .
  • the optimizing assembler block 432 , which receives input from the assembly code block 430 , is the same as the optimizing assembler 352 of FIG. 3 , and the low level assembler and linker block 434 of FIG. 4 is the same as the low level assembler 354 of FIG. 3 .
  • the description of these blocks will not be repeated.
  • the block 416 allows the programmer to meet his/her design goals without having to optimize either the high level code or the assembly code. Having the SDL allows for allocating a function from one block to another block at a high level.
  • the block 416 serves as a street map.
  • the adjust partitioning and kernels of FIG. 21 of the “PROGRAMMABLE PROCESSOR ARCHITECTURE” patent application incorporated herein by reference is the same as the block 408 .
  • the scenario description block 416 serves as input to the scenario compiler block 418 , as does the block 422 .
  • the output of the block 418 serves as input to the block 420 and the block 408 serves as input to the block 416 .
  • the block 416 describes inter-dependencies between the kernels 316 and applications 314 of FIG. 3 .
  • the SDL is used by the block 416 .
  • the optimizing block 410 is used to optimize high level code and assembly code.
  • the block 410 provides input to the existing assembly code block 412 and the existing high level code block 414 .
  • the scenario compiler block 418 receives two sets of information, one is a bottom-up set of information and another is a top-down set of information.
  • An example of the former is the kernels 316 , i.e., FFTs, DCTs, etc., provided through the processor-specific data block 422 to the block 418 .
  • An example of the latter is provided by the block 416 , which is programmed in SDL to serve as control code and for defining inter-dependencies of the kernels and requirements of the application.
  • An example of the requirements of the application is the length of time that can be used for processing a frame of information, among other time-related requirements.
  • the block 418 once provided with the foregoing top-down and bottom-up information, performs a best match process in the form of a schedule.
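The "best match" step can be illustrated with a standard list-scheduling routine from the existing art (which, per the text, the scenario compiler draws on). The bitmask dependency encoding and the priority tie-breaking below are illustrative choices, not the patent's actual algorithm.

```c
#include <assert.h>

/* deps[i] is a bitmask of kernels that must run before kernel i.
 * Writes the chosen execution order into out[]; returns how many
 * kernels were scheduled (fewer than n indicates a dependency cycle). */
static int list_schedule(const unsigned deps[], const int prio[],
                         int n, int out[])
{
    unsigned done = 0;
    int scheduled = 0;
    while (scheduled < n) {
        int best = -1;
        for (int i = 0; i < n; i++) {
            if (done & (1u << i)) continue;             /* already placed */
            if ((deps[i] & done) != deps[i]) continue;  /* not ready yet  */
            if (best < 0 || prio[i] > prio[best]) best = i;
        }
        if (best < 0) return scheduled;  /* cycle: cannot make progress */
        out[scheduled++] = best;
        done |= 1u << best;
    }
    return scheduled;
}
```

Priorities here play the role the text assigns to application priorities: among ready kernels, the higher-priority one is placed first.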
  • the schedule, for example, provides information regarding the inter-dependencies of the sub-processors, the execution of which requires synchronization code for the control circuit of a sub-processor.
  • the schedule information and synchronization information are provided by the block 420 , which receives input from the block 418 .
  • the output of the block 420 is provided as input to the block 424 . Having the block 420 receive its input from the block 418 is generally not done by prior art techniques due to their design/hardware limitations.
  • the hardware architecture based on sub-processors, as shown in previous figures and the referenced patent document, allows for scheduling and synchronization after the block 418 performs its operation. This allows for the control circuit of each sub-processor to be the same as the others and for the code to be transportable.
  • the non-native compilation and simulation block 428 is for compiling in the absence of a processor; that is, during development, while the hardware is not yet ready, compilation is performed in a “non-native” environment, whereas the native simulation block 440 operates in the native environment.
  • the block 428 allows for both assembly and high level code compilation while a native compiler or the actual compiler to be ultimately employed is not yet ready.
  • an off-the-shelf compiler, i.e., non-native, may be employed and combined with assembly code for simulation. This is sub-processor specific.
  • the time consumed executing the kernels 316 and the time consumed executing control code compete for processing resources.
  • the output of the existing assembly functions block 402 serves as input to the existing assembly code 412 , which also receives input from the block 410 .
  • the output of the block 412 serves as input to the block 432 , which also receives input from the block 430 .
  • the output of the block 432 serves as input to the block 434 and the block 434 serves as input to the block 436 , which provides input to the native simulation block 440 and the implementation complete block 438 .
  • the block 440 provides input to a decision block 442 , which determines whether or not design goals have been met; if so, the block 438 is performed and, if not, either the code is optimized by the block 410 or the scenario description is adjusted by the block 408 .
  • the output of the blocks 412 , 414 and 420 are all provided to the block 428 . That is, assembly code, high level code and schedule and synchronization code are all provided to the block 428 for execution or simulation.
  • the output of the block 428 is provided to a decision block 426 for determining whether or not design goals are met; if so, the process is exited; otherwise, the scenario may be adjusted by the block 408 .
  • the output of the block 414 is provided to the block 424 for compilation thereof and the output of the latter is provided to the block 430 .
  • optimization is done on a partition-basis. That is, high level code is optimized separately from assembly code and from SDL. For example, assembly code is optimized by the block 432 , high level code is optimized by the block 410 and SDL is optimized by the block 416 . This is a divide and conquer approach allowing advantageous optimization of each type of code that is not attainable without such a division of code.
  • the block 424 receives high level code and compiles the same but outputs assembly code to the block 430 , which is optimized by the block 432 .
  • the output of the block 432 is provided to the block 434 for creation of still further low level code and the output of the block 434 is provided to the block 436 for generation of binary object code to be used by a sub-processor.
  • the assembly code that is written by the programmer is provided from the block 412 to the block 432 for assembling.
  • FIG. 5 shows, in conceptual form, the process of and apparatus for latency verification used in assembly coding and included within the block 434 of FIG. 4 in accordance with a method and apparatus of the present invention.
  • instructions 502 are shown to be provided to the block 434 , which will also be referred to as the latency verification block. That is, the current instruction is referred to as “instruction n”, while the instruction previous to the latter is referred to as “instruction n−1”, the instruction previous to that is referred to as “instruction n−2”, the instruction previous to that is referred to as “instruction n−3” and so on. That is, an n−1 instruction is delayed by a program cycle from an n instruction and so on.
  • the block 434 performs various functions, shown in FIG. 5 in blocks or diamond shapes. Such functions include identifying, at 508 , all instructions directly contributing to instruction n, which is then used to determine latency rules relevant to the instructions at 506 , and the result is provided to a decision block at 510 for identifying the earliest instruction. If the earliest instruction is identifiable, the process continues to 514 , checking the latency rule against the register value latencies, and if any error results, the process continues to 516 , at which time an error is reported. If no error is detected at 514 , the next instruction is processed at 518 . The process of FIG. 5 is done during compilation and serves as a check for the programmer.
  • a database of rules is used to determine what the actual latencies are.
  • this determination is made because previous instructions are not necessarily known. For example, the first instruction of a sub-routine is one where its previous instruction is not necessarily known.
  • worst case possibilities are determined.
  • latency rules are checked against the register value latencies. A latency basically involves a delayed or previous instruction. That is, the programmer's annotation is compared to the rules for latency and, if there is a mismatch, an error is reported at 516 . An example of the programmer's annotation is discussed hereinbelow.
  • a computer processor implements strict read-after-write behavior for a register when an instruction that reads the register always obtains the value written by the most recent previously executed instruction that writes the register.
  • To implement strict read-after-write behavior for a register, it is sometimes necessary to delay the execution of an instruction that reads the register by one or more clock cycles from when it would otherwise execute. An instruction so delayed is said to be stalled for the one or more clock cycles of delay.
  • For some processors, particularly those designed for high-performance digital signal processing and related embedded systems, the complexity of the processor has been reduced by not implementing strict read-after-write behavior for all registers. For such a processor, when an instruction reads a register it may not obtain the value written by the most recently executed instruction to write the register, but instead an older value of the register.
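A toy model, not based on any specific processor, of this non-strict read-after-write behavior: a write becomes visible only a fixed number of cycles later, so a read issued too soon observes the older register value.

```c
#include <assert.h>

enum { LATENCY = 2 };   /* illustrative write-to-read latency, in cycles */

typedef struct {
    int value;        /* currently visible register value     */
    int pending;      /* value still in flight                */
    int ready_cycle;  /* cycle at which the pending value lands */
    int has_pending;
} reg_model;

static void reg_write(reg_model *r, int v, int cycle)
{
    /* commit an older in-flight write first, if it has matured */
    if (r->has_pending && cycle >= r->ready_cycle)
        r->value = r->pending;
    r->pending = v;
    r->ready_cycle = cycle + LATENCY;
    r->has_pending = 1;
}

static int reg_read(reg_model *r, int cycle)
{
    if (r->has_pending && cycle >= r->ready_cycle) {
        r->value = r->pending;
        r->has_pending = 0;
    }
    return r->value;   /* the older value, if read too early */
}
```

This deterministic-but-surprising behavior is exactly what the latency rules document, and what the annotation scheme below lets the assembler verify.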
  • the specific behavior is always deterministic and can be documented as a set of latency rules; for some processors these rules, taken together, are quite complex.
  • assembly language programmers have invariably been burdened with observing the latency rules entirely on their own without any automated verification from the programming tools that they are applying the rules correctly. When the latency rules are complex, programmers naturally make avoidable errors that may not be discovered until program testing.
  • an assembly language program is annotated by the programmer as follows:
  • a syntactic annotation is made in the program text to indicate which previous register value the programmer expects the instruction to obtain for the register.
  • the lack of an annotation is either an error or indicates a default assumption.
  • the default assumption could be that the expected value obtained for a register is the value written by the most recent previously executed instruction that writes the register (i.e., the same as strict read-after-write behavior).
  • an annotation is required. For example, if the programmer expects the value obtained for a register to be the value written by the n-th earlier instruction that writes the register (n>1), the annotation could be that a distinctive character be repeated n−1 times adjacent to the register denotation in the instruction. If, for example, this distinctive character is the dollar sign ($), then the assembly language instruction
  • Block 502 has an instruction n together with a sequence of earlier instructions, as previously discussed above. Accordingly, and given the hooks placed into the compiler and assembler of the present invention, programming is made simpler and programmer-friendly, with or without reducing power.
  • Block 434 determines whether the latency annotations are correct for instruction n for this path, while block 520 performs the other usual functions of an assembler for instruction n.
  • In block 508 , the earlier instructions that contribute to the inputs of instruction n are identified.
  • Block 506 determines, from the complete set of latency rules, those rules that are relevant to the interaction between each earlier instruction that contributes to the inputs of instruction n and instruction n itself. Where the instructions that may precede instruction n are unknown (for example, at the entrance to a subroutine), worst-case assumptions must be made (blocks 510 and 512 ). Finally, wherever the annotation made by the programmer does not match the behavior of the actual processor, as determined by the latency rules, an error is reported (blocks 514 and 516 ).
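The check in blocks 514 and 516 can be sketched as a comparison between the programmer's "$" annotation depth and what the latency rules say the hardware delivers. The rule lookup is reduced to a single integer here purely for illustration; the text above describes real rule sets as far more complex.

```c
#include <assert.h>
#include <string.h>

/* Count trailing '$' characters on an operand like "r4$$":
 * n-1 dollar signs means the n-th earlier write is expected. */
static int annotation_depth(const char *operand)
{
    size_t len = strlen(operand);
    int dollars = 0;
    while (len > 0 && operand[len - 1] == '$') { dollars++; len--; }
    return dollars + 1;   /* n: which earlier write is expected */
}

/* Returns 0 if the annotation matches the rule, nonzero (error)
 * otherwise.  hw_depth is which earlier write the latency rules
 * say the processor actually obtains for this read. */
static int check_annotation(const char *operand, int hw_depth)
{
    return annotation_depth(operand) != hw_depth;
}
```

An unannotated operand ("r4") encodes the default assumption of strict read-after-write behavior, i.e. n = 1.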

Abstract

One embodiment of the present invention includes a heterogeneous, high-performance, scalable processor having at least one W-type sub-processor capable of processing W bits or greater in parallel, W being an integer value, and at least one N-type sub-processor capable of processing N bits in parallel, N being an integer value smaller than W. A scenario compiler is included in a hierarchical flow of compilation and used with other compilation and assembler blocks to generate binary code based on different types of code to allow for efficient processing based on the sub-processors while maintaining low power consumption when the binary code is executed.

Description

    CROSS REFERENCE TO RELATED APPLICATION
  • This application claims the benefit of U.S. Provisional Patent Application No. 60/598,417, entitled “Quasi-Adiabatic Programmable Processor Architecture” and filed on Aug. 2, 2004 and is a continuation-in-part of U.S. patent application Ser. No. 11/180,068, filed on Jul. 12, 2005 and entitled “PROGRAMMABLE PROCESSOR ARCHITECTURE”, the disclosures of both of which are incorporated herein by reference as though set forth in full.
  • BACKGROUND OF THE INVENTION
  • 1. Field of the Invention
  • This invention relates generally to the field of processors and, more particularly, to processors having low power consumption, high performance and low die area, and capable of being flexibly and scalably employed in multimedia and communications applications.
  • 2. Description of the Prior Art
  • With the advent of the popularity of consumer gadgets, such as cell or mobile phones, digital cameras, iPods and personal digital assistants (PDAs), many new standards for communication with these gadgets have been adopted by the industry at large. Some of these standards include H264, MPEG4, UWB, Bluetooth, 2G/2.5G/3G/4G, GPS, MP3 and Security. However, an emerging problem is the use of different standards dictating communications of and between different gadgets, requiring tremendous development effort. One of the reasons for the foregoing problem is that no processor or sub-processor, currently available in the marketplace, is easily programmable for use by all digital devices and conforming to the various mandated standards. It is only a matter of time before this problem grows, as new trends in consumer electronics warrant even more standards adopted by the industry in the future.
  • One of the emerging, if not current, requirements of processors is low power consumption yet the ability to cause execution of code sufficient to process multiple applications. Current power consumption is on the order of sub-hundreds of milliwatts per application, whereas the goal is to be under sub-hundreds of milliwatts for executing multiple applications. Another requirement of processors is low cost. Due to the wide utilization of processors in consumer products, the processor must be inexpensive to manufacture; otherwise, its use in most common consumer electronics is not pragmatic.
  • To provide specific examples of current processor problems, the problems associated with RISCs, which are used in some consumer products, microprocessors, which are used in other consumer products, digital signal processors (DSPs), which are used in yet other consumer products, application specific integrated circuits (ASICs), which are used in still other consumer products, and some of the other well-known processors, each exhibiting a unique problem, are briefly described below. These problems, along with advantages of using each, are outlined below in a “Cons” section discussing the disadvantages thereof and a “Pros” section discussing the benefits thereof.
  • A. RISC/Super Scalar Processors
  • RISC and Super Scalar processors have been the most widely accepted architectural solution for all general purpose computing. They are often enhanced with application specific accelerators for solving certain specialized problems within the context of a general solution.
  • Examples include: ARM series, ARC series, StrongARM series, and MIPS series.
  • Pros:
      • Industry wide acceptance has led to a more matured tool chain and wide software choices.
      • A robust programming model has resulted from a very efficient automatic code generator used to generate binaries from high level languages like C.
      • Processors in the category are very good general purpose solutions.
      • Moore's Law can be effectively used for increasing performance.
  • Cons:
      • The general purpose nature of the architecture does not leverage common/specific characteristics of a set or sub-set of applications for better price, power and performance.
      • They consume moderate to high amounts of power with respect to the amount of computation provided.
      • Performance increase is mostly achieved at the expense of pipeline latency which adversely affects several multimedia and communication algorithms.
      • Complicated hardware scheduler, sophisticated control mechanisms and significantly reduced restrictions for more efficient automatic code generation for general algorithms have made this category of solutions less area efficient.
  • B. Very Long Instruction Word (VLIW) and DSPs
  • VLIW architectures eliminated some of the inefficiencies found in RISC and Super Scalar architectures to create a fairly general solution in the digital signal processing space. Parallelism was significantly increased. The onus of scheduling was transferred from hardware to software to save area.
  • Examples include: TI 64xx, TI 55xx, StarCore SC140, ADI SHARC series.
  • Pros:
      • Restricting the solution to the signal processing space improved 3P in comparison with RISC and Super Scalar architectures.
      • VLIW architectures provide higher level of parallelism relative to RISC and superscalar architectures.
      • An efficient tool chain and industry wide acceptance was generated fairly rapidly.
      • Automatic code generation and programmability are showing significant improvements as more processors designed for signal processing fall into this category.
  • Cons:
      • Although problem solving capability is reduced to the digital signal processing space, it is too broad for a general solution like a VLIW machine to have efficient 3P.
      • Control is both expensive and power consuming especially for primitive control code in many multimedia and communication applications.
      • Several power and area inefficient techniques were used to make automatic code generation easy. Strong reliance on these techniques by the software community is carrying forward this inefficiency from generation to generation.
      • VLIW architectures are not well suited for processing serial code.
  • C. Reconfigurable Computing
  • Several efforts in industry and academia over the last 10 years were focused towards making a flexible solution with ASIC like price, power and performance characteristics. Many have challenged existing and matured laws and design paradigms with little industry success. Most of the attempts have been in the direction of creating solutions based on coarser grain FPGA like architectures.
  • Pros:
      • Some designs, restricted to a specific application while providing needed flexibility within that application, proved to be price, power and performance competitive.
      • Research showed that such restricted yet flexible solutions can be created to address many application hotspots.
  • Cons:
      • Several designs in this space did not provide an efficient and easy programming solution and therefore were not widely accepted by a community adept in programming DSPs.
      • Automatic code generation from higher level languages like C was either virtually impossible or highly inefficient for many of the designs.
      • 3P advantage was lost when an attempt was made to combine heterogeneous applications using one type of interconnect and one level of granularity. Degree of utilization of the provided parallelism suffered heavily.
      • Reconfiguration overhead was significant in 3P for most designs.
      • In many cases, the external interface was complicated because the proprietary reconfigurable fabric did not match industry standard system design methodologies.
      • Reconfigurable machines are uni-processors and rely heavily on a tightly integrated RISC even for processing primitive control.
  • D. Array of Processors
  • Some recent approaches are focused on making reconfigurable systems better suited to process heterogeneous applications. Solutions in this direction connect multiple processors optimized for either one or a set of applications to create a processor array fabric.
  • Pros:
      • Different processors optimized for different sets of applications when connected together using an efficient fabric can help solve a wide range of problems.
      • A uniform scaling model allows a number of processors to be connected together as performance requirements increase.
      • Complex algorithms can be efficiently partitioned.
  • Cons:
      • Although performance requirements may be adequately answered, power and price inefficiencies are too high.
      • The programming model varies from processor to processor. This makes the job of the application developer much harder.
      • Uniform scaling of multiple processors is a very expensive and power consuming resource. This has been shown to display some non-determinism that may be detrimental to the performance of the entire system.
      • The programming model at the system level suffers from complexity of communicating data, code and control information without any shared memory resources—since shared memory is not uniformly scalable.
      • Extensive and repetitive glue logic required to connect different types of processors to a homogeneous network adds to the area inefficiencies, increases power and adds to the latency.
  • In light of the foregoing, there is a need for a low-power, inexpensive, efficient, high-performance, flexibly programmable, heterogeneous processor for allowing execution of one or more multimedia applications simultaneously.
  • SUMMARY OF THE INVENTION
  • Briefly, one embodiment of the present invention includes a heterogeneous, high-performance, scalable processor having at least one W-type sub-processor capable of processing W bits or greater in parallel, W being an integer value, and at least one N-type sub-processor capable of processing N bits in parallel, N being an integer value smaller than W. A scenario compiler is included in a hierarchical flow of compilation and used with other compilation and assembler blocks to generate binary code based on different types of code to allow for efficient processing based on the sub-processors while maintaining low power consumption when the binary code is executed.
  • IN THE DRAWINGS
  • FIG. 1 shows an application 10 with reference to a digital product 12 including an embodiment of the present invention.
  • FIG. 2 shows an exemplary integrated circuit 20 including a heterogenous, high-performance, scalable processor 22 coupled to a memory controller and direct memory access (DMA) circuit 24 in accordance with an embodiment of the present invention.
  • FIG. 3 shows, in conceptual form, an architecture 300 including software architecture 302 in combination with some of the hardware components 304 of the circuit 20 of FIG. 2.
  • FIG. 4 shows, in conceptual form, the process of hierarchical software compilation, in block flow form, in accordance with a method of the present invention.
  • FIG. 5 shows, in conceptual form, the process of and apparatus for latency verification used in assembly coding and included within the block 434 of FIG. 4 in accordance with a method and apparatus of the present invention.
  • DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENT
  • A sub-processor (“CoolProcessor”) is provided employing logic “macro-functional-units” (function-oriented dedicated logic), replacing the classical fixed datapath, complex execution unit, and register set used in general purpose CPUs and DSP engines, and also replacing the identical processing elements used in homogeneous multiprocessors.
  • As shown and described below with reference to FIG. 2, the sub-processor employs a heterogeneous interconnect between heterogeneous processors, designed to match multimedia and communications applications.
  • One embodiment of the present invention employs four sub-processors (referred to as “black boxes” or “processors” in the provisional application No. 60/598,417, entitled “Quasi-Adiabatic Programmable Processor Architecture”). In this patent document, a processor 22 comprises a plurality of sub-processors. The four sub-processors are split into two categories. The letter “W” designates CoolW sub-processors, capable of handling operands requiring wide datapaths. The CoolW sub-processor also supports a wider range of data bits. The sub-processor is further capable of executing 64-bit IEEE-standard floating-point instructions. Its performance is greater than 49 MFLOPS at 150 MHz. The floating-point instruction set includes addition, subtraction, and multiplication.
  • The letter “N” indicates a CoolN sub-processor serving narrow datapaths, such as those required for average-quality imaging and finite-field operations in communications. Each sub-processor comprises a heterogeneous, software-programmable datapath connecting compute engines (of one kind in the CoolW sub-processor type and of another kind in the CoolN sub-processor type). The internal compute engines are referred to as macro-functional units (MFUs). Multiple instances of the MFUs are nonuniformly distributed between the two types of sub-processors.
  • A control circuit within each sub-processor operates as an engine and is a high-level-language-programmable controller for the sub-processor. The control circuit is aided by a core sequencer, underscoring the hard-wired nature of the MFUs: each unit is aimed at executing efficiently only a fraction of the overall job. A rather large instruction memory, per sub-processor, holds code for the control circuit, the internal interconnects, I/O, and those MFUs requiring it. Sufficient shared buffer memory is provided to store operands and results of complex computations that make average demands on operand lifetime.
  • A general purpose processor (GPP, referred to as “ARM926” in the provisional application) runs system software and generic applications (applications other than multimedia and communications). The GPP includes its own instruction and data memory or cache.
  • The interconnect is based on the Sonics “smart” SoC bus. An SoC architecture can include any number of sub-processors, but the number of sub-processors defines the number of threads, as will become apparent shortly.
  • Referring now to FIG. 1, an application 10 is shown with reference to a digital product 12 including an embodiment of the present invention. FIG. 1 is intended to provide the reader with a perspective regarding some, but not necessarily all, of the advantages of a product, which includes an embodiment of the present invention relative to those available in the marketplace.
  • Accordingly, the product 12 is a converging product in that it incorporates all of the applications that need to be executed by today's mobile phone device 14, digital camera device 16, digital recording or music device 18 and PDA device 20. The product 12 is capable of executing one or more of the functions of the devices 14-20 simultaneously yet utilizing less power.
  • The product 12 is typically battery-operated and therefore consumes little power even when executing multiple of the applications executed by the devices 14-20. It is also capable of executing code to effectuate operations in conformance with a multitude of applications including but not limited to: H.264, MPEG-4, UWB, Bluetooth, 2G/2.5G/3G/4G, GPS, MP3 and security.
  • FIG. 2 shows an exemplary integrated circuit 20 including a heterogeneous, high-performance, scalable processor 22 coupled to a memory controller and direct memory access (DMA) circuit 24 in accordance with an embodiment of the present invention. As further shown in FIG. 2, the processor 22 is coupled to the interface circuit 26 through a general purpose bus 30, to the interface circuit 28 through a general purpose bus 31, and to a general purpose processor 32 through the bus 31. The circuit 20 further includes a clock, reset and power management circuit 34 for generating a clock utilized by the remaining circuits of the circuit 20, a reset signal utilized in the same manner, and circuitry for managing power for the same. There is further included in the circuit 20 a Joint Test Action Group (JTAG) circuit 36. JTAG is used as a standard for testing chips.
  • The interface circuit 26 shown coupled to the bus 30 and interface circuit 28, shown coupled to the bus 31, include the blocks 40-66, which are generally known to those of ordinary skill in the art and used by current processors.
  • The processor 22, which is a heterogeneous multi-processor, is shown to include shared data memory 70, shared data memory 72, a CoolW sub-processor (or block) 74, a CoolW sub-processor (or block) 76, a CoolN sub-processor (or block) 78 and a CoolN sub-processor (or block) 80. Each of the blocks 74-80 has associated therewith an instruction memory; for example, the CoolW block 74 has associated therewith an instruction memory 82, the CoolW block 76 has associated therewith an instruction memory 84, the CoolN block 78 has associated therewith an instruction memory 86 and the CoolN block 80 has associated therewith an instruction memory 88. Similarly, each of the blocks 74-80 has associated therewith a control block. The block 74 has associated therewith a control block 90, the block 76 has associated therewith a control block 92, the block 78 has associated therewith a control block 94 and the block 80 has associated therewith a control block 96. The blocks 74 and 76 are designed to generally operate efficiently for 16, 24, 32 and 64-bit operations or applications, whereas the blocks 78 and 80 are designed to generally operate efficiently for 1, 4, or 8-bit operations or applications.
  • The blocks 74-80 are essentially sub-processors and the CoolW blocks 74 and 76 are wide (or W) type of blocks, whereas, the CoolN blocks 78 and 80 are narrow (or N) type of blocks. Wide and narrow refers to the relative number of parallel bits processed or routed within a sub-processor and that gives the heterogeneous characteristic of the processor 22. Furthermore, the circuit 24 is coupled directly to one of the sub-processors, i.e. one of the blocks 74-80 resulting in the lowest latency path through the sub-processor to which it is coupled. In FIG. 2, the circuit 24 is shown directly coupled to the block 76 although it may be coupled to any of the blocks 74, 78 or 80. Higher priority agents or tasks may be assigned to the block which is directly coupled to the circuit 24.
  • It should be noted that while four blocks 74-80 are shown, other numbers of blocks may be utilized; however, utilizing additional blocks clearly results in additional die space and higher manufacturing costs.
  • Complicated applications requiring great processing power are not scattered in the circuit 20, rather, they are grouped or confined to a particular sub-processor or block for processing, which substantially improves power consumption by eliminating or at least reducing wire (metal) or routing lengths thereby reducing wire capacitance. Additionally, utilization is increased and activity is reduced contributing to lower power consumption.
  • The circuit 20 is an example of a system on chip (or SoC) offering Quasi-Adiabatic Programmable sub-processors for multimedia and communications applications. Two types of sub-processors are included, as previously indicated: W type and N type. The W type or wide type sub-processor is designed for high power, price and performance efficiency in applications requiring 16, 24, 32 and 64 bits of processing. The N type or narrow type sub-processor is designed for high efficiency in applications requiring 8, 4 and 1 bit of processing. While these bit widths are used in the embodiments of the present invention, by way of figures and description, other numbers of bits may be readily employed.
  • Different applications require different performance or processing capabilities and are thus executed by different types of blocks or sub-processors. Take, for instance, applications that are typically executed by DSPs; they would generally be processed by W type sub-processors, such as the blocks 74 or 76 of FIG. 2, because they characteristically include commonly occurring DSP kernels. Such applications include, but are not limited to, fast Fourier transform (FFT) or inverse FFT (IFFT), adaptive finite impulse response (FIR) filters, discrete cosine transform (DCT) or inverse DCT (IDCT), real/complex FIR filter, IIR filter, Root Raised Cosine (RRC) filter, color space converter, 3D bilinear texture mapping, Gouraud shading, Golay correlation, bilinear interpolation, median/row/column filter, alpha blending, higher-order surface tessellation, vertex shading (transform/light), triangle setup, full-screen anti-aliasing and quantization.
  • Other commonly occurring DSP kernels can be executed by N type sub-processors, such as blocks 78 and 80 and include, but are not limited to, Variable Length Codec, Viterbi Codec, Turbo Codec, Cyclic Redundancy Check, Walsh Code Generator, Interleaver/De-Interleaver, LFSR, Scrambler, De-spreader, Convolution Encoder, Reed-Solomon Codec, Scrambling Code Generator, and Puncturing/De-puncturing.
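  • As a non-limiting illustration of one of the commonly occurring DSP kernels listed above, a plain scalar C reference model of a real FIR filter is sketched below. The function name and loop structure are illustrative assumptions; on a W-type sub-processor this inner loop would instead be mapped onto the wide MFUs.

```c
#include <stddef.h>

/* Scalar reference model of a real FIR filter: y[i] = sum_k h[k]*x[i+k].
 * Computes n - taps + 1 output samples (no edge handling). */
static void fir_real(const float *x, size_t n,
                     const float *h, size_t taps, float *y)
{
    for (size_t i = 0; i + taps <= n; i++) {
        float acc = 0.0f;
        for (size_t k = 0; k < taps; k++)
            acc += h[k] * x[i + k];     /* multiply-accumulate step */
        y[i] = acc;
    }
}
```

A two-tap filter with coefficients {0.5, 0.5} reduces to a moving average, which makes the model easy to check against hand computation.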
  • Both W and N type sub-processors are capable of keeping net activity and the resulting energy per transition low while maintaining high performance with increased utilization, in comparison with existing architectural approaches such as RISC, reconfigurable, superscalar, VLIW and multi-processor architectures. The sub-processor architecture of the processor 22 reduces die size, resulting in an optimal processing solution, and includes a novel architecture referred to as the “Quasi-Adiabatic” or “COOL” architecture. Programmable processors in accordance therewith are referred to as Quasi-Adiabatic Programmable or COOL Processors.
  • Quasi-Adiabatic Programmable or COOL Processors optimize data path, control, memory and functional unit granularity to match a finite subset of applications, as described previously. The way in which this is accomplished will be clear relative to a discussion and presentation of figures relating to the different units or blocks or circuits and their inter-operations of the processor 22, as presented below.
  • “Quasi-Adiabatic Programmable” or Concurrent applications of heterOgeneous intercOnnect and functionaL units (COOL) Processors. In terms of thermodynamics, adiabatic processes do not waste heat and transfer all of the used energy to performing useful work. Due to the non-adiabatic nature of existing standard processes, circuit design, and logic cell library design techniques, one cannot ever make a truly adiabatic processor. However, among the different possible processor architectures, some may be closer to adiabatic than others. The various embodiments of the present invention show a class of processor architectures which are significantly closer to adiabatic as compared to the architectures of prior art, while they are, nevertheless, programmable. They are referred to as “Quasi-Adiabatic Programmable Processors”.
  • The integrated circuit 20 allows as many applications as can be supported by the resources within the processor 22 to be executed together or concurrently and the number of such applications far exceeds that which is supported by current processors. Examples of applications that can be simultaneously or concurrently executed by the integrated circuit 20 include but are not limited to downloading an application from a wireless device while decoding a movie that has been received, thus, a movie can be downloaded and decoded simultaneously. Due to achieving simultaneous application execution on the integrated circuit 20, which has a small die size or silicon real estate as compared to the number of applications it supports, costs of manufacturing the integrated circuit are significantly lower than that which is required for multiple devices of FIG. 1. Additionally, the processor 22 offers a single programmable framework to a user to implement multiple functions, such as multimedia complex applications. Of important value is the ability of the integrated circuit 20 and namely, the processor 22, to support future standards adopted by the industry, which are expected to be of greater complexity than that of today's standards.
  • Each of the blocks 74-80 can execute only one sequence (or stream) of programs at a given time. A program sequence refers to a function associated with a particular application. For example, an FFT is a type of sequence. However, different sequences may be dependent on one another. For example, an FFT program, once completed, may store its results in the memory 70 and the next sequence may then use the stored result. Different sequences sharing information in this manner, or being dependent upon each other in this manner, is referred to as “stream flow”.
  • In FIG. 2, the memories 70 and 72 each include 8 blocks of 16 kilobytes of memory; however, in other embodiments, different memory sizes may be utilized.
  • The instruction memories 82, 84, 86 and 88 are used to store instructions for execution by the blocks 74-80, respectively.
  • FIG. 3 shows, in conceptual form, an architecture 300 including software architecture 302 in combination with some of the hardware components 304 of the circuit 20 of FIG. 2. The hardware components 304 include the processor 32, the circuits 26 and 28, and the processor 22, as described and shown with respect to previous figures.
  • Included within the software architecture 302, a hardware abstraction layer or low level drivers 306 and an operating system driver 308 cause interfacing or communication between the hardware components 304 and the software architecture 302. The software architecture 302 is further shown to include a CoolBios (basic input output system) 310 coupled to the hardware components 304 and to a scenario 312, which is for causing multiple applications 314 to be executed, each application 314 including kernels 316 for execution of computationally intense functions, such as fast Fourier transforms (FFTs), DCTs, finite impulse response (FIR) filtering and others known in the industry. The software architecture 302 is further shown to include a system level software changes scenarios block 318, which is shown to communicate with an operating system interface (OSI) 322 and an operating system 320. The operating system 320 is further shown to communicate with the scenario 312, the applications 314, and the kernels 316. The kernels 316 are engines for execution of computationally intensive code, generally in assembly or low level code.
  • Each of the applications 314 includes many kernels, such as the kernels 316: conditional encoding (CE), cyclic redundancy coding (CRC), down sampling (DS), variable length coding (VLC), discrete cosine transform (DCT), motion estimation (ME), motion compensation (MC), etc., that consume most of the compute time in an application. The scenario-level software 312 contains hooks to quasi-statically change the execution pattern of applications contained within that scenario. The block 318 causes scenarios to be changed while running on the hardware 304. From a software perspective, each of the kernels 316 is written in assembly code for executing an FFT or other computationally intensive function, while the scenario 312 and each of the applications 314 are in a higher level language, such as “C”, for reasons that will become apparent shortly. For now, suffice it to say that the combination of assembly and a higher level language being executed on a CoolW or CoolN sub-processor and a control block included therein, as in the hardware architecture of FIG. 2, causes simultaneous or concurrent execution of applications, in a hierarchical manner and while maintaining low power consumption.
  • The CoolBios 310 includes a set of software functions that allow input and output communication with the processor 22 and eliminates the need for a full operating system running on the processor 22.
  • The hardware component 304 and software architecture 302 provide an environment to load and execute a multi-application scenario. A “scenario”, as referred to herein, is a set of applications, such as the applications 314, executing concurrently. Some examples of each of the applications 314, as shown in FIG. 3, include but are not limited to JPEG, MP3, H.264 and 802.11g. A scenario 312 interfaces with the operating system 320 and higher-level software through the OSI 322 and the drivers 308.
  • The software architecture 302 and the hardware components 304 of FIG. 3 allow an operating system (OS) to be loaded onto the processor 32 and the drivers 308 to ultimately allow a scenario 312 to be loaded for causing multiple applications to be executed concurrently.
  • The scenario 312 includes, in its header, overhead information to cause turning on or off each of the different applications 314. For example, the JPEG application can be turned off while the remaining applications, such as MP3, H.264 and 802.11g, remain on. This effectively aids in reducing power consumption, as the need for power is reduced when an application that is not currently being used is turned off. Remaining processing power, i.e. that which is not currently being used, may be devoted to executing a new application, with some limitations, as are now discussed.
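  • The on/off information carried in the scenario header can be pictured as one enable bit per application. The following C sketch is a hypothetical model only; the bit positions, names and structure are illustrative assumptions, not the actual scenario header format.

```c
#include <stdint.h>

/* Hypothetical scenario-header model: one enable bit per application.
 * All names and bit assignments are illustrative assumptions. */
enum {
    APP_JPEG   = 1u << 0,
    APP_MP3    = 1u << 1,
    APP_H264   = 1u << 2,
    APP_80211G = 1u << 3
};

typedef struct {
    uint32_t enabled;   /* bitmask of currently enabled applications */
} scenario_header;

/* Turn an application on or off without disturbing the others. */
static void app_set(scenario_header *s, uint32_t app, int on)
{
    if (on) s->enabled |= app;
    else    s->enabled &= ~app;
}

static int app_is_on(const scenario_header *s, uint32_t app)
{
    return (s->enabled & app) != 0;
}
```

Under this model, turning JPEG off while MP3, H.264 and 802.11g remain on is a single mask update, which matches the quasi-static (rather than fully dynamic) character of scenario changes described above.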
  • Essentially, there are three modes of operation within the software architecture 302. One is real-time mode, an example of which is 802.11g, which has hard time constraints. In this case, it is not feasible to add another application because a scenario 312 that includes an 802.11g application has compiled the latter, and in the presence of a pre-compiled application, a new application cannot be added. Generally, in the presence of applications having a timing constraint, a new application is not readily added, nor are scenarios readily changed dynamically, because doing so disturbs the processing balance; however, this is not an issue in mobile applications because scenarios are not readily changed in such applications.
  • The scenario 312 is pre-compiled and quasi-statically scheduled, which refers to turning applications on or off. The pre-compiled and scheduled scenario 312, which is in binary form, is then stored in one of the sub-processors, such as the sub-processor 74. Turning off an application prevents “choking” of the system; that is, bandwidth is improved.
  • The system level software changes scenarios block 318 causes changing of the scenario 312 which, as previously stated, may be done dynamically. The code in the block 318 is in “C” or another high level language. The scenario 312 is written in scenario descriptive language (SDL), which is a unique and proprietary language with all rights reserved by 3Plus1 Technology, Inc. of Saratoga, Calif.
  • On the right-hand side of FIG. 3, the hierarchical software/compiler characteristics of the architecture 300 are shown, in conceptual form, with reference to a software tools hierarchy column 340, a hierarchical level column 342 and a hardware hierarchical levels column 344. Each of the pieces of software of the software architecture 302 is taken through a different tool so as to avoid a flat methodology.
  • The drivers 306 and 308 are used as tools for the general purpose processor (GPP) 32 on the highest level of the tool column 340, while in the next level of the hierarchical tools, a scenario compiler 348 is used by an application programmer to allocate resources and to schedule execution on one or more particular sub-processors. The kernels 316 are then advantageously partitioned. An application is divided into smaller portions or threads, switching from one kernel to another.
  • The number of threads is limited to the number of sub-processors. Applications are handed from one kernel to another as follows: the kernel 316 that is currently operating finishes a particular function, saves the result of the function in shared memory and signals completion of its function, and then another kernel 316 utilizes the stored information in shared memory to perform another function. A synchronization code is used for this hand-off, which is done by the scenario 312; the particular tool is the scenario compiler 348 and the process is automated. Thus, synchronization and control code are generated automatically due to the presence of the threads.
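  • The hand-off pattern just described can be sketched in C as follows. This is a minimal illustrative model, not the generated synchronization code itself; the flag, buffer and function names are assumptions, and the shared array stands in for the shared data memories 70 and 72 of FIG. 2.

```c
/* Sketch of the kernel hand-off: a producer kernel finishes a function,
 * stores its result in shared memory, and signals completion; the next
 * kernel checks the trigger condition before consuming the result. */
static int shared_buf[4];      /* stands in for shared data memory  */
static volatile int done = 0;  /* completion flag set by producer   */

/* Producer kernel (e.g. an FFT sequence): compute, store, signal. */
static void kernel_a(void)
{
    for (int i = 0; i < 4; i++)
        shared_buf[i] = i * i;
    done = 1;                   /* signal completion of the function */
}

/* Consumer kernel: runs only when its trigger condition is met. */
static int kernel_b(void)
{
    if (!done)
        return -1;              /* trigger condition not yet satisfied */
    int sum = 0;
    for (int i = 0; i < 4; i++)
        sum += shared_buf[i];   /* consume the stored result */
    return sum;
}
```

In the actual system the trigger-condition check and the completion signaling are emitted automatically by the scenario compiler 348 rather than written by hand.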
  • In the next level of the tool hierarchy, as shown in the column 340, a controller/compiler 350 is used to compile the high level language being employed, such as “C”, and is complemented by two further parts, an optimizing assembler 352 and a low level assembler 354. The goal is to allow the programmer to write mostly C or high level code rather than assembly, as the former is easier. This is readily allowed for given the sub-processor and hierarchical architecture of the present invention. The compiler 350 is optimized for each sub-processor, such as CoolW or CoolN. That is, high level code written by a user or programmer is compiled, pursuant to certain rules, for storage and execution by a sub-processor and a control block located therein, as previously shown and discussed.
  • By changing scenarios, multiple applications can be performed; for example, digital camera and PDA functions can be performed in a single device simultaneously. The ability to do so comes at the cost of foregoing the dynamic ability to change or add a scenario, as might be done in a personal computer, but this limitation is completely tolerable, as a device that is to be used with a certain scenario need not normally be quickly reprogrammed to include another scenario in mobile handheld device applications.
  • By way of example, if a manufacturer introduces a product, such as a PDA, it is compiled along with other applications, such as a digital camera or MP3, etc., and a pre-compiled binary code is created using the hierarchical software tools compilation and the sub-processor-based hardware architecture of the present invention. Such pre-compiled code and multiple applications make up a scenario. While another scenario may be pre-compiled, it is a rare occurrence due to the reluctance of the manufacturer to quickly introduce another product. Given time, another product is likely to be introduced warranting another scenario, but the time to switch to another scenario is far from urgent.
  • In FIG. 3, column 342 states which part or component of the hardware is utilized for the corresponding tool of column 340. That is, viewed in a row, each location of the row within column 342 corresponds to a like location in column 340. Thus, the GPP IDE 346 is handled by the hardware 304, the scenario compiler 348 is handled by the processor 22, the compiler 350 is handled by a control block of one of the sub-processors, and the assemblers 352 and 354 are handled by one or more of the sub-processors, such as the sub-processor 74. The particular hardware hierarchical levels are correspondingly enumerated in column 344 by reference.
  • At the level of the low level assembler 354, scheduling is done and, additionally, all of the hardware components are available, whereas the optimizing assembler 352 includes more restrictions because it operates at a higher level, but it is able to schedule more. Area and power are saved by less scheduling. The hierarchical flow of column 340 and the hardware architecture of the processor of FIG. 2 allow for an efficient, low power and flexible processing tool. In the prior art, while the assembler 354 and the GPP IDE 346 are used, the remainder of the column 340 is not.
  • With continued reference to FIG. 3, hierarchical compilation involves partitioning the application code into a general purpose processor component that allows interaction between the processor 32 and all other hardware components. This general purpose processor component mainly allows switching between different scenarios.
  • Scenarios are compiled to run on a combination of multiple sub-processors that communicate through shared memory. The scenario compiler 348 is the tool that schedules the coarse grain data dependency graph wherein kernels and control code in one or more applications communicate with each other and with the controlling general-purpose processor. Dependencies are resolved to determine trigger conditions, based upon which synchronization code is generated to evaluate these conditions at run-time. The compiler 350 targets the subset of a sub-processor, or the control block located therein (such as the control block 90), that executes application control code and the scenario control and synchronization code.
  • The optimizing assembler 352 and the low-level assembler 354 target functions written in sub-processor assembly. They incorporate many scheduling techniques often found in higher level compilers such as register allocation and software pipelining. The binary software objects generated by these assemblers execute either on a CoolW or CoolN processor.
  • The scenario compiler 348 schedules the correct operation of the applications' functions and allocates data resources. The scenario compiler uses scheduling algorithms from the existing art to create the schedule. The scenario compiler 348 emits the source code (a compilable program, written in C) that implements the scenario scheduler. The scenario scheduler implements, in software, the schedule chosen by the scenario compiler. That is, it manages application resources (data—placed into shared and external memory—and functions) that are partitioned among the multiple processor cores contained within the target device. The scheduler ensures the correct sequencing and synchronization of functions and data that are in use within each of the multiple processors. The scenario compiler 348 also generates code to correctly access peripherals and DMA controllers as referenced by SDL-specific language features based on information about the target heterogeneous multi-processor(s) provided to the scenario compiler 348.
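  • To make the scenario scheduler concrete, the following sketch shows the kind of compilable C source the scenario compiler 348 might emit: a static table of scheduled functions, each guarded by a trigger condition evaluated at run time, visited in the order chosen by the compiler. All names here are illustrative assumptions, not the actual emitted code.

```c
#include <stddef.h>

/* One scheduled entry: a trigger condition and the function it guards. */
typedef struct {
    int  (*ready)(void);   /* trigger condition evaluated at run time */
    void (*run)(void);     /* scheduled kernel or control code        */
} sched_entry;

/* Illustrative entries for testing the scheduler loop. */
static int n_ran = 0;
static int always_ready(void) { return 1; }
static void count_run(void)   { n_ran++; }

/* One pass of the scheduler: visit entries in the statically chosen
 * order and run each whose trigger condition is satisfied. Returns
 * the number of functions that ran this pass. */
static int scheduler_step(const sched_entry *tab, size_t n)
{
    int ran = 0;
    for (size_t i = 0; i < n; i++) {
        if (tab[i].ready()) {
            tab[i].run();
            ran++;
        }
    }
    return ran;
}
```

The real emitted scheduler additionally manages data placement in shared and external memory and access to peripherals and DMA controllers, per the SDL description; the loop above only illustrates the sequencing-by-trigger-condition idea.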
  • The SDL provides a collection of functionality used in the present invention. The Scenario Description Language (SDL) is a language created for producing high-level, abstract descriptions of scenarios and the applications contained within them. SDL is compact, human-readable, and scalable. SDL provides language syntax and semantics to describe: the flow of data into and out of the sub-processors and between functions executing on the sub-processor; the amount of storage required to stream data through the applications executing on the sub-processor; the priority of each application, to facilitate the creation of a functionally correct schedule that satisfies latency requirements; the amount of data (and its type) produced and consumed by each function; the maximum (worst case) execution time of each function, which is used in the creation of the schedule; and the placement of each function onto W- or N-type sub-processors.
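  • The SDL syntax itself is proprietary and not reproduced here, but the per-function information it carries (enumerated above) can be modeled as plain data. The following C struct is an illustrative assumption only; its field names and the sample values are hypothetical.

```c
/* Hypothetical model of the per-function information an SDL
 * description carries. Field names and values are illustrative. */
typedef enum { SUB_W, SUB_N } sub_type;   /* W- or N-type placement */

typedef struct {
    const char *name;       /* function (kernel) name                 */
    int bytes_in;           /* amount of data consumed per invocation */
    int bytes_out;          /* amount of data produced per invocation */
    int wcet_cycles;        /* maximum (worst case) execution time    */
    int priority;           /* used when building the schedule        */
    sub_type placement;     /* target sub-processor type              */
} sdl_function;

/* Hypothetical entry: a 1024-point FFT placed on a W-type
 * sub-processor, with assumed data sizes and worst-case timing. */
static const sdl_function fft = {
    "fft1024", 4096, 4096, 25000, 1, SUB_W
};
```

The scenario compiler would consume a set of such records, together with the inter-function data-flow description, when constructing a schedule that satisfies the stated latency requirements.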
  • FIG. 4 shows, in conceptual form, the process of hierarchical software compilation, in block flow form, in accordance with a method of the present invention. In FIG. 4, the blocks that are shown shaded, such as blocks 402, 412, 420, 422, 424, 428, 430-438 and 440, are sub-processor based and need to be executed thereby, whereas the remaining blocks, having no shading, are independent of the sub-processors.
  • Generally, FIG. 4 shows the way in which software is compiled for use by the processor 22, and shows further details of some of the blocks of FIG. 3. Specifically, the scenario compiler 348 of FIG. 3 is the scenario compiler block 418 of FIG. 4, and the scenario description block 416, the adjust scenario description and optional optimization block 408, and the processor-specific data block 422 serve as support for the scenario compiler block 418. The controller/compiler block 424 is the same as the compiler 350 of FIG. 3. The optimizing assembler 432, which receives the assembly code block 430, is the same as the optimizing assembler 352 of FIG. 3, and the low level assembler and linker block 434 of FIG. 4 is the same as the low level assembler 354 of FIG. 3. Thus, the description of these blocks will not be repeated.
  • The block 418 allows the programmer to meet his/her design goals without having to optimize either the high level code or the assembly code. Having the SDL allows for allocating a function from one block to another block at a high level. The block 416 serves as a street map. The adjust partitioning and kernels block of FIG. 21 of the “PROGRAMMABLE PROCESSOR ARCHITECTURE” patent application, incorporated herein by reference, is the same as the block 408.
  • The scenario description block 416 serves as input to the scenario compiler block 418, as does the block 422. The output of the block 418 serves as input to the block 420, and the block 408 serves as input to the block 416. The block 416 describes inter-dependencies between the kernels 316 and applications 314 of FIG. 3. The SDL is used by the block 416. The optimizing block 410 is used to optimize high level code and assembly code. Thus, the block 410 provides input to the existing assembly code block 412 and the existing high level code block 414. The scenario compiler block 418 receives two sets of information, one being a bottom-up set of information and the other a top-down set of information. An example of the former is the kernels 316, i.e. FFTs, DCTs, etc., which are provided by the block 422 to the block 418, optionally assembled with power information during assembly. An example of the latter is provided by the block 416, which is programmed in SDL to serve as control code and to define the inter-dependencies of the kernels and the requirements of the application. An example of the requirements of the application is the length of time that can be used for processing a frame of information, or other time-related requirements.
  • The block 418, once provided with the foregoing top-down and bottom-up information, performs a best match process in the form of a schedule. The schedule, for example, provides information regarding the inter-dependencies of the sub-processors, execution of which requires synchronization code for the control circuit of a sub-processor. The schedule information and synchronization information are provided by the block 420, which receives input from the block 418. The output of the block 420 is provided as input to the block 424. Having the block 420 receive its input from the block 418 is generally not done by prior art techniques, due to their design/hardware limitations. That is, the hardware architecture based on sub-processors, as shown in the previous figures and the referenced patent document, allows for scheduling and synchronization after the block 418 performs its operation. This allows for the control circuit of each sub-processor to be the same as the others and for the code to be transportable.
  • The non-native compilation and simulation block 428 is for compiling in the absence of a processor; that is, during development, while the hardware is not yet ready, compilation is performed in a “non-native” environment, whereas the native simulation block 440 operates in the native environment. The block 428 allows for both assembly and high level code compilation while the native compiler, i.e. the actual compiler to be ultimately employed, is not yet ready. Thus, an off-the-shelf, i.e. non-native, compiler may be employed and combined with assembly code for simulation. This is sub-processor specific. The time consumed for executing the kernels 316 and the time consumed for executing control code compete with each other.
  • In FIG. 4, the output of the existing assembly functions block 402 serves as input to the existing assembly code block 412, which also receives input from the block 410. The output of the block 412 serves as input to the block 432, which also receives input from the block 430. The output of the block 432 serves as input to the block 434, and the block 434 serves as input to the block 436, which provides input to the native simulation block 440 and the implementation complete block 438. The block 440 provides input to a decision block 442, which determines whether or not the design goals have been met; if so, the block 438 is performed and, if not, either the code is optimized by the block 410 or the scenario description is adjusted by the block 408. The outputs of the blocks 412, 414 and 420 are all provided to the block 428. That is, assembly code, high level code and schedule and synchronization code are all provided to the block 428 for execution or simulation. The output of the block 428 is provided to a decision block 426 for determining whether or not the design goals are met; if so, the process is exited, otherwise, the scenario may be adjusted by the block 408. The output of the block 414 is provided to the block 424 for compilation thereof and the output of the latter is provided to the block 430.
  • Optimization is done on a partition basis. That is, high level code is optimized separately from assembly code and from SDL. For example, assembly code is optimized by the block 432, high level code is optimized by the block 410 and SDL is optimized by the block 416. This is a divide-and-conquer approach allowing advantageous optimization of each type of code that is not attainable without such a division of code. The block 424 receives high level code and compiles the same but outputs assembly code to the block 430, which is optimized by the block 432. The output of the block 432 is provided to the block 434 for creation of still further low level code and the output of the block 434 is provided to the block 436 for generation of binary object code to be used by a sub-processor. The assembly code that is written by the programmer is provided from the block 412 to the block 432 for assembling.
  • FIG. 5 shows, in conceptual form, the process of and apparatus for latency verification used in assembly coding and included within the block 434 of FIG. 4 in accordance with a method and apparatus of the present invention. In FIG. 5, instructions 502 are shown to be provided to the block 434, which will also be referred to as the latency verification block. The current instruction is referred to as “instruction n”, the instruction previous to the latter is referred to as “instruction n−1”, the instruction previous to that is referred to as “instruction n−2”, the one previous to that as “instruction n−3” and so on. That is, an instruction n−1 is delayed by a program cycle from an instruction n and so on.
  • The block 434 performs various functions, shown in FIG. 5 as blocks or diamond shapes. Such functions include identifying, at 508, all instructions directly contributing to instruction n, which result is then used at 506 to determine the latency rules relevant to the instructions, after which the process reaches a decision block at 510 for identifying the earliest instruction. If the earliest instruction is identifiable, the process continues to 514, where the latency rules are checked against the register value latencies, and if any error results, the process continues to 516, at which time an error is reported. If no error is detected at 514, the next instruction is processed at 518. The process of FIG. 5 is done during compilation and serves as a check for the programmer.
  • At 506, a database of rules is used to determine what the actual latencies are. At 510, a determination is made as to whether the earliest instruction is identifiable, because previous instructions are not necessarily known. For example, the first instruction of a sub-routine is one whose previous instruction is not necessarily known. At 512, worst case possibilities are determined. At 514, the latency rules are checked against the register value latencies. A latency basically refers to a delayed or previous instruction. That is, the programmer's annotation is compared to the latency rules and, if there is a mismatch, an error is reported at 516. An example of the programmer's annotation is discussed hereinbelow.
  • A computer processor implements strict read-after-write behavior for a register when an instruction that reads the register always obtains the value written by the most recent previously executed instruction that writes the register. To achieve strict read-after-write behavior for a register, it is sometimes necessary to delay the execution of an instruction that reads the register by one or more clock cycles from when it would otherwise execute. An instruction so delayed is said to be stalled for the one or more clock cycles of delay. The advantages of implementing strict read-after-write behavior for all registers are:
      • (1) The same sequence of instructions can execute correctly on a wider range of processor implementations, and
      • (2) Assembly language programming is made easier.
  • For some processors, particularly those designed for high-performance digital signal processing and related embedded systems, the complexity of the processor has been reduced by not implementing strict read-after-write behavior for all registers. For such a processor, when an instruction reads a register it may not obtain the value written by the most recently executed instruction to write the register, but instead an older value of the register. Although the specific behavior is always deterministic and can be documented as a set of latency rules, for some processors these rules are quite complex taken together. For processors of this kind, unfortunately, assembly language programmers have invariably been burdened with observing the latency rules entirely on their own without any automated verification from the programming tools that they are applying the rules correctly. When the latency rules are complex, programmers naturally make avoidable errors that may not be discovered until program testing.
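The consequence of dropping strict read-after-write behavior can be made concrete with a toy model (this model, including the two-cycle `LATENCY` constant, is an illustrative assumption, not a description of any particular processor in the text):

```python
# Illustrative model of a register file WITHOUT strict read-after-write
# behavior: a write becomes architecturally visible only LATENCY cycles
# after it issues, so a read in between observes an older value.

LATENCY = 2  # hypothetical write-to-read latency in cycles

class LatentRegisterFile:
    def __init__(self):
        self.committed = {}   # values visible to reads
        self.in_flight = []   # (cycle_due, reg, value) writes not yet visible
        self.cycle = 0

    def tick(self):
        """Advance one clock cycle and commit any writes that are now due."""
        self.cycle += 1
        pending = []
        for due, reg, val in self.in_flight:
            if due <= self.cycle:
                self.committed[reg] = val
            else:
                pending.append((due, reg, val))
        self.in_flight = pending

    def write(self, reg, val):
        self.in_flight.append((self.cycle + LATENCY, reg, val))

    def read(self, reg):
        return self.committed.get(reg, 0)

rf = LatentRegisterFile()
rf.write("r2", 7)       # issue a write of 7 to r2
rf.tick()
early = rf.read("r2")   # too soon: the write is not yet visible, old value read
rf.tick()
late = rf.read("r2")    # after LATENCY cycles the new value is visible
```

The deterministic-but-surprising `early` value is exactly the kind of behavior the latency rules document, and the kind of thing a programmer without tool support must track by hand.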
  • Latency Verification:
  • In FIG. 5, a process is described by which an assembler or other programming tool can verify that a processor's latency rules are being applied correctly by the programmer. First, an assembly language program is annotated by the programmer as follows:
  • For each register read by each instruction, a syntactic annotation is made in the program text to indicate which previous register value the programmer expects the instruction to obtain for the register. The lack of an annotation is either an error or indicates a default assumption. For example, the default assumption could be that the expected value obtained for a register is the value written by the most recent previously executed instruction that writes the register (i.e., the same as strict read-after-write behavior). Whenever the programmer expects a value different from the default assumption, an annotation is required. For example, if the programmer expects the value obtained for a register to be the value written by the n-th earlier instruction that writes the register (n>1), the annotation could be that a distinctive character be repeated n−1 times adjacent to the register denotation in the instruction. If, for example, this distinctive character is the dollar sign ($), then the assembly language instruction
      • add r1, $$r2, $r3
        would indicate that the programmer expects the value obtained for register r3 to be the value written by the second previous instruction to write r3, and the value obtained for register r2 to be the value written by the third previous instruction to write r2. In the above example, the current value of register r1, the value of register r2 from two values ago and the previous value of register r3 are being added. The assembler or block 434 checks to ensure that all of these values are available by performing the process of FIG. 5. It should be noted that the annotation need not be a dollar sign; rather, it can be any notation.
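Parsing this annotation is straightforward; the following sketch mirrors the dollar-sign convention described above (the function name and the assumption that registers are written `r` followed by digits are illustrative):

```python
import re

# Sketch of parsing the dollar-sign annotation: n-1 repeated '$' characters
# before a register name mean the programmer expects the value written by the
# n-th earlier writing instruction; no '$' is the default n=1 (most recent).

def parse_operand(token):
    """Return (register, n), where n=1 denotes the most recent write."""
    m = re.fullmatch(r"(\$*)(r\d+)", token)
    if m is None:
        raise ValueError(f"bad operand: {token}")
    dollars, reg = m.groups()
    return reg, len(dollars) + 1   # '$' -> n=2, '$$' -> n=3, ...

# The instruction "add r1, $$r2, $r3" from the text:
operands = [parse_operand(t) for t in ["r1", "$$r2", "$r3"]]
# -> [('r1', 1), ('r2', 3), ('r3', 2)]
```

The parsed `n` for each operand is what gets compared, in the process of FIG. 5, against the value the latency rules say the hardware will actually deliver.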
  • Given these annotations, for each instruction, the assembler or other programming tool automatically determines whether the programmer's expectations are correct, by examining the sequences of instructions that can execute previous to the given instruction along all paths leading to the given instruction, and applying the documented latency rules to these sequences. FIG. 5 shows one possible manifestation of this process as part of a modified assembler. Block 502 has an instruction n together with a sequence of earlier instructions, as previously discussed above. Accordingly, and given the hooks placed into the compiler and assembler of the present invention, programming is made simpler and more programmer-friendly, with or without reducing power.
  • Block 434 determines whether the latency annotations are correct for instruction n for this path, while block 520 performs the other usual functions of an assembler for instruction n. In block 508, the earlier instructions that contribute to the inputs of instruction n are identified. Block 506 determines, from the complete set of latency rules, those rules that are relevant to the interaction between each earlier instruction that contributes to the inputs of instruction n and instruction n itself. Where the instructions that may precede instruction n are unknown (for example, at the entrance to a subroutine), worst-case assumptions must be made (blocks 510 and 512). Finally, wherever the annotation made by the programmer does not match the behavior of the actual processor, as determined by the latency rules, an error is reported (blocks 514 and 516).
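A compressed sketch of the check in blocks 506 through 516 follows. It assumes a single hypothetical latency rule (a write is invisible to reads for `WRITE_LATENCY` cycles) standing in for the processor's full rule set, and an instruction encoding invented for the example:

```python
# Sketch of the annotation check of FIG. 5 under one assumed latency rule:
# for each annotated read, compute which earlier write the hardware actually
# delivers, then compare against the programmer's annotation n.

WRITE_LATENCY = 2  # assumed rule: a write is not visible on the next cycle

def verify(program):
    """program: list of (dest_reg_or_None, [(src_reg, expected_n), ...]),
    one entry per cycle. Returns error strings (empty if all annotations hold)."""
    writes = {}   # reg -> cycles at which it was written
    errors = []
    for cycle, (dest, sources) in enumerate(program):
        for reg, expected_n in sources:
            history = writes.get(reg, [])
            # Writes too recent to be visible are skipped over by the hardware.
            visible = [c for c in history if cycle - c >= WRITE_LATENCY]
            hidden = len(history) - len(visible)
            actual_n = hidden + 1          # which earlier value is obtained
            if history and actual_n != expected_n:
                errors.append(
                    f"cycle {cycle}: {reg} expects value n={expected_n}, "
                    f"hardware delivers n={actual_n}")
        if dest:
            writes.setdefault(dest, []).append(cycle)
    return errors

# r2 written at cycle 0, read at cycle 1 with the default annotation (n=1):
# the write is not yet visible, so the hardware delivers an older value (n=2).
errs = verify([("r2", []), (None, [("r2", 1)])])
```

A real implementation would additionally walk all control-flow paths into each instruction and fall back to worst-case assumptions at subroutine entries, as blocks 510 and 512 require; this sketch handles only a single straight-line sequence.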
  • Although the present invention has been described in terms of specific embodiments, it is anticipated that alterations and modifications thereof will no doubt become apparent to those skilled in the art. It is therefore intended that the following claims be interpreted as covering all such alterations and modifications as fall within the true spirit and scope of the invention.

Claims (14)

1. A software architecture for execution on a heterogeneous, high-performance, scalable processor having at least one W-type sub-processor capable of processing W bits, or more, in parallel, W being an integer value, and having at least one N-type sub-processor capable of processing N bits in parallel, N being an integer value and smaller than W, the software architecture comprising:
a scenario compiler for pre-compiling a scenario to create a binary code based on assembly code and high level language and scenario description language code, the scenario compiler including a plurality of applications, each application including one or more kernels, the scenario compiler pre-compiling the scenario for efficient execution thereof by a plurality of sub-processors, each sub-processor including a control circuit including high level code for execution thereof, the control circuit being a high level language programmable controller for the sub-processor,
wherein a hierarchical compilation of different types of programming codes allows for efficient binary code creation while reducing power consumption when the binary code is executed by the sub-processors.
2. A software architecture, as recited in claim 1, further including a schedule and synchronization block communicating with the scenario compiler and for generating code, based on scenario description language (SDL) to operate with one or more of the sub-processors.
3. A software architecture, as recited in claim 2, further including a high level language compiler block receiving input from the synchronization block for compiling high level code.
4. A software architecture, as recited in claim 3, further including an assembler block coupled to receive information from the high level language compiler block and from an assembly code block, which provides assembly code written by a user, the assembler block for assembling the assembly code and the information received from the high level language compiler block.
5. A software architecture, as recited in claim 4, further including a binary code block for generating binary code based on assembly code, high level code and SDL.
6. A software architecture, as recited in claim 5, further including a scenario description and optional optimization block coupled to the scenario description block and upon the generation of binary code, a user's design goals are verified and if the design goals are not met, the scenario description and optional optimization block modifies the scenario.
7. A software architecture, as recited in claim 6, wherein the sub-processors each include applications having kernels, the kernels being engines for execution of computationally intensive code.
8. A software architecture, as recited in claim 7, further including a scenario description block coupled to the scenario compiler block for generating SDL for describing inter-dependencies between the kernels.
9. A software architecture, as recited in claim 8, further including a low-level assembler and linker block coupled to the optimizing assembler block for assembling the lowest-level code.
10. A software architecture, as recited in claim 9, wherein the low-level assembler and linker block further includes a latency verification block responsive to an N number of previous instructions and a current instruction for verifying the presence of N number of previous instructions used by a user for instructions requiring previous instructions.
11. A software architecture, as recited in claim 10, wherein the latency verification block for verifying the user's instruction, which includes use of previous instructions, against latency rules.
12. A software architecture, as recited in claim 11, further including shared memory coupled to the sub-processors wherein the kernel of one of the sub-processors hands off to another sub-processor by placing, in the shared memory, information to be used by the another sub-processor.
13. A method of generating and executing code on a heterogeneous, high-performance, scalable processor having at least one W-type sub-processor capable of processing W bits, or more, in parallel, W being an integer value, and having at least one N-type sub-processor capable of processing N bits in parallel, N being an integer value and smaller than W, the method comprising:
pre-compiling a scenario to create a binary code based on assembly code and high level language and scenario description language code;
generating efficient binary code to be executed by the sub-processors based on applications including kernels, the kernels for executing computationally intensive code, the execution of the binary code by the sub-processors causing reduction of power consumption and flexible coding options to a user.
14. A method of generating and executing code, as recited in claim 13, further including performing latency verification to prevent a user from using erroneous previous instructions.
US11/195,429 2004-08-02 2005-08-02 Programmable processor architecture hirarchical compilation Abandoned US20060026578A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US11/195,429 US20060026578A1 (en) 2004-08-02 2005-08-02 Programmable processor architecture hirarchical compilation

Applications Claiming Priority (3)

Application Number Priority Date Filing Date Title
US59841704P 2004-08-02 2004-08-02
US11/180,068 US7721069B2 (en) 2004-07-13 2005-07-12 Low power, high performance, heterogeneous, scalable processor architecture
US11/195,429 US20060026578A1 (en) 2004-08-02 2005-08-02 Programmable processor architecture hirarchical compilation

Related Parent Applications (1)

Application Number Title Priority Date Filing Date
US11/180,068 Continuation-In-Part US7721069B2 (en) 2004-07-13 2005-07-12 Low power, high performance, heterogeneous, scalable processor architecture

Publications (1)

Publication Number Publication Date
US20060026578A1 true US20060026578A1 (en) 2006-02-02

Family

ID=35733871

Family Applications (5)

Application Number Title Priority Date Filing Date
US11/180,068 Active 2025-09-28 US7721069B2 (en) 2004-07-13 2005-07-12 Low power, high performance, heterogeneous, scalable processor architecture
US11/195,429 Abandoned US20060026578A1 (en) 2004-08-02 2005-08-02 Programmable processor architecture hirarchical compilation
US11/848,023 Abandoned US20070294511A1 (en) 2004-08-02 2007-08-30 Programmable Processor Architecture
US12/782,431 Active 2025-10-22 US8386751B2 (en) 2004-07-13 2010-05-18 Low power, high performance, heterogeneous, scalable processor architecture
US13/775,402 Active US8880850B2 (en) 2004-07-13 2013-02-25 Low power, high performance, heterogeneous, scalable processor architecture

Family Applications Before (1)

Application Number Title Priority Date Filing Date
US11/180,068 Active 2025-09-28 US7721069B2 (en) 2004-07-13 2005-07-12 Low power, high performance, heterogeneous, scalable processor architecture

Family Applications After (3)

Application Number Title Priority Date Filing Date
US11/848,023 Abandoned US20070294511A1 (en) 2004-08-02 2007-08-30 Programmable Processor Architecture
US12/782,431 Active 2025-10-22 US8386751B2 (en) 2004-07-13 2010-05-18 Low power, high performance, heterogeneous, scalable processor architecture
US13/775,402 Active US8880850B2 (en) 2004-07-13 2013-02-25 Low power, high performance, heterogeneous, scalable processor architecture

Country Status (2)

Country Link
US (5) US7721069B2 (en)
WO (1) WO2006017482A2 (en)

Cited By (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20070011199A1 (en) * 2005-06-20 2007-01-11 Microsoft Corporation Secure and Stable Hosting of Third-Party Extensions to Web Services
US20080082786A1 (en) * 2006-10-02 2008-04-03 William Stuart Lovell Super-scalable, continuous flow instant logic™ binary circuitry actively structured by code-generated pass transistor interconnects
US20080215768A1 (en) * 2006-10-24 2008-09-04 Alastair David Reid Variable coherency support when mapping a computer program to a data processing apparatus
US20080244507A1 (en) * 2007-03-30 2008-10-02 Microsoft Corporation Homogeneous Programming For Heterogeneous Multiprocessor Systems
US20100042751A1 (en) * 2007-11-09 2010-02-18 Kouichi Ishino Data transfer control device, data transfer device, data transfer control method, and semiconductor integrated circuit using reconfigured circuit
US20100162185A1 (en) * 2005-08-12 2010-06-24 University Of Sussex Electronic circuit design
US20100211760A1 (en) * 2009-02-18 2010-08-19 Egger Bernhard Apparatus and method for providing instruction for heterogeneous processor
US20130036408A1 (en) * 2011-08-02 2013-02-07 International Business Machines Corporation Technique for compiling and running high-level programs on heterogeneous computers
US20130185704A1 (en) * 2012-01-18 2013-07-18 International Business Machines Corporation Providing performance tuned versions of compiled code to a cpu in a system of heterogeneous cores
US8582654B1 (en) * 2009-09-30 2013-11-12 Hewlett-Packard Development Company, L.P. Generating a deblocked version of video frames using motion estimation
US8789063B2 (en) 2007-03-30 2014-07-22 Microsoft Corporation Master and subordinate operating system kernels for heterogeneous multiprocessor systems
US9652817B2 (en) 2015-03-12 2017-05-16 Samsung Electronics Co., Ltd. Automated compute kernel fusion, resizing, and interleave
US11755382B2 (en) * 2017-11-03 2023-09-12 Coherent Logix, Incorporated Programming flow for multi-processor system

Families Citing this family (71)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9330060B1 (en) 2003-04-15 2016-05-03 Nvidia Corporation Method and device for encoding and decoding video image data
US8660182B2 (en) * 2003-06-09 2014-02-25 Nvidia Corporation MPEG motion estimation based on dual start points
US7721069B2 (en) * 2004-07-13 2010-05-18 3Plus1 Technology, Inc Low power, high performance, heterogeneous, scalable processor architecture
US8731071B1 (en) * 2005-12-15 2014-05-20 Nvidia Corporation System for performing finite input response (FIR) filtering in motion estimation
US8724702B1 (en) 2006-03-29 2014-05-13 Nvidia Corporation Methods and systems for motion estimation used in video coding
US8660380B2 (en) 2006-08-25 2014-02-25 Nvidia Corporation Method and system for performing two-dimensional transform on data value array with reduced power consumption
US7463465B2 (en) * 2006-12-28 2008-12-09 General Electric Company Series arc fault current interrupters and methods
US8756482B2 (en) 2007-05-25 2014-06-17 Nvidia Corporation Efficient encoding/decoding of a sequence of data frames
US8726241B1 (en) * 2007-06-06 2014-05-13 Rockwell Collins, Inc. Method and system for the development of high-assurance computing elements
US9118927B2 (en) 2007-06-13 2015-08-25 Nvidia Corporation Sub-pixel interpolation and its application in motion compensated encoding of a video signal
KR101445794B1 (en) * 2007-06-18 2014-11-03 삼성전자주식회사 Programmable device for software defined radio terminal
US8873625B2 (en) 2007-07-18 2014-10-28 Nvidia Corporation Enhanced compression in representing non-frame-edge blocks of image frames
US8065506B2 (en) * 2007-08-17 2011-11-22 Texas Instruments Incorporated Application specific instruction set processor for digital radio processor receiving chain signal processing
US9081901B2 (en) * 2007-10-31 2015-07-14 Raytheon Company Means of control for reconfigurable computers
US20090171603A1 (en) * 2007-12-28 2009-07-02 Sriram Changali Methods of detecting series arcs in electrical signals
US8054591B2 (en) * 2008-07-24 2011-11-08 General Electric Company Arc detection using discrete wavelet transforms
WO2010016169A1 (en) * 2008-08-07 2010-02-11 日本電気株式会社 Multiprocessor system and method for controlling the same
US8755515B1 (en) 2008-09-29 2014-06-17 Wai Wu Parallel signal processing system and method
US8666181B2 (en) 2008-12-10 2014-03-04 Nvidia Corporation Adaptive multiple engine image motion detection system and method
US8159793B2 (en) * 2008-12-22 2012-04-17 General Electric Company Arc detection using detailed and approximate coefficients from discrete wavelet transforms
US8170816B2 (en) 2008-12-29 2012-05-01 General Electric Company Parallel arc detection using discrete wavelet transforms
CN102804165A (en) * 2009-02-11 2012-11-28 四次方有限公司 Front end processor with extendable data path
US8839214B2 (en) * 2010-06-30 2014-09-16 Microsoft Corporation Indexable type transformations
US20120084539A1 (en) * 2010-09-29 2012-04-05 Nyland Lars S Method and sytem for predicate-controlled multi-function instructions
WO2012052774A2 (en) * 2010-10-21 2012-04-26 Bluwireless Technology Limited Data processing units
US8683178B2 (en) * 2011-01-21 2014-03-25 Apple Inc. Sharing a fault-status register when processing vector instructions
US8533512B2 (en) * 2011-02-10 2013-09-10 International Business Machines Corporation Dynamic power and performance calibration of data processing systems
US9569771B2 (en) 2011-04-29 2017-02-14 Stephen Lesavich Method and system for storage and retrieval of blockchain blocks using galois fields
US9361479B2 (en) 2011-04-29 2016-06-07 Stephen Lesavich Method and system for electronic content storage and retrieval using Galois fields and geometric shapes on cloud computing networks
US9137250B2 (en) 2011-04-29 2015-09-15 Stephen Lesavich Method and system for electronic content storage and retrieval using galois fields and information entropy on cloud computing networks
US9037564B2 (en) 2011-04-29 2015-05-19 Stephen Lesavich Method and system for electronic content storage and retrieval with galois fields on cloud computing networks
KR101918464B1 (en) * 2011-09-14 2018-11-15 삼성전자 주식회사 A processor and a swizzle pattern providing apparatus based on a swizzled virtual register
US8959419B1 (en) * 2011-12-20 2015-02-17 Marvell International Ltd. Viterbi architecture for BDR/BLE
CN107220029B (en) 2011-12-23 2020-10-27 英特尔公司 Apparatus and method for mask permute instruction
WO2013095637A1 (en) 2011-12-23 2013-06-27 Intel Corporation Apparatus and method of improved permute instructions
WO2013095630A1 (en) 2011-12-23 2013-06-27 Intel Corporation Apparatus and method of improved extract instructions background
CN111831335A (en) 2011-12-23 2020-10-27 英特尔公司 Apparatus and method for improved insertion of instructions
US9946540B2 (en) 2011-12-23 2018-04-17 Intel Corporation Apparatus and method of improved permute instructions with multiple granularities
US8762808B2 (en) 2012-02-22 2014-06-24 Lsi Corporation Multi-processing architecture for an LTE turbo decoder (TD)
US20130311753A1 (en) * 2012-05-19 2013-11-21 Venu Kandadai Method and device (universal multifunction accelerator) for accelerating computations by parallel computations of middle stratum operations
US9292298B2 (en) 2013-07-08 2016-03-22 Arm Limited Data processing apparatus having SIMD processing circuitry
US9280845B2 (en) * 2013-12-27 2016-03-08 Qualcomm Incorporated Optimized multi-pass rendering on tiled base architectures
US9672043B2 (en) 2014-05-12 2017-06-06 International Business Machines Corporation Processing of multiple instruction streams in a parallel slice processor
US9665372B2 (en) 2014-05-12 2017-05-30 International Business Machines Corporation Parallel slice processor with dynamic instruction stream mapping
US9760375B2 (en) 2014-09-09 2017-09-12 International Business Machines Corporation Register files for storing data operated on by instructions of multiple widths
US9720696B2 (en) 2014-09-30 2017-08-01 International Business Machines Corporation Independent mapping of threads
US9977678B2 (en) 2015-01-12 2018-05-22 International Business Machines Corporation Reconfigurable parallel execution and load-store slice processor
US10133576B2 (en) 2015-01-13 2018-11-20 International Business Machines Corporation Parallel slice processor having a recirculating load-store queue for fast deallocation of issue queue entries
US10133581B2 (en) 2015-01-13 2018-11-20 International Business Machines Corporation Linkable issue queue parallel execution slice for a processor
US9983875B2 (en) 2016-03-04 2018-05-29 International Business Machines Corporation Operation of a multi-slice processor preventing early dependent instruction wakeup
US10037211B2 (en) 2016-03-22 2018-07-31 International Business Machines Corporation Operation of a multi-slice processor with an expanded merge fetching queue
US10346174B2 (en) 2016-03-24 2019-07-09 International Business Machines Corporation Operation of a multi-slice processor with dynamic canceling of partial loads
US10761854B2 (en) 2016-04-19 2020-09-01 International Business Machines Corporation Preventing hazard flushes in an instruction sequencing unit of a multi-slice processor
US10037229B2 (en) 2016-05-11 2018-07-31 International Business Machines Corporation Operation of a multi-slice processor implementing a load/store unit maintaining rejected instructions
US9934033B2 (en) 2016-06-13 2018-04-03 International Business Machines Corporation Operation of a multi-slice processor implementing simultaneous two-target loads and stores
US10042647B2 (en) 2016-06-27 2018-08-07 International Business Machines Corporation Managing a divided load reorder queue
US10318419B2 (en) 2016-08-08 2019-06-11 International Business Machines Corporation Flush avoidance in a load store unit
CN106681767B (en) * 2016-12-29 2020-07-10 广州华多网络科技有限公司 Light application adding method and device
US10387298B2 (en) 2017-04-04 2019-08-20 Hailo Technologies Ltd Artificial neural network incorporating emphasis and focus techniques
US11544545B2 (en) 2017-04-04 2023-01-03 Hailo Technologies Ltd. Structured activation based sparsity in an artificial neural network
US11238334B2 (en) 2017-04-04 2022-02-01 Hailo Technologies Ltd. System and method of input alignment for efficient vector operations in an artificial neural network
US11615297B2 (en) 2017-04-04 2023-03-28 Hailo Technologies Ltd. Structured weight based sparsity in an artificial neural network compiler
US11551028B2 (en) 2017-04-04 2023-01-10 Hailo Technologies Ltd. Structured weight based sparsity in an artificial neural network
WO2019039113A1 (en) * 2017-08-24 2019-02-28 ソニーセミコンダクタソリューションズ株式会社 Processor and information processing system
US20210173662A1 (en) * 2019-12-05 2021-06-10 International Business Machines Corporation Processor unit for multiply and accumulate operations
US11811421B2 (en) 2020-09-29 2023-11-07 Hailo Technologies Ltd. Weights safety mechanism in an artificial neural network processor
US11263077B1 (en) 2020-09-29 2022-03-01 Hailo Technologies Ltd. Neural network intermediate results safety mechanism in an artificial neural network processor
US11237894B1 (en) 2020-09-29 2022-02-01 Hailo Technologies Ltd. Layer control unit instruction addressing safety mechanism in an artificial neural network processor
US11874900B2 (en) 2020-09-29 2024-01-16 Hailo Technologies Ltd. Cluster interlayer safety mechanism in an artificial neural network processor
US11221929B1 (en) 2020-09-29 2022-01-11 Hailo Technologies Ltd. Data stream fault detection mechanism in an artificial neural network processor
US20230176863A1 (en) * 2021-12-03 2023-06-08 Taiwan Semiconductor Manufacturing Company, Ltd. Memory interface

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5878085A (en) * 1997-08-15 1999-03-02 Sicom, Inc. Trellis coded modulation communications using pilot bits to resolve phase ambiguities
US5909559A (en) * 1997-04-04 1999-06-01 Texas Instruments Incorporated Bus bridge device including data bus of first width for a first processor, memory controller, arbiter circuit and second processor having a different second data width
US6166748A (en) * 1995-11-22 2000-12-26 Nintendo Co., Ltd. Interface for a high performance low cost video game system with coprocessor providing high speed efficient 3D graphics and digital audio signal processing
US20020108104A1 (en) * 2000-09-13 2002-08-08 Xueshu Song Certification and manual compiling wizard
US20030014742A1 (en) * 2001-07-09 2003-01-16 Sasken Communication Technologies Limited Technique for compiling computer code to reduce energy consumption while executing the code
US20040078411A1 (en) * 2002-10-22 2004-04-22 Joshua Porten Galois field arithmetic unit for use within a processor
US20040123276A1 (en) * 2002-12-19 2004-06-24 Paul Knueven Assembly directives for the support of multi-language programming
US20070044079A1 (en) * 2005-06-02 2007-02-22 Tharas Systems Inc. A system and method for compiling a description of an electronic circuit to instructions adapted to execute on a plurality of processors

Family Cites Families (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
DE69124437T2 (en) * 1990-08-09 1997-07-03 Silicon Graphics Inc Method and device for reversing byte order in a computer
KR100248903B1 (en) 1992-09-29 2000-03-15 야스카와 히데아키 System and method for handling load and/or store operating in a superscalar microprocessor
US5867726A (en) * 1995-05-02 1999-02-02 Hitachi, Ltd. Microcomputer
US5987561A (en) 1995-08-31 1999-11-16 Advanced Micro Devices, Inc. Superscalar microprocessor employing a data cache capable of performing store accesses in a single clock cycle
US5784602A (en) * 1996-10-08 1998-07-21 Advanced Risc Machines Limited Method and apparatus for digital signal processing for integrated circuit architecture
US5951677A (en) * 1998-05-29 1999-09-14 Texas Instruments Incorporated Efficient hardware implementation of euclidean array processing in reed-solomon decoding
US6247158B1 (en) * 1998-11-30 2001-06-12 Itt Manufacturing Enterprises, Inc. Digital broadcasting system and method
US6282631B1 (en) * 1998-12-23 2001-08-28 National Semiconductor Corporation Programmable RISC-DSP architecture
US6832306B1 (en) * 1999-10-25 2004-12-14 Intel Corporation Method and apparatus for a unified RISC/DSP pipeline controller for both reduced instruction set computer (RISC) control instructions and digital signal processing (DSP) instructions
US6539467B1 (en) 1999-11-15 2003-03-25 Texas Instruments Incorporated Microprocessor with non-aligned memory access
US6732203B2 (en) * 2000-01-31 2004-05-04 Intel Corporation Selectively multiplexing memory coupling global bus data bits to narrower functional unit coupling local bus
JP2001338492A (en) * 2000-05-26 2001-12-07 Matsushita Electric Ind Co Ltd Semiconductor device and control method
US6922716B2 (en) * 2001-07-13 2005-07-26 Motorola, Inc. Method and apparatus for vector processing
US6986092B2 (en) * 2001-12-04 2006-01-10 Qualcomm Inc. Erasure-and-single-error correction decoder for linear block codes
US7360055B2 (en) * 2003-02-10 2008-04-15 Broadcom Corporation Two address map for transactions between an X-bit processor and a Y-bit wide memory
US7721069B2 (en) * 2004-07-13 2010-05-18 3Plus1 Technology, Inc Low power, high performance, heterogeneous, scalable processor architecture


Cited By (23)

Publication number Priority date Publication date Assignee Title
US20070011199A1 (en) * 2005-06-20 2007-01-11 Microsoft Corporation Secure and Stable Hosting of Third-Party Extensions to Web Services
US8849968B2 (en) 2005-06-20 2014-09-30 Microsoft Corporation Secure and stable hosting of third-party extensions to web services
US20100162185A1 (en) * 2005-08-12 2010-06-24 University Of Sussex Electronic circuit design
US7895560B2 (en) 2006-10-02 2011-02-22 William Stuart Lovell Continuous flow instant logic binary circuitry actively structured by code-generated pass transistor interconnects
US20080082786A1 (en) * 2006-10-02 2008-04-03 William Stuart Lovell Super-scalable, continuous flow instant logic™ binary circuitry actively structured by code-generated pass transistor interconnects
WO2008042186A2 (en) * 2006-10-02 2008-04-10 Lovell William S Information processing using binary gates structured by code-selected pass transistors
WO2008042186A3 (en) * 2006-10-02 2008-09-25 William S Lovell Information processing using binary gates structured by code-selected pass transistors
US20080215768A1 (en) * 2006-10-24 2008-09-04 Alastair David Reid Variable coherency support when mapping a computer program to a data processing apparatus
US8250549B2 (en) * 2006-10-24 2012-08-21 Arm Limited Variable coherency support when mapping a computer program to a data processing apparatus
US20080244507A1 (en) * 2007-03-30 2008-10-02 Microsoft Corporation Homogeneous Programming For Heterogeneous Multiprocessor Systems
US8789063B2 (en) 2007-03-30 2014-07-22 Microsoft Corporation Master and subordinate operating system kernels for heterogeneous multiprocessor systems
US20100042751A1 (en) * 2007-11-09 2010-02-18 Kouichi Ishino Data transfer control device, data transfer device, data transfer control method, and semiconductor integrated circuit using reconfigured circuit
US9710241B2 (en) * 2009-02-18 2017-07-18 Samsung Electronics Co., Ltd. Apparatus and method for providing instruction for heterogeneous processor
US20100211760A1 (en) * 2009-02-18 2010-08-19 Egger Bernhard Apparatus and method for providing instruction for heterogeneous processor
US8582654B1 (en) * 2009-09-30 2013-11-12 Hewlett-Packard Development Company, L.P. Generating a deblocked version of video frames using motion estimation
US8789026B2 (en) * 2011-08-02 2014-07-22 International Business Machines Corporation Technique for compiling and running high-level programs on heterogeneous computers
US8938725B2 (en) 2011-08-02 2015-01-20 International Business Machines Corporation Technique for compiling and running high-level programs on heterogeneous computers
US20130036408A1 (en) * 2011-08-02 2013-02-07 International Business Machines Corporation Technique for compiling and running high-level programs on heterogeneous computers
US8776035B2 (en) * 2012-01-18 2014-07-08 International Business Machines Corporation Providing performance tuned versions of compiled code to a CPU in a system of heterogeneous cores
US20130185704A1 (en) * 2012-01-18 2013-07-18 International Business Machines Corporation Providing performance tuned versions of compiled code to a cpu in a system of heterogeneous cores
US9195443B2 (en) 2012-01-18 2015-11-24 International Business Machines Corporation Providing performance tuned versions of compiled code to a CPU in a system of heterogeneous cores
US9652817B2 (en) 2015-03-12 2017-05-16 Samsung Electronics Co., Ltd. Automated compute kernel fusion, resizing, and interleave
US11755382B2 (en) * 2017-11-03 2023-09-12 Coherent Logix, Incorporated Programming flow for multi-processor system

Also Published As

Publication number Publication date
US8386751B2 (en) 2013-02-26
US20110131393A1 (en) 2011-06-02
WO2006017482A2 (en) 2006-02-16
US20070294511A1 (en) 2007-12-20
US20060015703A1 (en) 2006-01-19
WO2006017482A3 (en) 2007-11-15
US20130238878A1 (en) 2013-09-12
US8880850B2 (en) 2014-11-04
US7721069B2 (en) 2010-05-18

Similar Documents

Publication Publication Date Title
US20060026578A1 (en) Programmable processor architecture hirarchical compilation
US7278137B1 (en) Methods and apparatus for compiling instructions for a data processor
US20190004878A1 (en) Processors, methods, and systems for a configurable spatial accelerator with security, power reduction, and performance features
KR101713815B1 (en) A tile-based processor architecture model for high efficiency embedded homogeneous multicore platforms
Lodi et al. A VLIW processor with reconfigurable instruction set for embedded applications
US6826674B1 (en) Program product and data processor
US20130290693A1 (en) Method and Apparatus for the Automatic Generation of RTL from an Untimed C or C++ Description as a Fine-Grained Specialization of a Micro-processor Soft Core
Glossner et al. The sandbridge sb3011 platform
David et al. DART: A Dynamically Reconfigurable Architecture Dealing with Future Mobile Telecommunications Constraints.
US20220100680A1 (en) Apparatuses, methods, and systems for a configurable accelerator having dataflow execution circuits
Voitsechov et al. Inter-thread communication in multithreaded, reconfigurable coarse-grain arrays
Lodi et al. A pipelined configurable gate array for embedded processors
US9329872B2 (en) Method and apparatus for the definition and generation of configurable, high performance low-power embedded microprocessor cores
Sias et al. Enhancing loop buffering of media and telecommunications applications using low-overhead predication
Gesper et al. Evaluation of different processor architecture organizations for on-site electronics in Harsh environments
US20120017067A1 (en) On-demand predicate registers
Glossner et al. The sandbridge sandblaster communications processor
Podobas Accelerating parallel computations with openmp-driven system-on-chip generation for fpgas
Balfour Efficient embedded computing
Glossner et al. A multithreaded processor architecture for SDR
Vassiliadis et al. The arise approach for extending embedded processors with arbitrary hardware accelerators
Chen et al. Flexible heterogeneous multicore architectures for versatile media processing via customized long instruction words
Raghavan et al. Distributed loop controller for multithreading in unithreaded ILP architectures
Nguyen et al. Design and Implementation of a Coarse-grained Dynamically Reconfigurable Multimedia Accelerator
Jungeblut et al. A systematic approach for optimized bypass configurations for application-specific embedded processors

Legal Events

Date Code Title Description
AS Assignment

Owner name: 3PLUS1 TECHNOLOGY, INC., CALIFORNIA

Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:RAMCHANDRAN, AMIT;HAUSER, JR., JOHN REID;REEL/FRAME:016985/0701

Effective date: 20050802

STCB Information on status: application discontinuation

Free format text: ABANDONED -- FAILURE TO RESPOND TO AN OFFICE ACTION

AS Assignment

Owner name: 3PLUS1 TECHNOLOGY, INC., CALIFORNIA

Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:HAUSER, JOHN REID, JR.;RAMCHANDRAN, AMIT;SIGNING DATES FROM 20100806 TO 20100816;REEL/FRAME:024878/0807

AS Assignment

Owner name: ICELERO LLC, DELAWARE

Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNOR:3PLUS1 TECHNOLOGY INC.;REEL/FRAME:024892/0154

Effective date: 20100818

AS Assignment

Owner name: SATELLITE INVESTORS, LLC, DELAWARE

Free format text: SECURITY AGREEMENT;ASSIGNOR:ICELERO, LLC;REEL/FRAME:024900/0800

Effective date: 20100818