WO2007049150A2 - Architecture for microprocessor-based systems including simd processing unit and associated systems and methods - Google Patents

Architecture for microprocessor-based systems including simd processing unit and associated systems and methods Download PDF

Info

Publication number
WO2007049150A2
WO2007049150A2 (application PCT/IB2006/003358)
Authority
WO
WIPO (PCT)
Prior art keywords
instruction
pipeline
block
bit
instructions
Prior art date
Application number
PCT/IB2006/003358
Other languages
French (fr)
Other versions
WO2007049150A3 (en)
Inventor
Kar-Lik Wong
Carl Graham
Nigel Topham
Simon Jones
Aris Aristodemou
Yazid Nemouchi
Seow Chuan Lim
Original Assignee
Arc International (Uk) Limited
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Arc International (Uk) Limited filed Critical Arc International (Uk) Limited
Publication of WO2007049150A2 publication Critical patent/WO2007049150A2/en
Publication of WO2007049150A3 publication Critical patent/WO2007049150A3/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30 Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/30003 Arrangements for executing specific machine instructions
    • G06F9/30007 Arrangements for executing specific machine instructions to perform operations on data operands
    • G06F9/30018 Bit or string instructions
    • G06F9/30032 Movement instructions, e.g. MOVE, SHIFT, ROTATE, SHUFFLE
    • G06F9/30076 Arrangements for executing specific machine instructions to perform miscellaneous control operations, e.g. NOP
    • G06F9/38 Concurrent instruction execution, e.g. pipeline, look ahead
    • G06F9/3802 Instruction prefetching
    • G06F9/3808 Instruction prefetching for instruction reuse, e.g. trace cache, branch target cache
    • G06F9/3867 Concurrent instruction execution using instruction pipelines
    • G06F9/3875 Pipelining a single stage, e.g. superpipelining
    • G06F9/3877 Concurrent instruction execution using a slave processor, e.g. coprocessor
    • G06F9/3885 Concurrent instruction execution using a plurality of independent parallel functional units
    • G06F9/3887 Parallel functional units controlled by a single instruction for multiple data lanes [SIMD]
    • G06F9/3893 Parallel functional units controlled in tandem, e.g. multiplier-accumulator
    • G06F9/3895 Parallel functional units controlled in tandem for complex operations, e.g. multidimensional or interleaved address generators, macros
    • G06F9/3897 Parallel functional units controlled in tandem for complex operations, with adaptable data path
    • G06F13/00 Interconnection of, or transfer of information or other signals between, memories, input/output devices or central processing units
    • G06F13/14 Handling requests for interconnection or transfer
    • G06F13/20 Handling requests for interconnection or transfer for access to input/output bus
    • G06F13/28 Handling requests for access to input/output bus using burst mode transfer, e.g. direct memory access DMA, cycle steal
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T3/00 Geometric image transformation in the plane of the image
    • G06T3/40 Scaling the whole image or part thereof
    • G06T3/4007 Interpolation-based scaling, e.g. bilinear interpolation
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/10 Coding using adaptive coding
    • H04N19/102 Adaptive coding characterised by the element, parameter or selection affected or controlled by the adaptive coding
    • H04N19/117 Filters, e.g. for pre-processing or post-processing
    • H04N19/134 Adaptive coding characterised by the element, parameter or criterion affecting or controlling the adaptive coding
    • H04N19/136 Incoming video signal characteristics or properties
    • H04N19/14 Coding unit complexity, e.g. amount of activity or edge presence estimation
    • H04N19/169 Adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding
    • H04N19/17 The coding unit being an image region, e.g. an object
    • H04N19/176 The coding unit region being a block, e.g. a macroblock
    • H04N19/182 The coding unit being a pixel
    • H04N19/42 Coding characterised by implementation details or hardware specially adapted for video compression or decompression, e.g. dedicated software implementation
    • H04N19/43 Hardware specially adapted for motion estimation or compensation
    • H04N19/436 Implementation using parallelised computational arrangements
    • H04N19/50 Coding using predictive coding
    • H04N19/503 Predictive coding involving temporal prediction
    • H04N19/51 Motion estimation or motion compensation
    • H04N19/523 Motion estimation or motion compensation with sub-pixel accuracy
    • H04N19/60 Coding using transform coding
    • H04N19/61 Transform coding in combination with predictive coding
    • H04N19/80 Details of filtering operations specially adapted for video compression, e.g. for pixel interpolation
    • H04N19/82 Filtering operations involving filtering within a prediction loop
    • H04N19/85 Coding using pre-processing or post-processing specially adapted for video compression
    • H04N19/86 Pre-processing or post-processing involving reduction of coding artifacts, e.g. of blockiness

Definitions

  • The invention relates generally to microprocessor architectures, and more specifically to systems and methods for optimizing a microprocessor-based system including a SIMD processing unit.
  • Processor extension logic is utilized to extend a microprocessor's capability.
  • This logic sits in parallel with, and is accessible by, the main processor pipeline. It is often used to perform specific, repetitive, computationally intensive functions, thereby freeing up the main processor pipeline.
  • Processor extension logic may be implemented as an addition to supplement the capabilities of a main processor core. In some cases, this processor extension logic may be optimized to perform specific calculation intensive functions such as video processing and/or audio processing in order to free the main processor core to continue performing other functions.
  • Single instruction multiple data (SIMD) architectures are well suited to computation-intensive operations such as discrete cosine transforms (DCTs) and filters.
  • Data parallelism exists when a large mass of data of uniform type needs the same instruction performed on it.
  • A SIMD architecture exploits parallelism in the data stream, while a single instruction single data (SISD) architecture can only operate on data sequentially.
  • An example of an application that takes advantage of SIMD is one where the same value is being added to a large number of data points, a common operation in many media applications.
  • One example of this is changing the brightness of a graphic image.
  • Each pixel of the image may consist of three values for the brightness of the red, green and blue portions of the color.
  • The R, G and B values, or alternatively the YUV values, are read from memory, a value is added to each, and the resulting values are written back to memory.
  • A SIMD processor enhances performance of this type of operation over that of a SISD processor.
  • A reason for this improvement is that in SIMD architectures, data is understood to be in blocks, and a number of values can be loaded at once.
  • Instead of a series of instructions to incrementally fetch individual pixels, a SIMD processor will have a single instruction that effectively says "get all these pixels." Another advantage of SIMD machines is that multiple pieces of data are operated on simultaneously; a single instruction can say "perform this operation on all the pixels." Thus, SIMD machines are much more efficient at exploiting data parallelism than SISD machines.
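The brightness example above can be sketched in software. The two functions below are illustrative only: the first is the SISD-style per-value loop, the second processes a block of values per "wide" load/add/store; the lane width of 8 is an assumption, not something the source specifies.

```python
def brighten_scalar(pixels, delta):
    """SISD style: one fetch, one add, one store per pixel value."""
    out = []
    for p in pixels:
        out.append(min(p + delta, 255))  # clamp to the 8-bit pixel depth
    return out

def brighten_simd(pixels, delta, lanes=8):
    """SIMD style: one wide load, one wide add across all lanes, one wide store."""
    out = []
    for i in range(0, len(pixels), lanes):
        block = pixels[i:i + lanes]                    # "get all these pixels"
        block = [min(p + delta, 255) for p in block]   # one add applied to every lane
        out.extend(block)                              # store the whole block back
    return out
```

The results are identical; the difference the text describes is in how many instructions a real machine would issue to produce them.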
  • a video sequence consists of a number of still image frames presented in time sequence to create the appearance of continuous motion.
  • High quality video is usually comprised of thirty or more frames per second.
  • the amount of data required to represent even a single picture (still image) is derived by the frame's dimensions multiplied by the pixel depth.
  • 640x480 video with a pixel depth of 256, that is, 8 bits for each of the RGB or YUV elements of each pixel, would require 0.9216 megabytes per frame without compression.
  • At thirty frames per second that is a throughput of 27.648 Mbytes per second.
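The figures above follow directly from the frame dimensions and pixel depth; a quick check of the arithmetic (decimal megabytes, as the text uses):

```python
width, height = 640, 480
bytes_per_pixel = 3                      # 8 bits for each of three color elements
frame_bytes = width * height * bytes_per_pixel
frame_mb = frame_bytes / 1_000_000       # 0.9216 megabytes per frame
throughput_mb_s = frame_mb * 30          # 27.648 Mbytes/s at thirty frames per second
```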
  • Subsequent frames are often very similar in terms of their content, containing a lot of redundant data. When compressing video, this redundant data is removed to achieve data compression.
  • In video compression applications, motion compensation describes a current frame in terms of where each block of that frame came from in a previous frame. Motion compensation reduces the amount of data throughput required to reproduce video by describing frames by their measured change from previous and subsequent frames.
  • Various techniques exist for performing motion compensation. A first approach is to simply subtract a reference frame from a given frame. The difference is called the residual and usually contains less information than the original frame. Thus, rather than encoding the frame, only the residual is encoded. The residual can be encoded at a lower bit-rate without degrading the image quality. The decoder can reconstruct the original frame by simply adding the reference frame again.
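The subtract-and-re-add scheme described above can be sketched as follows, on flat pixel arrays for simplicity (a minimal illustration, not the patent's implementation):

```python
def residual(frame, reference):
    """Encoder side: the residual is the frame minus the reference, pixel by pixel."""
    return [f - r for f, r in zip(frame, reference)]

def reconstruct(res, reference):
    """Decoder side: adding the reference back recovers the original frame exactly."""
    return [d + r for d, r in zip(res, reference)]
```

When frames are similar, most residual values are near zero, which is why the residual compresses to a lower bit-rate than the frame itself.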
  • Another technique is to estimate the motion of the whole scene and the objects in a video sequence.
  • the motion is described by some parameters that have to be encoded in the bit-stream.
  • the blocks of the predicted frame are approximated by appropriately translated blocks of the reference frame. This gives more accurate residuals than a simple subtraction.
  • the bit-rate occupied by the parameters of the motion model can become quite large. This runs contrary to the goal of achieving high compression ratios.
  • Video frames are often processed in groups.
  • One frame (usually the first) is encoded without motion compensation, just as a normal still image, that is, without inter-frame prediction. This frame is called an I-frame or I-picture.
  • The other frames are called P-frames or P-pictures and are predicted from the I-frame or P-frame that comes (temporally) immediately before them.
  • the prediction schemes are, for instance, described as IPPPP, meaning that a group consists of one I-frame followed by four P-frames.
  • Frames can also be predicted from future frames.
  • the future frames then need to be encoded before the predicted frames and thus, the encoding order does not necessarily match the real frame order.
  • Such predicted frames are usually predicted from two directions, i.e. from the I- or P-frames that immediately precede or follow the predicted frame. These bidirectionally predicted frames are called B-frames.
  • Frames are partitioned into blocks of pixels (e.g. macroblocks of 16x16 pixels in MPEG). Each block is predicted from a block of equal size in the reference frame. The blocks are not transformed in any way apart from being shifted to the position of the predicted block. This shift is represented by a motion vector. The motion vectors are the parameters of this motion compensation model and have to be encoded into the bit-stream.
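Block-based motion compensation as described above amounts to copying a shifted block out of the reference frame; a minimal sketch (the function name and 16x16 default are illustrative, and no sub-pixel accuracy or boundary handling is shown):

```python
def predict_block(reference, x, y, mv, n=16):
    """Predicted block = the n x n block of the reference frame (a 2-D list
    of pixel rows) at position (x, y) shifted by the motion vector
    mv = (mvx, mvy); no transform other than the shift is applied."""
    mvx, mvy = mv
    return [[reference[y + mvy + r][x + mvx + c] for c in range(n)]
            for r in range(n)]
```

Only `mv` per block needs to be transmitted, plus the (usually small) residual between this prediction and the actual block.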
  • Existing block matching methods may be performed in software, or may be implemented by a special-purpose hardware device. Software implementations have the disadvantage of being slow, whereas hardware solutions often lack the flexibility needed to support a wide range of different video encoding standards. A specific problem associated with both software and hardware techniques is that of memory alignment.
  • the pixels of the reference frame should be retrieved from memory in groups of 8 or even 16.
  • blocks of pixels from the reference frame are not guaranteed to be located in memory at an address that is an integer multiple of 8. This may require non-aligned accesses, with extra hardware and additional memory access cycles, and is therefore one problem with existing methods.
  • Numerical computation algorithms, such as those common in video encoding/decoding, often require results to be clipped to lie within a specified range of values. For example, in video processing, a system will have a maximum pixel depth depending on the system's resolution. If the value of an intermediate calculation result, such as an interpolation or other calculation, lies outside the maximum value, the final result will have to be clipped to the saturation value, for example, the maximum pixel value.
  • Such a software clipping implementation incurs a high overhead due to the number of calculations required to test each value.
  • The sequential nature of a software implementation makes it very difficult to optimize in processors designed to exploit instruction level parallelism, such as, for example, reduced instruction set computer (RISC) machines or very long instruction word (VLIW) machines.
  • SIMD systems can require additional memory and registers to support data, which increases processor complexity and cost, or they share resources such as registers with processing units of the CPU. This can cause competition for resources, conflicts, pipeline stalls and other events that adversely affect overall processor performance.
  • A major disadvantage of SIMD architecture is the rigid requirement on data arrangement. The overhead of rearranging data in order to exploit data parallelism can significantly impact the speedup in computation, and can even negate the performance gain achievable by a SIMD machine in comparison to a conventional SISD machine. Also, attaching a SIMD machine as an extension to a conventional SISD machine can cause various issues such as synchronization and decoupling. Thus, there exists a need for a SIMD-based architecture that fully exploits the advantages of parallelism without suffering from the design complexity or other shortcomings of conventional systems.
  • At least one embodiment of the invention provides a method of dynamically decoupling a parallel extended processor pipeline from a main processor pipeline.
  • the method according to this embodiment comprises sending an instruction from the main processor pipeline to the parallel extended processor pipeline instructing the parallel extended processor pipeline to operate autonomously, operating the parallel extended processor pipeline autonomously, storing subsequent instructions from the main processor pipeline to the parallel extended processor pipeline in an instruction queue, executing an instruction with the parallel extended processor pipeline to cease autonomous execution, and thereafter executing instructions supplied by the main processor pipeline in the queue.
  • At least one other embodiment of the invention provides a microprocessor architecture.
  • the microprocessor architecture according to this embodiment comprises a main instruction pipeline, and an extended instruction pipeline, wherein the main instruction pipeline is configured to issue a begin record instruction to the extended instruction pipeline, causing the extended instruction pipeline to begin recording a sequence of instructions issued by the main instruction pipeline.
  • Another embodiment of the invention provides a method for synchronization of multiple processing engines in an extended processor core.
  • the method comprises placing direct memory access (DMA) functionality in a single instruction multiple data (SIMD) pipeline, where the DMA functionality comprises a data-in engine and a data-out engine, and each DMA engine is allowed to buffer at least one instruction issued to it in a queue without stopping the SIMD pipeline.
  • the method may also comprise, when the DMA engine queue is full, and a new DMA instruction is trying to enter the queue, blocking the SIMD pipeline from executing any instructions that follow until the current DMA operation is complete, thereby allowing the DMA engine and SIMD pipeline to maximize parallel operation while still remaining synchronized.
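The queueing rule described in this embodiment can be modeled roughly as follows; the class name, method names, and queue depth of one are assumptions made for illustration only:

```python
class DMAEngine:
    """Toy model of one DMA engine's instruction queue: an instruction is
    buffered without stalling the SIMD pipeline; when the queue is full,
    issue() reports that the pipeline would have to block until the
    current DMA operation completes."""
    def __init__(self, depth=1):
        self.pending = []
        self.depth = depth

    def issue(self, instr):
        """Return True if accepted without stalling, False if the SIMD
        pipeline must block before this instruction can enter the queue."""
        if len(self.pending) < self.depth:
            self.pending.append(instr)
            return True
        return False

    def complete_one(self):
        """Model the current DMA operation finishing, freeing a queue slot."""
        if self.pending:
            self.pending.pop(0)
```

This captures the trade-off the text describes: the engine and the pipeline run in parallel as long as the queue has room, and synchronization is enforced only at the point the queue overflows.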
  • Another embodiment of the invention provides a method of performing block matching with a systolic array.
  • the method comprises selecting an NxN target pixel block, selecting an NxN reference block from a starting point of an NxM reference block search space, propagating the target and reference blocks through N cycles to completely load the target and reference blocks into an array of systolic cells, computing a sum of absolute difference (SOAD) between a pixel of the target block and the reference block for each N rows of the array, saving the SOAD for the current reference block, incrementing to the next reference block of the NxM reference block search space and selecting a new NxN reference block, repeating the propagating, computing, saving and incrementing steps until all blocks in the reference search space have been tested, and selecting the block from the reference search space having the lowest SOAD as a motion vector for the target block.
  • Another embodiment of the invention may provide a method of causing a microprocessor to perform a clip operation.
  • The method according to this embodiment may comprise providing an assembly instruction to the microprocessor, the instruction comprising an input address, an output address and a controlling parameter, decoding the instruction with logic in the microprocessor, retrieving a data input from the input address, determining a specific clip operation based on the controlling parameter, performing the clip operation on the data input, and writing the result to the output address.
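A software model of such a clip instruction might look like the following; the controlling-parameter encodings are hypothetical, chosen only to show how a single parameter can select among clip ranges:

```python
def clip(value, mode):
    """Model of a clip instruction: `mode` plays the role of the controlling
    parameter that selects the clip operation.
    mode 0: saturate to the unsigned 8-bit pixel range [0, 255]
    mode 1: saturate to the signed 16-bit range [-32768, 32767]
    (These encodings are illustrative, not from the patent.)"""
    lo, hi = {0: (0, 255), 1: (-32768, 32767)}[mode]
    return max(lo, min(value, hi))
```

Done in hardware in a single instruction, this avoids the compare-and-branch sequence that makes software clipping hard to schedule on RISC or VLIW machines.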
  • a further embodiment of the invention provides a method of causing a microprocessor to perform a CODEC deblocking operation on a horizontal row of image pixels.
  • The method comprises providing a first instruction to the microprocessor having three 128-bit operands: the 16-bit components of a horizontal row of pixels in a YUV image as a first input operand, wherein the horizontal row of pixels is in image order and includes four pixels on either side of a pixel block edge; at least one filter threshold parameter as a second input operand; and a 128-bit destination operand for storing the output of the first instruction as a third operand; calculating an output value of the first instruction; and storing the output value of the first instruction in the 128-bit destination register.
  • the method according to this embodiment also comprises providing a second instruction to the microprocessor having three 128-bit operands comprising the first input operand of the first instruction as the first input operand, the output of the first instruction as a second input operand, and a destination operand of a 128-bit register for storing an output of the second instruction as the third operand, calculating an output value of the second instruction, and storing the output value in the 128-bit register specified by the destination operand of the second instruction.
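A highly simplified scalar model of this kind of edge filtering follows. The thresholds, taps, and the split of work between the two instructions here are illustrative assumptions in the style of the H.264 filter, not the filter the standard actually defines:

```python
def deblock_row(row, alpha, beta):
    """Filter one horizontal row of 8 pixels (p3..p0 | q0..q3) laid out
    in image order across a block edge. Simplified H.264-style test:
    only smooth the two edge pixels when the step across the edge looks
    like a blocking artifact (small jump) rather than a real feature."""
    p1, p0, q0, q1 = row[2], row[3], row[4], row[5]
    if abs(p0 - q0) < alpha and abs(p1 - p0) < beta and abs(q1 - q0) < beta:
        out = list(row)
        out[3] = (p1 + 2 * p0 + q0 + 2) >> 2   # filtered p0
        out[4] = (p0 + 2 * q0 + q1 + 2) >> 2   # filtered q0
        return out
    return list(row)
```

With thresholds 15/5, the artificial edge 10,10,10,10|20,20,20,20 is softened at p0 and q0; with a tighter alpha of 5 the row passes through unchanged.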
  • An additional embodiment according to the invention provides a method for accelerating sub-pixel interpolation in a SIMD processor.
  • the method according to this embodiment comprises determining a set of interpolated pixel values corresponding to sub-pixel positions horizontally between integer pixel locations, inputting the set of interpolated pixel values into an N lane SIMD data path, where N is an integer indicating the number of parallel, 16-bit data lanes of the SIMD processor, performing separate, simultaneous intermediate filter operations in each lane of the N lane data path, summing the intermediate results and right shifting the sums, and outputting a value of two adjacent sub-pixels located diagonally with respect to integer pixel locations and in the same row as the set of interpolated pixel values.
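The filter core can be illustrated with the six-tap kernel (1, -5, 20, 20, -5, 1) that H.264 uses for half-sample positions. The lane assignment and the diagonal two-pixel output described above are not modeled here — this sketch only shows the intermediate filter arithmetic for one output sample:

```python
TAPS = (1, -5, 20, 20, -5, 1)  # H.264 six-tap half-sample filter

def half_pel(samples):
    """Half-sample value from six neighboring integer-position pixels.
    The weighted sum is rounded, shifted right by 5 (divide by 32, since
    the taps sum to 32), and clipped to the 8-bit pixel range."""
    acc = sum(t * s for t, s in zip(TAPS, samples))
    return max(0, min(255, (acc + 16) >> 5))
```

A flat region interpolates to itself (six samples of 100 give 100), which is a quick sanity check that the taps and normalization agree.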
  • Figure 1 is a functional block diagram illustrating a microprocessor-based system including a main processor core and a SIMD media accelerator according to at least one embodiment of the invention
  • Figure 2 is a block diagram illustrating a conventional multistage microprocessor pipeline having a pair of parallel data paths
  • Figure 3 is a block diagram illustrating another conventional multiprocessor design having a pair of parallel processor pipelines;
  • Figure 4 is a block diagram illustrating a dynamically decoupleable multi-stage microprocessor pipeline according to at least one embodiment of the invention;
  • Figure 5 is a flow chart detailing the steps of a method for sending instructions for operating a main processor pipeline and an extended processor pipeline according to at least one embodiment of the invention
  • Figure 6 is a flow chart detailing the steps of a method for dynamically decoupling an extended processor pipeline from a main pipeline according to at least one embodiment of the invention.
  • Figure 7 is a code fragment containing an example of a processor extension instruction sequence that is issued to the processor extension in accordance with various embodiments of the invention
  • Figure 8 is a code fragment in which a processor extension instruction sequence is preloaded to a memory location and then run from that location by the processor extension in accordance with various embodiments of the invention
  • Figure 9 is a code fragment containing an example of an extension instruction sequence that is being issued and simultaneously captured and recorded in accordance with at least one embodiment of the invention.
  • Figure 10 is a flow chart of an exemplary method for recording instructions in an extended instruction pipeline and using such recorded instructions according to at least one embodiment of the invention.
  • Figure 11 is an instruction sequence flow diagram and corresponding event time line illustrating a method for synchronizing processing between DMA tasks and SIMD tasks according to at least one embodiment of the invention
  • Figure 12 is a flow chart detailing steps of an exemplary method for synchronizing multiple processing engines in a microprocessor according to various embodiments of the invention.
  • Figure 13 is a block diagram illustrating an architecture for a systolic array-based block matching system and method according to at least one embodiment of the invention.
  • Figure 14 is a diagram of a cell of a systolic array according to at least one embodiment of the invention.
  • Figure 15 is a block circuit diagram of the components of a systolic cell according to an embodiment of the invention.
  • Figure 16 is a flow chart of an exemplary method for performing block matching in accordance with at least one embodiment of the invention.
  • Figure 17 is a diagram of a systolic array and 64-bit word taken from the search space to be compared using the block matching method according to at least one embodiment of the invention.
  • Figure 18 is an exemplary 8x8 target pixel block and exemplary Nx8 search space comprised of (N-7) 64-bit search words that the target pixel block is matched against according to at least one embodiment of the invention.
  • Figure 19 is a diagram illustrating the components of a parameterizable clip instruction for either SISD or SIMD processor architectures according to at least one embodiment of the invention.
  • Figure 20 illustrates the format of a 32-bit parameter input to the parameterizable clip instruction of Figure 19 according to at least one embodiment of the invention
  • Figure 21 is a table illustrating the ways in which the parameters of the parameterizable clip instruction may be specified
  • Figure 22 is a flow chart of an exemplary method of performing a clip operation with a parameterizable clip instruction according to at least one embodiment of the invention
  • Figure 23 is a pair of SIMD instructions that are each pipelined to a single slot cycle with a three cycle latency for implementing the H.264 deblock filter operation on a horizontal line of pixels according to at least one embodiment of the invention
  • Figure 24 is a block diagram illustrating the contents of a 128-bit register containing the first input operand to the deblock instruction of Figure 23 according to at least one embodiment of the invention
  • Figure 25 is a block diagram illustrating the contents of a 128-bit register containing the second input operand to the deblock instruction of Figure 23 according to at least one embodiment of the invention
  • Figure 26 is a block diagram illustrating the contents of a 128-bit register containing the output of the first deblock instruction split into eight 16-bit fields which is used as the second input operand to the second deblock instruction according to at least one embodiment of the invention
  • Figure 27 is a pixel diagram illustrating the 4x8 block of pixels for processing with a pair of deblock instructions according to at least one embodiment of the invention
  • Figure 28 is a pair of single-cycle SIMD assembler instructions for implementing the VC1 deblock filter operation on a horizontal line of pixels according to at least one embodiment of the invention
  • Figure 29 is a block diagram illustrating the contents of a 128-bit register containing the first input operand to the deblock instruction of Figure 28 according to at least one embodiment of the invention
  • Figure 30 is a block diagram illustrating the contents of a 128-bit register containing the second input operand, the VC1 filter quantization parameter, to the deblock instruction of Figure 28 according to at least one embodiment of the invention
  • Figure 31 is a block diagram illustrating the contents of a 128-bit register containing the output of the first deblock instruction split into eight 16-bit fields with data in the first five fields which is used as the second input operand to the second deblock instruction according to at least one embodiment of the invention
  • Figure 32 is a schematic diagram illustrating a SIMD topology for performing homogeneously parallel mathematical operations
  • Figure 33 is a diagram illustrating an array of pixels for describing an implementation of a filter-based interpolation method for performing sub-pixel interpolation in the H.264 video codec standard according to at least one embodiment of the invention.
  • Figure 34 is a schematic diagram illustrating a SIMD topology for performing heterogeneously parallel mathematical operations according to at least one embodiment of the invention.
  • Figure 35 is a flow chart of an exemplary method for accelerating sub-pixel interpolation according to at least one embodiment of the invention.
  • Referring now to FIG. 1, a functional block diagram illustrating a microprocessor-based system 5 including a main processor core 10 and a SIMD media accelerator 50 according to at least one embodiment of the invention is provided.
  • the diagram illustrates a microprocessor 5 comprising a standard single instruction single data (SISD) processor core 10 having a multistage instruction pipeline 12 and a SIMD media engine 50.
  • the processor core 10 may be a processor core such as the ARC 700 embedded processor core available from ARC International Ltd. of Elstree, United Kingdom, and as described in provisional patent application number 60/572,238 filed May 19, 2004 entitled "Microprocessor Architecture" which is hereby incorporated by reference in its entirety.
  • the processor core may be a different processor core.
  • a single instruction issued by the processor pipeline 12 may cause up to sixteen 16-bit elements to be operated on in parallel through the use of the 128-bit data path 55 in the media engine 50.
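The effect of one such vector instruction can be sketched as a lane-wise loop. A vector register is modeled here as eight 16-bit lanes filling 128 bits (one possible split; the element count per issued instruction may differ from this), and `vaddw` is borrowed as the example mnemonic because it appears later in the text — its wraparound behavior here is an assumption:

```python
LANES, WIDTH = 8, 16           # 8 x 16-bit lanes = one 128-bit register (illustrative)
MASK = (1 << WIDTH) - 1

def vaddw(a, b):
    """Single issued instruction, parallel effect: add each pair of
    16-bit lanes, wrapping per lane as a fixed-width datapath would."""
    return [(x + y) & MASK for x, y in zip(a, b)]
```

One call stands in for one issue slot: all lanes update together, with no carry between lanes.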
  • the SIMD engine 50 utilizes closely coupled memory units.
  • the SIMD data memory 52 (SDM) is a 128-bit wide data memory that provides low latency access to perform loads to and stores from the 128-bit vector register file 51. The SDM contents are transferable via a DMA unit 54 thereby freeing up the processor core 10 and the SIMD core 50.
  • the DMA unit 54 comprises a DMA in engine 61 and a DMA out engine 62.
  • both the DMA in engine 61 and DMA out engine 62 may comprise instruction queues (labeled Q in the Figure) for buffering one or more instructions.
  • a SIMD code memory 56 allows the SIMD unit to fetch instructions from a localized code memory, allowing the SIMD pipeline to dynamically decouple from the processor core 10 resulting in truly parallel operation between the processor core and SIMD media engine as discussed in commonly assigned U.S. Patent Application No. XX/XXX,XXX, titled, "Systems and Methods for Recording Instruction Sequences in a Microprocessor Having a Dynamically Decoupleable Extended Instruction Pipeline," filed concurrently herewith, the disclosure of which is hereby incorporated by reference in its entirety.
  • the microprocessor architecture may permit the processor to operate in both closely coupled and decoupled modes of operation.
  • the SIMD program code fetch and program stream supply is exclusively handled by the processor core 10.
  • the SIMD pipeline 53 executes code from a local memory 56 independent of the processor core 10.
  • the processor core 10 may control the SIMD pipeline 53 to execute tasks such as audio processing, entropy encoding/decoding, discrete cosine transforms (DCTs) and inverse DCTs, motion compensation and de-block filtering.
  • DCTs discrete cosine transforms
  • the main processor pipeline 12 has been extended with a high performance SIMD engine 50 and two direct memory access (DMA) engines 61 and 62, one for moving data into a local memory, SIMD data memory (SDM), and one for moving data out of local memory.
  • the SIMD engine 50 and DMA engines 61, 62 are all executing instructions that are fetched and issued from the main processor core 10. To achieve high performance, these individual engines need to be able to operate in parallel, and hence, as discussed above, instruction queues (Q) are placed between the main processor core 10 and the SIMD engine 50, and between the SIMD engine 50 and the DMA engines 61, 62, so that they can all operate out of step of each other.
  • a local SIMD code memory (SCM) is introduced so that macros can be called and can be executed from these memories. This allows the main processor core, the SIMD engines and the DMA engines to execute out of step of each other.
  • Referring now to FIG. 2, a block diagram illustrating a conventional multistage microprocessor pipeline having a pair of parallel data paths is depicted.
  • data paths required to support different instructions typically have a different number of stages.
  • Data paths supporting specialized extension instructions for performing digital signal processing or other complex but repetitive functions may be used only some of the time during processor execution and remain idle otherwise. Thus, whether or not these instructions are currently needed will affect the number of effective stages in the processor pipeline.
  • Extending a general-purpose microprocessor with application specific extension instructions can often add significant length to the instruction pipeline. In the pipeline of Figure 2, pipeline stages F1 to F4 at the front end 100 of the processor pipeline are responsible for functions such as instruction fetch, decode and issue. These pipeline stages are used to handle all instructions issued by the microprocessor. After these stages, the pipeline splits into parallel data paths 110 and 115 incorporating stages E1-E3 and D1-D4 respectively. These parallel sub-paths represent pipeline stages used to support different instructions/data operations. For example, stages E1-E3 may be the primary/default processor pipeline, while stages D1-D4 comprise the extended pipeline designed for processing specific instructions.
  • This type of architecture can be characterized as coupled or tightly coupled to the extent that regardless of whether instructions are destined for default pipeline stages E1-E3 or extended pipeline D1-D4, they all must pass through stages F1-F4, until a decision is made as to which portion of the pipeline will perform the remaining processing steps.
  • the processor pipeline of Figure 2 achieves the advantage that instructions can be freely intermixed, irrespective of whether the instructions are executed by the data path in sub-paths E1-E3 or D1-D4. Thus, all instructions appear as a single thread of program execution.
  • This type of pipeline architecture also has the advantage of greatly simplified program design and debugging, thereby reducing the time to market in product developments. It is admittedly a highly flexible architecture. However, a limitation of this architecture is that the sequential nature of instruction execution significantly limits the exploitable parallelism between the data paths that could otherwise be used to improve overall performance. This negatively affects performance relative to other parallel pipeline architectures.
  • FIG. 3 is a block diagram illustrating another conventional multiprocessor architecture having a pair of parallel instruction pipelines.
  • the processor pipeline of Figure 3 contains a front end 120 comprised of stages F1-F4 and a rear portion 125 comprised of stages E1-E3.
  • the processor also contains a parallel data path having a front end 135 comprised of front end stages G1-G2 and rear portion 140 comprised of stages D1-D4.
  • this architecture contains truly parallel pipelines to the extent that both front portions 120 and 135 each can fetch instructions separately.
  • This type of parallel architecture may be characterized as loosely coupled or decoupled because the application specific extension data path G1-G2 and D1-D4 is autonomous and can execute instructions in parallel to the main pipeline consisting of F1-F4 and E1-E3.
  • This arrangement enhances exploitable parallelism over the architecture depicted in Figure 2.
  • mechanisms are required to synchronize their operations, as represented by dashed line 130.
  • These mechanisms are typically implemented using specific instructions and bus structures which are often not a natural part of a program and are inserted as after-thoughts to "fix" the disconnect between the main pipeline and the extended pipeline. As a consequence, the resulting program utilizing both instruction pipelines becomes difficult to design and optimize.
  • Referring now to FIG. 4, a block diagram illustrating a dynamically decoupleable multi-stage microprocessor pipeline according to at least one embodiment of the invention is provided.
  • the pipeline architecture according to this embodiment ameliorates at least some and preferably most or all of the above-noted limitations of conventional parallel pipeline architectures.
  • This exemplary pipeline depicted in Figure 4 consists of a front end portion 145 comprising stages F1-F4, a rear portion 150 comprising stages E1-E3, and a parallel extendible pipeline having a front portion 160 comprising stages G1-G2 and a rear portion 165 comprising stages D1-D4.
  • instructions can be issued from the CPU to the extendible pipeline Dl to D4.
  • a queue 155 is added between the two pipelines.
  • the queue serves to delay execution of instructions issued by the front end portion 145 of the main pipeline if the extension pipeline is not ready.
  • a tradeoff can be made during system design to decide on how many entries should be in the queue 155 to ensure that the extension pipeline is sufficiently decoupled from the main pipeline.
  • the main pipeline can issue a Sequence Run (vrun) instruction to instruct the extension pipeline to use its own front end 160, G1 to G2 in the diagram, to execute instruction sequences stored in a record memory 156, causing the extension pipeline to fetch and execute instructions autonomously.
  • vrun Sequence Run
  • the main pipeline can keep issuing extension instructions that accumulate in the queue 155 until the extension pipeline executes a Sequence Record End (vendrec) instruction. After the vendrec instruction is executed, the extension resumes executing instructions issued to the queue 155.
  • vendrec Sequence Record End
  • the pipeline depicted in Figure 4 is designed to switch between being coupled, that is, executing instructions for the main pipeline front end 145, and being decoupled, that is, during autonomous runtime of the extended pipeline.
  • the instructions vrun and vendrec, which dynamically switch the pipeline between the coupling states, can be designed to be lightweight, executing in, for example, a single cycle.
  • These instructions can then be seen as parallel analogs of the conventional call and return instructions. That is, when instructing the extension pipeline to fetch and execute instructions autonomously, the main processor pipeline is issuing a parallel function call that runs concurrently with its own thread of instruction execution to maximize speedup of the application.
  • the two threads of instruction execution eventually join back into one after the extension pipeline executes the vendrec instruction, which is the last instruction of the program thread autonomously executed by the extension pipeline.
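The coupled/decoupled switching above can be sketched as follows. This model runs the extension "pipeline" synchronously for determinism (on the hardware it runs concurrently with the main pipeline); the vrun/vendrec names come from the text, while the other mnemonics, the sentinel, and the dict-based record memory are invented for illustration:

```python
import queue

def extension_worker(q, record_memory, log):
    """Consume instructions from the coupling queue. A ('vrun', addr)
    entry switches to autonomous mode: fetch from local record memory
    until a 'vendrec' marker is reached, then fall back to the queue."""
    while True:
        insn = q.get()
        if insn is None:                       # sentinel: no more work
            return
        if insn[0] == "vrun":
            pc = insn[1]
            while record_memory[pc] != "vendrec":
                log.append(record_memory[pc])  # autonomous execution
                pc += 1
        else:
            log.append(insn[0])                # directly issued instruction
```

Issuing `("vaddw",), ("vrun", 0), ("vsubw",)` against a recorded macro `{0: "vmulw", 1: "vmacw", 2: "vendrec"}` executes vaddw, then the macro body, then the queued vsubw — the queued instruction waits for the macro's vendrec, matching the join described above.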
  • Another advantage of a processor pipeline containing a parallel extendible pipeline that can be dynamically coupled and decoupled is the ability to use two separate clock domains. In low power applications, it is often necessary to run specific parts of the integrated circuit at varying clock frequencies, in order to reduce and/or minimize power consumption.
  • the front end portion 145 of the main pipeline can utilize an operating clock frequency different from that of the parallel pipeline 165 of stages D1-D4 with the primary clock partitioning occurring naturally at the queue 155 labeled as Q in the Figure 4.
  • Referring now to FIG. 5, a flow chart of an exemplary method for sending instructions from a main processor pipeline to an extended processor pipeline according to at least one embodiment of the invention is depicted. Operation of the method begins in step 200 and proceeds to step 205, where an instruction is fetched by the main processor pipeline.
  • In step 210, because the instruction is determined to be one for processing by the parallel extended pipeline, the instruction is passed from the main pipeline to the parallel extended pipeline via an instruction queue coupling the two pipelines.
  • If the parallel extended pipeline is currently processing instructions from the queue, that instruction will be processed in turn by the parallel extended pipeline as specified in step 220. Otherwise, the instruction will remain in the queue until the parallel extended pipeline has ceased its autonomous operation.
  • In step 225, while the instruction is either sitting in the queue or being processed by the parallel pipeline, the main pipeline is able to continue processing instructions.
  • the queue provides a mechanism for the main pipeline to offload instructions to the parallel extended pipeline without stalling the main pipeline. Operation of the method stops in step 230.
  • Figure 6 is a flow chart of an exemplary method for dynamically decoupling an extended processor pipeline from a main pipeline according to at least one embodiment of the invention. Operation of the method begins in step 300 and proceeds to step 305, where the main processor pipeline sends a run instruction to the parallel extended pipeline via the instruction queue coupling the pipelines.
  • In step 310, the parallel pipeline retrieves the run instruction from the queue.
  • this run instruction will specify a location in a record memory accessible by the parallel extended pipeline of a starting location of a sequence of recorded instructions.
  • the parallel extended pipeline begins executing the series of recorded instructions, that is, it begins autonomous operation. In various embodiments, this comprises fetching and executing its own instructions independent of the main pipeline.
  • the parallel extended pipeline may operate at a clock frequency different from that of the main pipeline, such as, for example, a fraction thereof (e.g., ½, ¼, etc.).
  • the main processor pipeline can continue sending instructions to the parallel extended pipeline as depicted in step 320. Then, in step 325, after the parallel pipeline has processed an end instruction recorded at the end of the sequence of recorded instructions, autonomous operation of that pipeline ceases. In step 330, the parallel pipeline returns to the queue to process any queued instructions received from the main pipeline. In step 335, the parallel extended pipeline continues processing instructions issued by the main pipeline that appear in the queue until an instruction to begin autonomous operation is received.
  • Processor extensions typically support specialized instructions that greatly accelerate the computation required by the application that the instruction is designed for.
  • SIMD extension instructions can be added to a processor to improve performance of applications with a high degree of data parallelism.
  • the instructions can be issued directly from the CPU or main processor pipeline to the processor extension through a tightly coupled interface as discussed above in the context of Figure 2.
  • the CPU can preload the instructions into specific memory locations and the processor extension is then instructed by the CPU to fetch and execute the preloaded instructions from memory so that the processor extensions are largely decoupled from the CPU, as discussed in the context of Figure 3.
  • processor extension instructions are issued by the CPU (main processor pipeline) and dynamically captured into a processor extension memory or processor extension instruction buffer/queue for subsequent retrieval and playback.
  • processor extension instructions can optionally be executed by the processor extensions as they are captured and recorded.
  • an extension instruction sequence can be preloaded into some specific memory location from which the processor extension logic is directed to fetch such instructions, as shown in code fragment B in Figure 8.
  • In code fragment B the extension instruction sequence is preloaded to location L100 and then a Sequence Run (vrun) instruction is issued in statement L5 to direct the processor extension to fetch and execute the sequence.
  • vrun Sequence Run
  • each instruction has first to be loaded into a register in the CPU and then stored at the desired location, requiring at least 2 instructions (a load and a store).
  • If the extension instruction sequence is adaptive, that is, dependent upon the run-time conditions in the CPU, the preloading routine, referred to as the preloader, would need linking functionalities to modify the sequence while preloading. Such functionalities add to the preloading overhead.
  • An example of adaptation is L2 in code fragment A of Figure 7, in which a CPU register r10 is read in addition to the extension register vr01. The cumulative effect of all these overheads can significantly reduce application performance if the extension instruction sequences have to be dynamically reloaded relatively frequently, as is likely in video processing applications.
  • this invention introduces a scheme by which, instead of preloading, extension instruction sequences can be captured on-the-fly, that is, while such instructions are being issued from the CPU, and recorded to specific memory locations accessible by the extension logic.
  • the instructions being recorded can also be optionally executed by the processor extension, further reducing the recording overhead.
  • the CPU can read its own register r10 and its value is issued directly to the processor extension together with the instruction and recorded into the macro.
  • the breq instruction in statement L2A is actually a conditional branch instruction of the CPU that depends on the contents of the CPU registers r4 and r5. If this branch is taken, the vaddw instruction in statement L2B will not be issued to the processor extension and hence not recorded.
  • a mechanism is used to keep track of address locations in the SCM such that during the recording of subsequent additional instruction sequences, previous instruction sequences are not overwritten and such that different instruction sequence start addresses are maintained by the main processor core.
  • A further advantage of instruction recording over preloading is the elimination of the requirement to load the extension instruction sequences into the data cache using the preloader, which would have polluted the data cache and thereby reduced the overall efficiency of the CPU. Furthermore, by replacing the vrec instruction in statement L1A by the Sequence Record And Run (vrecrun) instruction, the instruction being captured and recorded is also executed by the processor extension and the overhead of instruction recording is thereby reduced or even minimized.
  • an instruction macro can be used in the same way as a preloaded instruction sequence and has the same benefits of code reuse and simplifying decoupled execution.
  • the record mechanism can coexist with the preloading mechanism, that is, the two mechanisms are not necessarily mutually exclusive. As an example, preloading may still be useful for preloading macros that do not require frequent reloading in runtime.
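The record-then-replay scheme can be sketched as a small state machine. The method names follow the patent's mnemonics (vrec, vrecrun via a flag, vendrec, vrun); the dict-based SCM, the instruction strings, and the class itself are illustrative simplifications:

```python
class ProcessorExtension:
    """Model of on-the-fly macro recording: between vrec(addr) and
    vendrec, every issued instruction is appended to the local SIMD
    code memory (SCM); with record-and-run it executes as well.
    vrun(addr) later replays the recorded macro."""
    def __init__(self):
        self.scm = {}               # local SIMD code memory
        self.executed = []
        self.rec_addr = None
        self.run_while_rec = False

    def vrec(self, addr, run=False):
        self.rec_addr, self.run_while_rec = addr, run

    def issue(self, insn):
        if self.rec_addr is not None:
            self.scm[self.rec_addr] = insn      # capture while issuing
            self.rec_addr += 1
            if self.run_while_rec:
                self.executed.append(insn)      # vrecrun: also execute
        else:
            self.executed.append(insn)

    def vendrec(self):
        self.scm[self.rec_addr] = "vendrec"     # terminate the macro
        self.rec_addr = None

    def vrun(self, addr):
        while self.scm[addr] != "vendrec":
            self.executed.append(self.scm[addr])
            addr += 1
```

Recording at distinct start addresses keeps earlier macros intact, mirroring the address-tracking requirement above; a single vrun then stands in for the whole sequence.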
  • In order to increase and ideally maximize flexibility, the processor extension can operate in one of two modes.
  • After executing the Sequence Run (vrun) instruction, the processor extension may switch to an autonomous mode in which it fetches and executes instructions in a pre-recorded macro on its own.
  • After executing the Sequence Record End (vendrec) instruction that signifies the end of an instruction macro, the processor extension may switch back to the normal operating mode, in which the CPU provides all further processor extension instructions.
  • this recording scheme combines all the benefits of direct instruction issuing and preloading.
  • Figure 10 is a flow chart of an exemplary method for recording instructions in an extended instruction pipeline and using such recorded instructions according to at least one embodiment of the invention.
  • the method begins in step 400 and proceeds to step 405, where the main processor pipeline issues a record instruction sequence instruction to the extended instruction pipeline.
  • this record sequence instruction will specify a starting memory address.
  • the extended pipeline begins recording the sequence of instructions following the record instruction in a memory structure accessible by the extended pipeline at the starting location specified in the record instruction. It should be appreciated that, as discussed herein, in step 410 the extended pipeline may also begin executing the sequence of instructions in addition to recording them.
  • In step 415, the main pipeline issues the record end instruction to the extended pipeline, causing the latter to stop recording the instruction sequence.
  • the extended instruction pipeline may record the end record instruction as the last instruction in the current sequence.
  • the main processor pipeline can call the instruction sequence with a single run instruction and effectively decouple the extended pipeline from the main pipeline, as exemplified in the remaining method steps of FIG. 10.
  • In step 425, the main processor pipeline calls the recorded instruction sequence. In various embodiments, as illustrated in Figures 8-9 and discussed in the corresponding description, this is accomplished by issuing a run instruction that specifies the start address of the instruction sequence. In this manner, different sequences may be called with the same run instruction by specifying different start addresses.
  • the main pipeline effectively decouples the extended pipeline so that the latter may begin fetching and executing instructions autonomously, as stated in step 430.
  • the extended pipeline has its own front end for this purpose.
  • the extended pipeline will continue operating in the autonomous mode, that is, independent of the main pipeline's fetch-execution cycles, until the "end" or "record end" instruction that was previously recorded at the end of the current instruction sequence is encountered. In various embodiments, this instruction will cause the extended pipeline to cease autonomous execution and, as stated in step 435, to resume executing instructions issued by the main pipeline via the queue.
  • a main processor pipeline is extended through a dynamically coupled parallel SIMD instruction pipeline.
  • the main processor pipeline may issue instructions to the extended pipeline through an instruction queue that effectively decouples the extended pipeline.
  • the extended SIMD pipeline is also able to run prerecorded macros that are stored in a local SIMD instruction memory so that a single macro instruction sent to the SIMD pipeline via the queue allows many pre-determined instructions to be executed.
  • This architecture allows the SIMD media engine (the extended pipeline) to operate in parallel with the primary pipeline (processor core) and allows the processor core to operate far in advance of the parallel SIMD pipeline.
  • the SIMD pipeline queue uses condition codes to notify the processor pipeline of the condition of the queue.
  • the SIMD queue sets a condition code of QF for queue nearly full whenever there are fewer than a predetermined number of empty slots remaining in the queue. In various embodiments, this number may be 16; in other embodiments, it may be different.
  • the SIMD queue sets a condition code of QNF as the opposite of QF when more than the predetermined number of slots remain available.
  • Rather than using several instructions to load these status values and test them before branching on the test result, two conditional branch instructions using these condition codes directly test for such conditions, thereby reducing the number of instructions required to perform this task.
  • these instructions will only branch when the condition code used is set.
  • these instructions may have the mnemonic "BQF" for branch when queue is nearly full and "BQNF" for branch when queue is not nearly full.
  • Such condition codes make the queue full status an integral part of the main processor programming model and make it possible to make frequent light-weight intelligent decisions by software to maximize overall performance. These condition codes are maintained by the queue itself based on the queue's status.
• the instructions that check the condition codes are branch instructions specified to check the particular condition codes.
  • checking of the condition code is done by placing condition code checking branch instructions where necessary, such as before issuing any instructions to the extended pipeline.
  • the condition codes provide an easy mechanism for preventing main pipeline stalls caused by trying to issue instructions to a full queue.
  • These two conditional branch instructions allow the main processor pipeline to regularly check the status of the queue before issuing more instructions into the extended SIMD pipeline queue.
  • the main processor core can use these instructions to avoid stalling the processor when the queue is full or nearly full, and branch to another task that does not involve the SIMD engine until these queue conditions change. Therefore, these instructions provide the processor with an effective and relatively low overhead means of scheduling work load on the available resources while preventing main pipeline stalls.
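The queue-status mechanism above can be sketched in software. The following Python model is illustrative only: the queue capacity of 64 is an assumption (the text fixes only the nearly-full threshold, 16 free slots, as an example), and the `qf`/`qnf` properties stand in for the hardware condition codes that the BQF/BQNF branches would test.

```python
from collections import deque

class SIMDInstructionQueue:
    """Models an instruction queue that maintains QF/QNF condition codes.

    The nearly-full threshold (16 free slots) follows the example in the
    text; the capacity of 64 is an assumed value for illustration.
    """

    def __init__(self, capacity=64, nearly_full_threshold=16):
        self.capacity = capacity
        self.threshold = nearly_full_threshold
        self.slots = deque()

    @property
    def qf(self):
        # QF is set when fewer than `threshold` empty slots remain.
        return (self.capacity - len(self.slots)) < self.threshold

    @property
    def qnf(self):
        # QNF is simply the opposite condition code.
        return not self.qf

    def issue(self, instruction):
        self.slots.append(instruction)

q = SIMDInstructionQueue()
for i in range(49):        # fill until only 15 free slots remain
    q.issue(("vadd", i))
assert q.qf                # a BQF branch would now divert to other work
```

In hardware the main pipeline would execute a BQF or BQNF instruction before each burst of SIMD issues, branching away when `qf` is set rather than stalling on a full queue.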
  • the DMA engines 61, 62 ( Figure 1) are placed in the SIMD pipeline 53 itself, but each DMA engine is allowed to buffer one or more instructions issued to it in a queue without stopping the SIMD pipeline execution.
• the SIMD engine pipeline 53 will be blocked from executing further instructions only when another DMA instruction arrives at the DMA engine while its queue is full. This allows the software to be re-organized so that SIMD code will not have to wait for a DMA operation to complete, or vice versa, as long as a double or deeper buffering approach is used, that is, two or more buffers are used to allow overlapping of data transfer and data computation.
  • each DMA channel is allowed to buffer at least one instruction in a queue.
• suppose there are two independent video pixel data blocks to be processed, each requiring multiple blocks of pixel data to be moved into local memory and processed before the results are moved out of local memory.
• this Figure illustrates an instruction sequence flow diagram 450 and corresponding event time line 455 illustrating a method for synchronizing processing between DMA tasks and SIMD tasks, with one-deep instruction queues in each DMA engine, according to at least one embodiment of the invention.
• the DI2 DMA operation is blocked if the buffered DI1 DMA operation is not completed, causing the DI2 DMA instruction to be blocked from entering the DMA instruction queue, which in turn results in the S1 SIMD operation being blocked. Since the S1 operation depends on data from the DI1 operation, the blocking action prevents the S1 SIMD instruction sequence from proceeding until the DI1 operation is completed.
• the DI3 DMA operation is executed only after S1 is completed.
• This approach avoids the need for the main processor core to intervene continuously in order to achieve synchronization between the DMA unit and the SIMD pipeline.
• the processor core 10 does need to ensure that the instruction sequence sent uses this functionality to achieve the best performance by parallelizing SIMD and DMA operations.
• an advantage of this approach is that it facilitates the synchronization of SIMD and DMA operations in a multi-engine video processing core with minimal intervention by the main control processor core.
  • This approach can be extended by increasing the depth of the DMA non-blocking instruction queue so as to allow more DMA instructions to be buffered in the DMA channels, allowing double, triple or more buffering.
  • FIG. 12 is a flow chart of an exemplary method for synchronizing multiple processing engines in a microprocessor-based system according to at least one embodiment of the invention.
  • Figure 12 demonstrates a method for coding the instruction sequence to allow both the SIMD engine and DMA engines to operate simultaneously as much as possible.
  • the method begins in step 500 and proceeds to step 505 where an instruction requiring the DMA engine is executed by the SIMD pipeline.
  • step 510 the SIMD pipeline accesses the required DMA engine queue. If in step 510, the DMA engine instruction queue is already full when it is accessed, the SIMD pipeline is paused from further execution, as described in step 515.
• the SIMD pipeline waits for a free space in the instruction queue of the targeted DMA engine.
  • the DMA engine corresponding to the target queue performs its current DMA operation instructed by the DMA instruction(s) already in the queue.
  • the DMA engine instruction queue opens up a free space so that in step 525, the stalled DMA instruction can be buffered in the queue.
  • the SIMD pipeline then resumes execution in step 530 after the DMA instruction has been buffered. Accordingly, through the various systems and methods disclosed herein, simultaneous operation of the SIMD pipeline and the DMA engines is maximized without the risk of overwrite.
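Steps 505-530 above can be sketched as a small Python model. This is a behavioral sketch only: `issue_dma` stands in for the SIMD pipeline trying to push a DMA instruction, the deque models the DMA engine's instruction queue (one-deep, as in the example of Figure 11), and popping an entry stands in for the DMA engine completing the operation already buffered.

```python
from collections import deque

def issue_dma(dma_queue, instruction, queue_depth=1):
    """Steps 505-530 as a sketch: stall only while the DMA engine's
    instruction queue is full, then buffer the instruction and resume."""
    stalled = False
    while len(dma_queue) >= queue_depth:  # steps 510/515: queue full -> stall
        stalled = True
        dma_queue.popleft()               # step 520: DMA engine completes the
                                          # operation already in its queue
    dma_queue.append(instruction)         # step 525: buffer the new instruction
    return stalled                        # step 530: SIMD pipeline resumes

q = deque(["DI1"])                        # one-deep queue already holding DI1
assert issue_dma(q, "DI2") is True        # DI2 must wait for DI1 to drain
assert list(q) == ["DI2"]
```

With a deeper queue (`queue_depth` of 2 or more), the same routine models the double- or triple-buffering extension: more DMA instructions can be buffered before the SIMD pipeline ever stalls.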
  • Block matching achieves high video compression ratios by finding a region in a previously encoded frame that is a close match for each macro block in a current video frame.
  • the spatial offset between the current block and reference block is called a motion vector.
• Block matching algorithms compute the pixel-by-pixel difference between a selected block of a reference frame and the current block. Temporal redundancy between blocks in subsequent frames allows the encoder to encode the video without encoding the pixel values of each block.
  • various embodiments of the invention provide a flexible and efficient systolic array-based block matching algorithm that can be configured to match blocks of size 4x4, 4x8, 8x4 and 8x8 pixels, etc., to provide support for variable block sizes in the H.264 and most other modern video codec standards.
  • FIG 13 a block diagram illustrating an architecture for performing the sum of absolute difference (SOAD) calculation for block matching according to at least one embodiment of the invention is depicted.
  • the architecture 600 consists of four primary components: the sequencer 605, the block matching unit 610, the reference picture buffer 615, and the DMA unit 620.
  • the sequencer 605 functions as the control unit executing the search sequence.
  • the block matching unit 610 comprises a wide data path, which, in various embodiments, is able to load 8 reference pixels and cycle through an 8x8 block, such that each row of reference pixels is used 8 times.
  • the reference picture buffer 615 is used to store a large number (several blocks) of reference pixels to reduce the required bandwidth between the block matching unit 610 and system memory.
• the DMA unit 620 may comprise a pair of DMA units such as, for example, data input and data output units.
  • each cell 650 of the systolic array computes an 8-bit absolute difference between target and reference pixels. As the target and reference pixels move across the systolic array, the absolute difference of each such pair of pixels is computed and accumulated. In various embodiments, each row computes the sum of the 8 cell results. Thus, eight cycles after starting a block calculation, a row will produce a block sum of absolute difference (SOAD) result.
• x_t represents the pixel of the target block at time t
• r_t represents the pixel of the reference block at time t.
  • the start of each row calculation is staggered by 1 clock cycle so that results are emitted 1 per cycle.
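The per-cell computation can be expressed numerically. The sketch below collapses the cycle-by-cycle systolic behavior into plain sums: each row accumulates the absolute differences |x_t − r_t| of its eight cell results, and the block SOAD is the sum over the eight rows. The hardware produces each row result 8 cycles after that row starts, staggered by one cycle per row; the arithmetic is the same.

```python
def row_soad(target_row, reference_row):
    """SOAD for one 8-pixel row: sum of the eight 8-bit absolute
    differences |x_t - r_t| that the row's cells compute and accumulate."""
    return sum(abs(x - r) for x, r in zip(target_row, reference_row))

def block_soad(target, reference):
    """8x8 block SOAD: the sum of the eight row results (in hardware,
    one row result is emitted per cycle in a staggered fashion)."""
    return sum(row_soad(t, r) for t, r in zip(target, reference))

assert row_soad([10, 20, 30, 40, 50, 60, 70, 80],
                [12, 18, 30, 44, 50, 61, 70, 79]) == 10
```

This makes concrete why the row accumulators (Carry/Sum) need to be at least 14 bits: eight 8-bit absolute differences can sum to 8 × 255 = 2040, and the carry-save representation carries intermediate values of similar magnitude.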
• Various embodiments will employ an 8x8 array of systolic cells like the cell 200 of Figure 14, such that cell [i,j] is row j and column i, with numbering increasing from top to bottom and from left to right.
• x_t-1 shows that the pixel of the target block is propagated down the array each time cycle.
• r_t-1 represents the pixel of the reference block presented at time t-1. This too propagates down the array so that the same row of the target block is being matched to the corresponding row of the reference block.
• each cell comprises three 8-bit registers, P[i,j], R[i,j] and A[i,j].
• each row j of the array of systolic cells may contain two additional registers Carry[j] and Sum[j] which need to be at least 14 bits in size.
• Each P[i,j] has a multiplexer at its input to allow values to be loaded externally.
  • the difference of each pixel is output to the row accumulator.
• the method begins in step 700 and proceeds to step 705 where the pixel block from the target region is loaded.
  • this comprises loading an 8x8 pixel-block from the target region into the systolic array.
  • the block matching algorithm is searching for a match with this target region.
  • the pixel values are loaded by presenting each row of 8 values from the target block on successive cycles to signals P[i,0] (for i from 0 to 7) at the top of an 8x8 array of cells.
  • Figure 17 illustrates the systolic array 750 and 64-bit word from the search space.
  • step 710 to begin searching, the first 64-bit word from the top-left hand corner of the search space is fetched from memory and inserted into the triangular input array at the right-hand side of the systolic array, that is, the first 64-bit word of the reference blocks.
  • step 715 the sum of absolute difference (SOAD) is computed for each row.
  • SOAD sum of absolute difference
  • the systolic array needs to be primed for 8 cycles before it begins the first SOAD calculation. This is to allow for the first input word to propagate through the triangular array of input cells and arrive at the row 0 of the systolic array. Alternatively, the first 8 results can be discarded.
  • This SOAD value is stored in step 720 and the block is incremented through another eight clock cycles in step 725.
  • the loading of 64-bit words from the search space continues at the rate of one word per cycle for the remainder of the search.
  • the block with the minimum SOAD value is determined. In various embodiments, this block is considered a match.
  • the N x M blocks of the search space are scanned sequentially, proceeding horizontally across the search space, such that a new column of pixels is added and an old one dropped from the target block on each sequence until each vector has been tested.
  • Each block is loaded into the systolic array, starting at row 0 of the block and continuing to row 7, before moving on to consider the next block (that is, incrementing the block by one column) in the search space. This defines an address pattern that must be followed when fetching words from the search space. When the last word of the last block in the N x M search space has been loaded, the search terminates.
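The scan described above amounts to an exhaustive search: every candidate position in the N x M search space is compared against the target block, and the position with the minimum SOAD becomes the motion vector. The Python sketch below replaces the systolic pipelining with direct nested loops; the function name and the (x, y) vector convention are illustrative, not from the source.

```python
def motion_search(target, search_space, block=8):
    """Exhaustive block matching: scan every candidate position in the
    search space, compute its SOAD against the target block, and keep
    the position (motion vector) with the minimum SOAD."""
    rows, cols = len(search_space), len(search_space[0])
    best = None
    for y in range(rows - block + 1):
        for x in range(cols - block + 1):
            soad = sum(abs(target[j][i] - search_space[y + j][x + i])
                       for j in range(block) for i in range(block))
            if best is None or soad < best[0]:
                best = (soad, (x, y))
    return best  # (minimum SOAD, motion vector (x, y))

# embed the target at horizontal offset 2 of an otherwise flat space
target = [[16 * j + i for i in range(8)] for j in range(8)]
search = [[255] * 12 for _ in range(8)]
for j in range(8):
    for i in range(8):
        search[j][i + 2] = target[j][i]
assert motion_search(target, search) == (0, (2, 0))
```

The systolic array reaches the same result while loading one 64-bit reference word per cycle and emitting one row SOAD per cycle, rather than recomputing each candidate block from scratch as this sketch does.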
• Figure 18 illustrates an exemplary target array 760 of 8x8 pixels to be loaded in the systolic array and an exemplary search space in which the matching block is located at motion vector 12 in the X direction from the top left of the search space.
• the systolic array operates such that on each clock pulse, all A[i,j] values are added together using a carry-save adder tree, sometimes referred to in the industry as a Wallace tree. This produces two 14-bit values which are assigned to the Carry[j] and Sum[j] registers on each successive clock pulse.
  • each row produces one SOAD value every eight clock pulses.
  • Each result represents the sum of the absolute differences between the target block and a reference block from the search space.
  • results appear one per cycle in a cyclic manner starting with row 0, and continuing to rows 1, 2, ..., and 7.
  • these results can be stored.
  • these results are fed into a block of logic to compute the minimum of all computed SOAD values and to associate this minimum value with the position of the corresponding block in the NxM search space.
  • the position of the block in the search space that generates the minimum SOAD values defines a potential motion vector for performing motion-compensation in a block- based video encoder such as the H.264 codec.
• the carry-save adder associated with each row of 8 cells can be partitioned into two smaller carry-save adders and two separate pairs of carry/sum registers, Carry[u][j] and Sum[u][j], for u ∈ {0, 1}. In various embodiments this partitioning can be controlled by a mode bit so that the full-row computation or the dual half-row computation can be set at run time. In the half-row mode of operation, each row computes for 4 cycles rather than the 8 cycles described in the context of the 8x8 blocks.
  • each row produces two SOAD values representing the closeness of the match between two adjacent 4x4 blocks from the target block and two adjacent 4x4 blocks from the search space.
  • SOAD values representing the closeness of match for the four 4x4 quadrants of an 8x8 block, can be added together to generate the SOAD value for an 8x8 block. If for example, quadrants A, B, C, and D are equal-sized sub-blocks of an 8x8 block, the generated 4x4 SOAD values can be added pair-wise to generate all possible sub-block SOAD values for 8x4 or 4x8 blocks.
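The pair-wise combination of quadrant SOADs can be shown directly. In the sketch below the quadrant layout is assumed to be A B on the top row and C D on the bottom row of the 8x8 block, so an 8x4 (wide) sub-block sums a horizontal pair and a 4x8 (tall) sub-block sums a vertical pair; the dictionary keys are illustrative names.

```python
def combine_quadrants(qA, qB, qC, qD):
    """Combine the four 4x4 quadrant SOADs of an 8x8 block (layout
    assumed: A B over C D) into all sub-block SOADs pair-wise."""
    return {
        "4x4": (qA, qB, qC, qD),
        "8x4": (qA + qB, qC + qD),  # top and bottom 8-wide halves
        "4x8": (qA + qC, qB + qD),  # left and right 8-tall halves
        "8x8": qA + qB + qC + qD,
    }

soads = combine_quadrants(10, 20, 30, 40)
assert soads["8x8"] == 100
assert soads["8x4"] == (30, 70)
assert soads["4x8"] == (40, 60)
```

Because absolute differences are non-negative, these sums are exact: the 8x8 SOAD really is the sum of its four quadrant SOADs, so the half-row mode loses no information relative to full-row mode.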
• Various embodiments of the invention may also provide for storing more than one minimum SOAD value for the target block, or each sub-block if performing a sub-block search. This allows the logic to compute not only the "best matching block" but also the "next best matching block." If these two blocks differ by at most 1 in both x and y directions, then it may be possible to find a better match by searching at sub-pixel resolution. This can be performed through conventional techniques by using an up-sampling technique (such as, for example, bilinear or bicubic interpolation) over a smaller search space defined by a perimeter that is the smallest rectangle enclosing the two best matching blocks. In various embodiments, when an upsampled search space has been computed, the same systolic array can be used to search this space using the same method. In this case, the target block does not need to be loaded because it will already be in place within the array.
• the systolic array-based systems and methods according to the various embodiments discussed herein provide various improvements over existing systems and methods.
  • the block matching array is provided with the ability to load the array with 64-bit values that are always read from 64-bit aligned memory locations. This is an advantage over previous schemes, which load the systolic array with 8-bits per cycle.
• previous systems and methods have been limited in the range of N and M, i.e., in the definition of the search space.
  • the systems and methods according to the various embodiments of the invention can search an unbounded space.
• any systolic array that passes partial SOAD values from cell to cell must contain a partial SOAD register in each cell. These are 14 bits in size, which is nearly twice the size of the 8-bit local absolute difference value (A[i,j]) needed in this scheme. In the systolic array-based block matching according to the various embodiments of the invention, only two 14-bit registers are needed per row.
• Another advantage of the systolic array-based block matching scheme is that only the carry-save register values need to be added before the final SOAD value is made available. As the final Carry/Sum values resulting from each row are produced at different times, the same full-adder circuit can be shared by all rows. Alternatively, if computing 4x4 sub-block SOAD values, only two full-adders will be required. The ability to select between computing one 8x8 or four 4x4 SOAD values per cycle is another improvement compared to previous schemes. This is particularly useful for encoding video streams using the H.264 standard, which supports motion compensation on variable block sizes, i.e., 4x4, 8x4, 4x8 and 8x8.
• the systolic array-based block matching system and method discussed above can be extended to match blocks of dimension PxQ, where P is in {4, 8, 16} and Q is any multiple of 4.
  • the systolic array would be 16x16 instead of 8x8 as discussed above.
  • four 8x8 SOAD values can be combined to produce a 16x16 block SOAD using the configuration discussed above in the context of Figures 5 and 6.
  • Such modifications are within the spirit and scope of the invention.
  • the systolic array-based block matching system and method according to the various embodiments of the invention is highly flexible.
• FIG. 19 a diagram illustrating the components of a parameterizable clip instruction 800 for either SISD or SIMD processor architectures according to at least one embodiment of the invention is provided.
• algorithms in numerical computations, such as those common in video encoding/decoding, often require results to be clipped to within a specified range of values. For example, in video processing, a system will have a maximum pixel depth depending on the system's resolution. If the value of an intermediate calculation result, such as an interpolation or other calculation, lies outside the maximum value, the final result will have to be clipped to a saturation value, for example, the maximum pixel value.
  • Such a software clipping implementation incurs a high overhead due to the number of calculations required to test each value.
• the sequential nature of a software implementation makes it very difficult to optimize in processors designed to exploit instruction level parallelism, such as, for example, SISD reduced instruction set computer (RISC) machines or very long instruction word (VLIW) machines.
  • Some processors do implement clipping at the hardware level using specialized processor instructions, however, the clipping ranges of these instructions are fixed to some value, typically a power of two. Therefore, various embodiments of this invention provide a parameterizable clip instruction for a microprocessor that enables adjustment of clipping parameters.
• the instruction 800, labeled "VBCLIP", contains three elements: rd, rb and rc.
• rb and rd are the source and destination register addresses respectively. That is, rb is the register address of the value to be clipped and rd is the register address where the clipped value is to be written.
• rc is the controlling parameter for the instruction. The value of rc dictates how the value located at address rb will be clipped. This instruction permits eight 16-bit values to be clipped within the range specified by the control parameter rc.
• Figure 20 illustrates the format of controlling parameter rc in the form of a 32-bit operand and FIG. 21 is a table illustrating the ways in which the parameters of the parameterizable clip instruction may be specified.
• the input rc is a 32-bit input.
• rc may be 16, 32, 64, 128 or another bit size.
• the most significant 16 bits, that is, bits 31 to 16, are unused as seen in the table.
  • bits 15 and 14 are reserved for the range type, while bits 13-0 are used for the range specifier.
• range types [0, 2^N − 1], [−N, N], [−2^N, 2^N − 1] and [0, N] corresponding to 2-bit binary values 00, 01, 10 and 11.
• the remaining 14 least significant bits, bits 13 to 0, are used to represent N, the range specifier. These bits contain a binary number having a maximum value of 11111111111111 (16383).
• with range type 01 or 11, ranges not limited to powers of two may be used.
• the range specifier N is itself a parameter supplied to the VBCLIP instruction 800.
• the range type RT specifies one of the four possible ways the clipping range can be defined using the range specifier N.
  • Range types 00 and 10 are designed to work with unsigned and signed clipping ranges respectively, while types 01 and 11 are designed to work with signed and unsigned clipping ranges that are not powers of two.
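The encoding of rc described above (bits 15-14 = range type RT, bits 13-0 = range specifier N, bits 31-16 unused) can be modeled for a single 16-bit slice as follows. This is a behavioral sketch of the instruction's semantics as described in Figures 20-21, not the hardware implementation; the function name `vbclip` mirrors the instruction mnemonic.

```python
def vbclip(value, rc):
    """Clip one value according to the rc operand encoding:
    bits 15-14 select the range type RT, bits 13-0 hold the specifier N."""
    rt = (rc >> 14) & 0x3
    n = rc & 0x3FFF
    lo, hi = {
        0b00: (0, (1 << n) - 1),          # [0, 2^N - 1]   unsigned, power of two
        0b01: (-n, n),                    # [-N, N]        signed, arbitrary
        0b10: (-(1 << n), (1 << n) - 1),  # [-2^N, 2^N - 1] signed, power of two
        0b11: (0, n),                     # [0, N]         unsigned, arbitrary
    }[rt]
    return max(lo, min(hi, value))

rc = (0b00 << 14) | 8          # clip to [0, 255], e.g. 8-bit pixel depth
assert vbclip(300, rc) == 255
assert vbclip(-5, rc) == 0
```

A SIMD version, as suggested in the text, would simply apply this same function independently to each 16-bit slice of the vector register rb.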
• the VBCLIP instruction is therefore a highly flexible processor implementation of clipping.
• while FIGS. 20 and 21 describe VBCLIP as an SISD instruction, the instruction syntax can easily be extended to SIMD architectures in which both registers rb and rc are vector registers.
• clipping, as specified in rc, is applied to each slice of the vector register rb with the results assigned to the corresponding slice in rd.
• An additional advantage of a SIMD version of the clipping instruction is that it bypasses the data-dependent sequential nature of clipping operations, which is awkward to implement in parallel machines.
• the method begins in step 900 and proceeds to step 905 where the clip instruction is fed to the microprocessor pipeline.
  • the instruction comprises an instruction taking the form of a name and three input operands: a destination address, a source address and a controlling parameter.
  • step 910 the data to be operated on is fetched from the source address specified in the instruction.
  • step 915 the range type indicated in the instruction is referenced to determine the actual range after decoding the instruction.
• the range type is represented by two bits of the input operand's controlling parameter rc.
  • a table is stored in a memory register of the processor that maintains a list of the range types indexed by the two-bit code.
  • the range specifier is extracted from the instruction and using the range type, a range is determined.
  • the value fetched in step 910 is clipped in accordance with the range determined in step 920.
  • the result is written to the destination address specified in the destination address input operand rd of the instruction. Operation of the method stops in step 935.
• the SIMD architecture is particularly well suited for applications such as media processing, including audio, images and video, due to the fact that a few operations are repeatedly performed on relatively large blocks of data.
  • Video codecs are at the heart of nearly all modern digital video products including DVD players, cameras, video-enabled communication devices, gaming systems, etc.
  • lossy image and video compression algorithms discard only perceptually insignificant information so that to the human eye the reconstructed image or video sequence appears identical to the original uncompressed image or video. In practice, some artifacts may be visible. This can be attributed to poor encoder implementation, video content that is particularly challenging to encode, or a selected bit rate that is too low for the video sequence, resolution and frame rate.
  • Blocking artifacts are due to the fact that compression algorithms divide each frame into blocks of 8x8 pixels, 16x16 pixels, etc. Each block is reconstructed with some small errors, and the errors at the edge of a block often contrast with the errors at the edges of neighboring blocks, making block boundaries visible. Ringing artifacts appear as distortions or blurs around the edges of image features. Ringing artifacts are due to the encoder discarding too much information in quantizing the high-frequency transform coefficients.
  • Video compression applications often employ filters following decompression to reduce blocking and ringing artifacts. These filtering steps are known as “deblocking” and “deringing,” respectively. Both deblocking and deringing may be performed using low-pass FIR (finite impulse response) filters to hide these visible artifacts.
  • H.264 was jointly developed by the Moving Picture Experts Group (MPEG) and the International Telecommunication Union (ITU). It is also known as MPEG-4 Part 10 Advanced Video Coding (AVC).
• VC1 is a video codec specification based on MICROSOFT WINDOWS Media Video (WMV) 9 compression technology that is currently being standardized by the Society of Motion Picture and Television Engineers (SMPTE).
• One key attribute of a video compression application is the bit-rate of the compressed video stream. Codecs that target specific applications are designed to stay within the bit-rate constraints of these applications, while offering acceptable video quality. DVDs use 6-8 Mbps with MPEG-2 encoding. However, emerging digital video standards such as HDTV and HD-DVD can demand up to 20-40 Mbps using MPEG-2. Such high bit-rates translate into huge storage requirements for HD-DVDs and a limited number of HDTV channels. Thus, a key motivation for developing a new codec is to lower the bit-rate while preserving or even improving the video quality relative to MPEG-2. This was the motivation that led to the development of both the H.264 and VC1 codecs. These codecs achieve significant advances in improving video quality and reducing bandwidth, but at the cost of greatly increased computational complexity at both the encoder and decoder.
• a deblocking filter operation is specified by both the H.264 and VC1 codecs in order to remove blocking artifacts from each reconstructed frame that are introduced by the lossy, block-based operations.
  • Each video frame is divided into 16x16 pixel macroblocks and each macroblock is further divided into sub-blocks of various sizes for transforms.
  • the deblocking filter is applied to all the edges of such sub-blocks. For each block, vertical edges are filtered from left to right first and then horizontal edges are filtered from top to bottom. The deblocking process is repeated for all macroblocks in a frame.
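The edge-filtering order described above can be sketched as a small Python driver. The 4x4 sub-block size and the internal-edge positions are assumptions for illustration (H.264 uses several sub-block sizes); the point is the ordering: vertical edges left to right, then horizontal edges top to bottom, repeated per macroblock.

```python
def deblock_macroblock(filter_edge, sub=4, mb=16):
    """Filter the internal sub-block edges of one mb x mb macroblock:
    vertical edges left to right, then horizontal edges top to bottom.
    `filter_edge(kind, pos)` stands in for a row/column filter pass."""
    for x in range(sub, mb, sub):       # vertical edges, left to right
        filter_edge("vertical", x)
    for y in range(sub, mb, sub):       # horizontal edges, top to bottom
        filter_edge("horizontal", y)

edges = []
deblock_macroblock(lambda kind, pos: edges.append((kind, pos)))
assert edges[0] == ("vertical", 4)
assert edges[3] == ("horizontal", 4)
```

In the SIMD implementation described next, each `filter_edge` call expands into per-row (or per-column) applications of the dedicated deblock instructions across the pixels straddling that edge.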
  • An advantage of these block filter instructions over traditional SIMD techniques is that adjacent data elements within a row can be loaded into a vector register as in a typical column-based operation, but instead of performing the same operation on each slice, a dedicated data path is used to compute the entire horizontal computation without the need to first re-arrange the data in memory which would incur a high overhead.
  • Figure 23 depicts a pair of SIMD assembler instructions that are each pipelined to a single slot cycle with a three cycle latency for implementing the H.264 deblock filter operation on a horizontal line of pixels according to at least one embodiment of the invention.
• the instructions are formatted by name, output register VRa, first input register VRb and second input register VRc.
• Figure 24 illustrates the contents of the 128-bit register VRb containing the first input operand to both deblock instructions VH264FT and VH264F of Figure 23.
• the input comprises 128-bit wide data consisting of a horizontal row of eight 16-bit pixels from two pixel blocks, spanning the block edge between the fourth and fifth pixels.
• the first input to each of these instructions is the eight 16-bit luma values of 8 pixels in a row.
• Figure 25 illustrates the contents of a 128-bit register containing the second input operand VRc to the deblock instruction VH264FT of Figure 23. Only the lower half of the 128-bit register is used.
• the lower 64 bits contain the H.264 filter threshold parameters alpha and beta, the strong flag and the filter strength C0. These parameters are derived directly from clauses 8.7.2.1 and 8.7.2.2 of the H.264 specification.
• Figure 26 illustrates the contents of a 128-bit register VRa containing the output of the first deblock instruction VH264FT split into eight 16-bit fields: C, beta, C0, Udelta, UpID, UqID and Flags, which are derived from the inputs in accordance with Table 1.1 below:
• the second instruction VH264F of Figure 23 takes the same 8 pixels input to the VH264FT instruction as its first input operand VRb and the output of the first instruction depicted in Figure 26 as the second input operand VRc.
• the output of the second instruction VH264F, which is stored in destination register VRa, is eight pixels P0, P1, P2, P3, Q0, Q1, Q2, and Q3, calculated based on tables 1.2, 1.3, 1.4 and 1.5 below depending on the input conditions as follows:
  • FIG. 27 is a pixel diagram illustrating the 4x8 block of pixels as pixels from two adjacent blocks with a block edge between four pixels of blocks A and B in each row.
• Figure 28 depicts a pair of single-cycle SIMD assembler instructions for implementing the VC1 deblock filter operation on a horizontal line of pixels according to at least one embodiment of the invention. It should be appreciated that in contrast to the H.264 codec, where the filter instructions according to the embodiments of the present invention are applied only to the luma components, in the VC1 codec the filter instructions are applied to both the luma and chroma components.
• the instructions are formatted by name, output register VRa, first input register VRb and second input register VRc.
• Figure 29 illustrates the contents of the 128-bit register VRb containing the first input operand to the first deblock instruction VVC1FT of Figure 28.
• the operand comprises 128-bit wide data consisting of a horizontal row of eight 16-bit pixels from two adjacent blocks, that is, pixels P1-P8.
• Figure 30 illustrates the contents of a 128-bit register containing the second input VRc to the deblock instruction VVC1FT of Figure 28, in this case just the VC1 filter quantization parameter. Only one of the 16-bit portions of the register is used to store this value. This parameter is derived directly from section 8.6.4 of the VC1 specification.
• Figure 31 illustrates the output of the first deblock instruction VVC1FT in register VRa, which in this case is comprised of five values PQ, a0, a3, Clip and Aclip, derived from table 2.1 as follows:
• the second instruction VVC1F also takes two input operands, VRb and VRc, which contain the same pixel data input to the first instruction VVC1FT and the content of the output register of the first instruction respectively.
• the results of the second instruction VVC1F are output to the destination register address specified by input VRa.
• the VC1 instructions have a slightly different usage than the H.264 ones.
  • the result is 8 pixels P1-P8 calculated according to Table 2.2 as follows:
• the VC1 test instruction is designed to be used in a special order on a group of four registers.
• the VVC1FT instruction must be executed on the 3rd row first. If, based on this, it turns out that the other rows should not be filtered, the PQ parameter is zeroed. This implies that d will also be zeroed; therefore, P4 and P5 will not change. However, VVC1FT still needs to be executed for the other rows to produce Clip, a0 and a1, which are row specific.
• H.264 and VC1 are two emerging video codec standards designed to deliver the high quality video required by today's electronics.
  • In FIG. 32, a schematic diagram illustrating a SIMD topology for performing homogeneously parallel mathematical operations is provided.
  • the data path is divided into several identical lanes, each performing the same operation on different slices of the wide input data as required by the instruction being executed, such as the parallel data path 1000 in Figure 32.
  • This Figure illustrates an example in which a typical SIMD machine is performing eight 16-bit additions on the N-bit inputs B and C to produce an N-bit output A.
  • A, B and C are 128 bits in width. In the Figure, we use the notation Kn to represent the slice K[(16n+15):16n] of the 128-bit data K[127:0]; for example, B6 is the slice B[111:96] of the input B.
  • To implement filter operations using the type of SIMD machine depicted in Figure 32, the filter must be broken down into primitive operations such as additions and multiplications. All these primitive operations must then be performed in parallel in each data lane of the machine, effectively performing several independent filter operations in parallel on different data sets. This type of parallelism may be characterized as homogeneous to the extent that each data lane is performing the same operation.
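The homogeneous lane model described above can be sketched as follows. This is an illustrative Python model of the Figure 32 example, not an actual instruction implementation; the function name, the lane-local wraparound behavior and the sample values are assumptions for demonstration.

```python
def simd_add16(a, b, lanes=8):
    # Model of the Figure 32 example: one instruction performs eight
    # independent 16-bit additions on 128-bit inputs; lane n holds the
    # slice K[(16n+15):16n], and each lane wraps around independently.
    mask = 0xFFFF
    out = 0
    for n in range(lanes):
        an = (a >> (16 * n)) & mask
        bn = (b >> (16 * n)) & mask
        out |= ((an + bn) & mask) << (16 * n)
    return out

a = 0x0001000200030004000500060007FFFF
b = 0x00010001000100010001000100010001
# the low lane (0xFFFF + 1) wraps to 0 without disturbing its neighbors
print(hex(simd_add16(a, b)))
```

The carry-out of the lowest lane is discarded rather than rippling into the next lane, which is what distinguishes the eight independent 16-bit additions from a single 128-bit addition.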
  • In FIG. 33, a diagram illustrating an array of pixels for describing an implementation of a filter-based interpolation method for performing sub-pixel interpolation in the H.264 video codec standard according to at least one embodiment of the invention is provided.
  • the array of pixels 1200 illustrates the problem of inter-pixel interpolation.
  • the H.264 codec specification allows for inter-pixel interpolation down to ½ and ¼ pixel resolutions.
  • this process is performed at the SIMD processor level through a six-tap finite impulse response (FIR) filter.
  • Sub-pixels aa, bb, b, cc, dd, h, j, m, k, ee, ff, s, gg, and hh denote sub-pixel locations that are between two adjacent pixels in either the vertical, horizontal or diagonal directions.
  • the value of the interpolated pixel may be given by a six-tap FIR filter applied to integer-position pixels with weights 1/32, -5/32, 5/8, 5/8, -5/32 and 1/32.
  • clip(a) means to clip the value to a range of 0-255.
  • the value of b can be calculated using 16-bit arithmetic.
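As a concrete sketch, the weights above correspond to integer taps (1, -5, 20, 20, -5, 1) with a normalization by 32. The following Python model is illustrative only: the variable names e..j stand for the six integer-position pixels, and the +16 rounding offset with a shift by 5 is the standard rounded division by 32 used by H.264 half-pixel interpolation.

```python
def clip255(x):
    # clip the result to the 0-255 pixel range
    return max(0, min(255, x))

def half_pixel(e, f, g, h, i, j):
    # six-tap FIR with taps (1, -5, 20, 20, -5, 1); for 8-bit pixels the
    # accumulator stays within -2550..10710, so 16-bit signed arithmetic
    # suffices; +16 then >>5 performs rounded division by 32
    acc = e - 5 * f + 20 * g + 20 * h - 5 * i + j
    return clip255((acc + 16) >> 5)

print(half_pixel(100, 100, 100, 100, 100, 100))  # flat area stays at 100
```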
  • the filter is applied to the raw interpolated pixels of the closest sub- pixel positions as required by the H.264 standard.
  • the raw interpolated pixels can be selected either horizontally or vertically since each produces the same result. If selected horizontally, the raw interpolated pixels used would be cc', dd', h', m', ee' and ff'. These correspond to the pixels cc, dd, h, m, ee and ff in Figure 33 in the same way as b' was related to b above, i.e.
  • intermediate results of the filter function computation (that is, the result of cc' - 5dd' + 20h' + 20m' - 5ee' + ff') can have values in the range -214200 to 475320, which can only be represented with a 20-bit number, when the target sub-pixel is located diagonally between integer pixel positions.
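The quoted range can be checked mechanically. The sketch below is an illustration assuming 8-bit integer pixels; it derives the worst-case interval for one pass of the 6-tap filter and then applies the same bound to the unclipped stage-one results.

```python
PIX_MIN, PIX_MAX = 0, 255  # assumed 8-bit pixel depth

def filter_range(lo, hi):
    # taps (1, -5, 20, 20, -5, 1): positive taps contribute hi to the
    # maximum and lo to the minimum; negative taps do the opposite
    pos, neg = 1 + 20 + 20 + 1, 5 + 5
    return pos * lo - neg * hi, pos * hi - neg * lo

raw_lo, raw_hi = filter_range(PIX_MIN, PIX_MAX)  # stage 1: raw pixels
mid_lo, mid_hi = filter_range(raw_lo, raw_hi)    # stage 2: diagonal filter
print(raw_lo, raw_hi)   # -2550 10710
print(mid_lo, mid_hi)   # -214200 475320
# the intermediate result fits in a signed 20-bit quantity:
print(-2**19 <= mid_lo and mid_hi <= 2**19 - 1)  # True
```

A signed 20-bit value spans -524288..524287, so the -214200..475320 interval just fits, which is why only the intermediate data path needs widening beyond 16 bits.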
  • the instruction takes as input 8 16-bit values that are positioned horizontally in a row and are results from vertical 6-tap filter operations previously performed. This input contains sufficient data for two adjacent filter operations to be performed.
  • the instruction performs the two adjacent 6-tap filter operations concurrently using internal 20-bit arithmetic operations, and outputs 2 interpolated luma pixel values.
  • r0 = (s0 - 5s1 + 20s2 + 20s3 - 5s4 + s5 + 512) >> 10
  • r1 = (s1 - 5s2 + 20s3 + 20s4 - 5s5 + s6 + 512) >> 10
  • the results r0 and r1 are two adjacent interpolated pixels that are diagonally positioned between integer pixel positions, such as pixels j and k of Figure 33.
  • a relatively simple instruction is capable of performing the majority of processing necessary to produce these two interpolated pixels. In the context of Figure 33, if j corresponds to clipped r0, then s0, s1, s2, s3, s4 and s5 are the raw interpolated pixels corresponding to cc, dd, h, m, ee and ff respectively.
  • Result r1 would correspond to the interpolated pixel k between pixel m and pixel ee.
  • the computation of r1 would require a value for pixel s6 which is not shown in Figure 33.
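A behavioral sketch of the instruction's arithmetic follows. The function name and the flat-input example are illustrative assumptions, not the actual instruction encoding; the +512 rounding offset with a shift by 10 performs rounded division by 1024 (two cascaded /32 normalizations).

```python
def diag_pair(s):
    # s holds seven raw interpolated values s0..s6 produced by the prior
    # vertical 6-tap filtering; the sums require 20-bit intermediate
    # precision, and +512 then >>10 performs rounded division by 1024
    r0 = (s[0] - 5*s[1] + 20*s[2] + 20*s[3] - 5*s[4] + s[5] + 512) >> 10
    r1 = (s[1] - 5*s[2] + 20*s[3] + 20*s[4] - 5*s[5] + s[6] + 512) >> 10
    return r0, r1

# a flat area of pixel value 100 yields raw values 32 * 100 = 3200, and
# both diagonal sub-pixels come out at 100 before clipping
print(diag_pair([3200] * 7))
```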
  • Figure 34 is a schematic diagram illustrating a SIMD topology for performing heterogeneously parallel mathematical operations according to at least one embodiment of the invention.
  • two filter operations, such as the aforementioned sub-pixel interpolation filter required in modern video codecs, are performed concurrently.
  • the outputs of these concurrent filter operations correspond to the values of two adjacent interpolated pixels that are at half-pixel diagonal positions, for example, sub-pixels j and k in Figure 33.
  • each lane performs several operations that are specific to the lane. Intermediate results in all data lanes are then summed up and shifted as required to produce two 16-bit output results. These two results are distributed across all even and odd data lanes respectively and are written back to the corresponding slices of the destination register under the control of a mask mechanism.
  • an important aspect of the embodiments of the current invention is the exploitation of heterogeneous parallelism, in which different data lanes perform different operations, in a SIMD machine that typically only exploits homogeneous parallelism, in which all data lanes perform the same operation.
  • Another important aspect is that the dedicated internal data path used to implement the required operations can be adjusted according to the required width of the intermediate computation. Since the input and output data are well within the native precision of the SIMD machine, i.e. 16-bit, and only the intermediate computation requires extra precision, i.e. 20-bit, it is sufficient to widen only the dedicated internal data path while the rest of the SIMD pipeline still just needs to support the native precision it was designed for. Hence the wasteful widening of the entire SIMD pipeline is avoided.
  • FIG. 35 is a flow chart of an exemplary method for accelerating sub-pixel interpolation according to at least one embodiment of the invention.
  • the method begins in step 1400 and proceeds to step 1405 where the raw interpolated pixel input values are calculated.
  • a filter operation must first be performed to determine the value of the raw interpolated pixels located horizontally or vertically between integer pixels positions, in the same row or column as the target diagonal sub-pixels.
  • the intermediate raw interpolated sub-pixels are input into the SIMD data path for performing the interpolation calculation of two diagonally oriented sub-pixels.
  • this is accomplished with a single instruction that inputs the values of 7 raw interpolated pixels allowing for two adjacent diagonally oriented pixels to be determined.
  • the intermediate filter calculations are performed in each data lane of the SIMD data path. In various embodiments, this comprises performing multiplication operations to derive the filter coefficients, e.g., -5s1 + 20s2 + 20s3 - 5s4, etc.
  • the intermediate results for each of the two sub-pixel calculations are summed across the data path.
  • summation results are right-shifted by 10 bits. In various embodiments, the results are distributed across odd and even data lanes respectively.
  • the results of this operation are written back to corresponding slices of the destination register.
  • steps 1410 through 1430 are performed with a single instruction thereby accelerating the sub-pixel interpolation process significantly.
  • Step 1405 is a pre-computation that is required as an input for the instruction according to the various embodiments of the invention.
  • the output of step 1430 has to be clipped to produce the actual interpolated sub-pixel.

Abstract

An architecture for microprocessor-based systems is provided. The architecture includes a SIMD processing unit and associated systems and methods for optimizing the performance of such a system. In one embodiment, systems and methods for performing systolic array-based block matching are provided. In another embodiment, a parameterizable clip instruction is provided. In an additional embodiment, two pairs of deblocking instructions for use with the H.264 and VC-1 codecs are provided. In a further embodiment, an instruction and datapath for accelerating sub-pixel interpolation are provided. In another embodiment, systems and methods for selectively decoupling processor extension logic are provided. In yet another embodiment, systems and methods for recording instruction sequences in a microprocessor having dynamically decoupleable extension logic are provided. In yet one other embodiment, systems and methods for synchronizing multiple processing engines of a SIMD engine are provided.

Description

ARCHITECTURE FOR MICROPROCESSOR-BASED SYSTEMS INCLUDING SIMD PROCESSING UNIT AND ASSOCIATED SYSTEMS AND METHODS
CROSS REFERENCE TO RELATED APPLICATIONS
[0001] This application claims priority to U.S. Provisional Patent Application No. 60/721,108 titled "SIMD Architecture and Associated Systems and Methods," filed September 28, 2005, the disclosure of which is hereby incorporated by reference in its entirety.
FIELD OF THE INVENTION
[0002] The invention relates generally to microprocessor architectures and more specifically to systems and methods for optimizing a microprocessor-based system including a SIMD processing unit.
BACKGROUND OF THE INVENTION
[0003] Processor extension logic is utilized to extend a microprocessor's capability.
Typically, this logic is in parallel with and accessible by the main processor pipeline. It is often used to perform specific, repetitive, computationally intensive functions, thereby freeing up the main processor pipeline.
[0004] Processor extension logic may be implemented as an addition to supplement the capabilities of a main processor core. In some cases, this processor extension logic may be optimized to perform specific calculation intensive functions such as video processing and/or audio processing in order to free the main processor core to continue performing other functions.
[0005] Single instruction multiple data (SIMD) architectures have become increasingly important as demand for video processing in electronic devices has increased. The SIMD architecture exploits the data parallelism that is abundant in data manipulations often found in media related applications, such as discrete cosine transforms (DCT) and filters. Data parallelism exists when a large mass of data of uniform type needs the same instruction performed on it. Thus, in contrast to a single instruction single data (SISD) architecture, in a SIMD architecture a single instruction may be used to effect an operation on a wide block of data. SIMD architecture exploits parallelism in the data stream while SISD can only operate on data sequentially. [0006] An example of an application that takes advantage of SIMD is one where the same value is being added to a large number of data points, a common operation in many media applications. One example of this is changing the brightness of a graphic image. Each pixel of the image may consist of three values for the brightness of the red, green and blue portions of the color. To change the brightness, the R, G and B values, or alternatively the YUV values, are read from memory, a value is added to them, and the resulting value is written back to memory. A SIMD processor enhances performance of this type of operation over that of a SISD processor. A reason for this improvement is that in SIMD architectures, data is understood to be in blocks and a number of values can be loaded at once. Instead of a series of instructions to incrementally fetch individual pixels, a SIMD processor will have a single instruction that effectively says "get all these pixels." Another advantage of SIMD machines is that multiple pieces of data are operated on simultaneously. Thus, a single instruction can say "perform this operation on all the pixels." Thus, SIMD machines are much more efficient in exploiting data parallelism than SISD machines.
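The brightness example above can be sketched as a single conceptual "add to all lanes" operation. The Python below is an illustrative scalar model only, with saturation assumed at an 8-bit pixel maximum; real SIMD hardware would apply the add to all lanes in one instruction.

```python
def brighten(pixels, delta):
    # one SIMD instruction conceptually applies the same add across every
    # pixel component at once; saturation keeps each value in 0-255
    return [max(0, min(255, p + delta)) for p in pixels]

print(brighten([10, 128, 250], 10))  # [20, 138, 255]
```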
[0007] A video sequence consists of a number of still image frames presented in time sequence to create the appearance of continuous motion. High quality video is usually comprised of thirty or more frames per second. Thus, when digitizing a high resolution video clip, the required bandwidth increases rapidly. The amount of data required to represent even a single picture (still image) is derived from the frame's dimensions multiplied by the pixel depth. Thus, even 640x480 video with a pixel depth of 256, that is, 8 bits for each of the RGB or YUV elements of each pixel, would require 0.9216 Megabytes per frame without compression. At thirty frames per second, that is a throughput of 27.648 Mbytes per second. However, because video is merely a sequence of frames, subsequent frames are often very similar in terms of their content, containing a lot of redundant data. When compressing video, this redundant data is removed to achieve data compression.
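The quoted figures follow directly from the frame dimensions and pixel depth:

```python
width, height = 640, 480
bytes_per_pixel = 3            # 8 bits for each of the R, G and B elements
frame_bytes = width * height * bytes_per_pixel
print(frame_bytes / 1e6)       # 0.9216 Megabytes per frame
print(30 * frame_bytes / 1e6)  # 27.648 Mbytes per second at 30 frames/s
```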
[0008] In video compression applications, motion compensation describes a current frame in terms of where each block of that frame came from in a previous frame. Motion compensation reduces the amount of data throughput required to reproduce video by describing frames by their measured change from previous and subsequent frames. [0009] Various techniques exist for performing motion compensation. A first approach is to simply subtract a reference frame from a given frame. The difference is called the residual and usually contains less information than the original frame. Thus, rather than encoding the frame, only the residual is encoded. The residual can be encoded at a lower bit-rate without degrading the image quality. The decoder can reconstruct the original frame by simply adding the reference frame again.
[0010] Another technique is to estimate the motion of the whole scene and the objects in a video sequence. The motion is described by some parameters that have to be encoded in the bit-stream. The blocks of the predicted frame are approximated by appropriately translated blocks of the reference frame. This gives more accurate residuals than a simple subtraction. However, the bit-rate occupied by the parameters of the motion model can become quite large. This runs contrary to the goal of achieving high compression ratios.
[0011] Video frames are often processed in groups. One frame (usually the first) is encoded without motion compensation just as a normal image, that is, without compression. This frame is called I-frame or I-picture. The other frames are called P-frames or P-pictures and are predicted from the I-frame or P-frame that comes (temporally) immediately before it. The prediction schemes are, for instance, described as IPPPP, meaning that a group consists of one I-frame followed by four P-frames.
[0012] Frames can also be predicted from future frames. The future frames then need to be encoded before the predicted frames and thus, the encoding order does not necessarily match the real frame order. Such predicted frames are usually predicted from two directions, i.e. from the I- or P-frames that immediately precede or follow the predicted frame. These bidirectionally predicted frames are called B-frames.
[0013] In block motion compensation, frames are partitioned in blocks of pixels (e.g. macroblocks of 16x16 pixels in MPEG). Each block is predicted from a block of equal size in the reference frame. The blocks are not transformed in any way apart from being shifted to the position of the predicted block. This shift is represented by a motion vector. The motion vectors are the parameters of this motion compensation model and have to be encoded into the bit-stream. [0014] Existing block matching methods may be performed in software, or may be implemented by a special-purpose hardware device. Software implementations have the disadvantage of being slow, whereas hardware solutions often lack the flexibility needed to support a wide range of different video encoding standards. A specific problem associated with both software and hardware techniques is that of memory alignment. To achieve high performance motion estimation, the pixels of the reference frame should be retrieved from memory in groups of 8 or even 16. However, blocks of pixels from the reference frame are not guaranteed to be located in memory at an address that is an integer multiple of 8. This may require non-aligned accesses, with extra hardware and additional memory access cycles, and is therefore one problem with existing methods.
[0015] Numerical computation algorithms, such as those common in video encoding/decoding, often require results to be clipped to be within a specified range of values. For example, in video processing, a system will have a maximum pixel depth depending on the system's resolution. If the value of an intermediate calculation result, such as an interpolation or other calculation, lies outside the maximum value, the final result will have to be clipped to the saturation value, for example, the maximum pixel value.
[0016] Clipping is typically implemented in software using a sequence of instructions that first test the intermediate value and then conditionally assign the final value, for example, if value > maximum, then value = maximum. Such a software clipping implementation incurs a high overhead due to the number of calculations required to test each value. The sequential nature of a software implementation makes it very difficult to be optimized in processors designed to exploit instruction level parallelism, such as, for example, SISD reduced instruction set computer (RISC) machines or very long instruction word (VLIW) machines. Some processors do implement clipping at the hardware level using specialized processor instructions, however, the clipping ranges of these instructions are fixed to some value, typically a power of two.
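The software sequence described above amounts to the following. This is a generic parameterized clip for illustration, not the patent's specific instruction encoding; the range bounds lo and hi play the role of the "controlling parameter".

```python
def clip_param(value, lo, hi):
    # the conditional-assignment sequence described in the text:
    # test against each bound, then conditionally replace the value
    if value > hi:
        return hi
    if value < lo:
        return lo
    return value

print(clip_param(300, 0, 255))  # 255
print(clip_param(-7, 0, 255))   # 0
print(clip_param(100, 0, 255))  # 100
```

The two data-dependent branches per value are exactly what makes this sequence hard to schedule on machines that exploit instruction-level parallelism, which motivates a single hardware clip instruction with parameterizable bounds.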
[0017] One disadvantage of SIMD systems is that they can require additional memory registers to support data, which increases processor complexity and cost, or they share resources such as registers with processing units of the CPU. This can cause competition for resources, conflicts, pipeline stalls and other events that adversely affect overall processor performance. A major disadvantage of SIMD architecture is the rigid requirement on data arrangement. The overhead to rearrange data in order to exploit data parallelism can significantly impact the speedup in computation and can even negate the performance gain achievable by a SIMD machine in comparison to a conventional SISD machine. Also, attaching a SIMD machine as an extension to a conventional SISD machine can cause various issues such as synchronization, decoupling, etc. Thus, there exists a need for a SIMD-based architecture that fully exploits the advantages of parallelism without suffering from the design complexity or other shortcomings of conventional systems.
SUMMARY OF THE INVENTION
[0019] Accordingly, in view of the foregoing, at least one embodiment of the invention provides a method of dynamically decoupling a parallel extended processor pipeline from a main processor pipeline. The method according to this embodiment comprises sending an instruction from the main processor pipeline to the parallel extended processor pipeline instructing the parallel extended processor pipeline to operate autonomously, operating the parallel extended processor pipeline autonomously, storing subsequent instructions from the main processor pipeline to the parallel extended processor pipeline in an instruction queue, executing an instruction with the parallel extended processor pipeline to cease autonomous execution, and thereafter executing instructions supplied by the main processor pipeline in the queue.
[0020] At least one other embodiment of the invention provides a microprocessor architecture. The microprocessor architecture according to this embodiment comprises a main instruction pipeline, and an extended instruction pipeline, wherein the main instruction pipeline is configured to issue a begin record instruction to the extended instruction pipeline, causing the extended instruction pipeline to begin recording a sequence of instructions issued by the main instruction pipeline.
[0021] Another embodiment of the invention provides a method for synchronization of multiple processing engines in an extended processor core. The method according to this embodiment comprises placing direct memory access (DMA) functionality in a single instruction multiple data (SIMD) pipeline, where the DMA functionality comprises a data-in engine and a data-out engine, and each DMA engine is allowed to buffer at least one instruction issued to it in a queue without stopping the SIMD pipeline. The method may also comprise, when the DMA engine queue is full, and a new DMA instruction is trying to enter the queue, blocking the SIMD pipeline from executing any instructions that follow until the current DMA operation is complete, thereby allowing the DMA engine and SIMD pipeline to maximize parallel operation while still remaining synchronized.
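The blocking rule in the paragraph above can be modeled as follows. This is a hypothetical behavioral sketch: the queue depth of one, the method names and the stall counter are illustrative assumptions, not the actual DMA engine design.

```python
from collections import deque

class DMAEngineModel:
    # each DMA engine may buffer one pending instruction; issuing into a
    # full queue stalls the issuing (SIMD) pipeline until the transfer
    # at the head of the queue completes
    def __init__(self, depth=1):
        self.depth = depth
        self.queue = deque()
        self.stalls = 0

    def issue(self, op):
        if len(self.queue) >= self.depth:
            self.stalls += 1      # SIMD pipeline blocked here
            self.queue.popleft()  # current DMA operation completes
        self.queue.append(op)

eng = DMAEngineModel()
for op in ("xfer0", "xfer1", "xfer2"):
    eng.issue(op)
print(eng.stalls)  # the first issue is buffered for free; the next two stall
```

The one-deep buffer lets the SIMD pipeline run ahead by a single DMA instruction, which is what allows DMA and compute to overlap while staying synchronized.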
[0022] Another embodiment of the invention provides a method of performing block matching with a systolic array. The method according to this embodiment comprises selecting an NxN target pixel block, selecting an NxN reference block from a starting point of an NxM reference block search space, propagating the target and reference blocks through N cycles to completely load the target and reference blocks into an array of systolic cells, computing a sum of absolute difference (SOAD) between a pixel of the target block and the reference block for each N rows of the array, saving the SOAD for the current reference block, incrementing to the next reference block of the NxM reference block search space and selecting a new NxN reference block, repeating the propagating, computing, saving and incrementing steps until all blocks in the reference search space have been tested, and selecting the block from the reference search space having the lowest SOAD as a motion vector for the target block.
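A scalar reference model of the search loop described above might read as follows. The systolic array evaluates one candidate block position per step; the Python below is an illustrative software equivalent of the search (function names and the tiny 2x2 example are assumptions), not the hardware.

```python
def soad(target, ref):
    # sum of absolute differences between two equally sized pixel blocks
    return sum(abs(t - r)
               for trow, rrow in zip(target, ref)
               for t, r in zip(trow, rrow))

def best_match(target, search, n):
    # slide an NxN window across the NxM search space and keep the
    # candidate with the lowest SOAD; its offset gives the motion vector
    best_score, best_x = None, None
    for x in range(len(search[0]) - n + 1):
        candidate = [row[x:x + n] for row in search]
        score = soad(target, candidate)
        if best_score is None or score < best_score:
            best_score, best_x = score, x
    return best_score, best_x

target = [[1, 2], [3, 4]]
search = [[9, 1, 2, 9], [9, 3, 4, 9]]
print(best_match(target, search, 2))  # (0, 1): exact match at offset 1
```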
[0023] Another embodiment of the invention may provide a method of causing a microprocessor to perform a clip operation. The method according to this embodiment may comprise providing an assembly instruction to the microprocessor, the instruction comprising an input address, an output address and a controlling parameter, decoding the instruction with logic in the microprocessor, retrieving a data input from the input address, determining a specific clip operation based on the controlling parameter, performing the clip operation on the data input, and writing the result to output address.
[0024] A further embodiment of the invention provides a method of causing a microprocessor to perform a CODEC deblocking operation on a horizontal row of image pixels. The method according to this embodiment comprises providing a first instruction to the microprocessor having three 128-bit operands comprising the 16-bit components of a horizontal row of pixels in a YUV image as a first input operand, wherein the horizontal row of pixels are in image order and include four pixels on either side of a pixel block edge, at least one filter threshold parameter as a second input operand, and a 128-bit destination operand for storing the output of the first instruction as a third operand, calculating an output value of the first instruction, and storing the output value of the first instruction in the 128-bit destination register. The method according to this embodiment also comprises providing a second instruction to the microprocessor having three 128-bit operands comprising the first input operand of the first instruction as the first input operand, the output of the first instruction as a second input operand, and a destination operand of a 128-bit register for storing an output of the second instruction as the third operand, calculating an output value of the second instruction, and storing the output value in the 128-bit register specified by the destination operand of the second instruction.
[0025] An additional embodiment according to the invention provides a method for accelerating sub-pixel interpolation in a SIMD processor. The method according to this embodiment comprises determining a set of interpolated pixel values corresponding to sub-pixel positions horizontally between integer pixel locations, inputting the set of interpolated pixel values into an N lane SIMD data path, where N is an integer indicating the number of parallel, 16-bit data lanes of the SIMD processor, performing separate, simultaneous intermediate filter operations in each lane of the N lane data path, summing the intermediate results and performing right shifting on the sums, and outputting a value of two adjacent sub-pixels located diagonally with respect to integer pixel locations and in the same row as the set of interpolated pixel values.
[0026] These and other embodiments and advantages of the present invention will become apparent from the following detailed description, taken in conjunction with the accompanying drawings, illustrating by way of example the principles of the invention.
BRIEF DESCRIPTION OF THE DRAWINGS
[0027] In order to facilitate a fuller understanding of the present disclosure, reference is now made to the accompanying drawings, in which like elements are referenced with like numerals. These drawings should not be construed as limiting the present disclosure, but are intended to be exemplary only.
[0028] Figure 1 is a functional block diagram illustrating a microprocessor-based system including a main processor core and a SIMD media accelerator according to at least one embodiment of the invention;
[0029] Figure 2 is a block diagram illustrating a conventional multistage microprocessor pipeline having a pair of parallel data paths;
[0030] Figure 3 is a block diagram illustrating another conventional multiprocessor design having a pair of parallel processor pipelines; [0031] Figure 4 is a block diagram illustrating a dynamically decoupleable multi-stage microprocessor pipeline according to at least one embodiment of the invention;
[0032] Figure 5 is a flow chart detailing the steps of a method for sending instructions for operating a main processor pipeline and an extended processor pipeline according to at least one embodiment of the invention;
[0033] Figure 6 is a flow chart detailing the steps of a method for dynamically decoupling an extended processor pipeline from a main pipeline according to at least one embodiment of the invention.
[0034] Figure 7 is a code fragment containing an example of a processor extension instruction sequence that is issued to the processor extension in accordance with various embodiments of the invention;
[0035] Figure 8 is a code fragment in which a processor extension instruction is preloaded to a memory location and then run from that location by the processor extension in accordance with various embodiments of the invention;
[0036] Figure 9 is a code fragment containing an example of an extension instruction sequence that is being issued and simultaneously captured and recorded in accordance with at least one embodiment of the invention;
[0037] Figure 10 is a flow chart of an exemplary method for recording instructions in an extended instruction pipeline and using such recorded instructions according to at least one embodiment of the invention.
[0038] Figure 11 is an instruction sequence flow diagram and corresponding event time line illustrating a method for synchronizing processing between DMA tasks and SIMD tasks according to at least one embodiment of the invention;
[0039] Figure 12 is a flow chart detailing steps of an exemplary method for synchronizing multiple processing engines in a microprocessor according to various embodiments of the invention;
[0040] Figure 13 is a block diagram illustrating an architecture for a systolic array-based block matching system and method according to at least one embodiment of the invention; [0041] Figure 14 is a diagram of a cell of systolic array according to at least one embodiment of the invention;
[0042] Figure 15 is a block circuit diagram of the components of a systolic cell according to an embodiment of the invention;
[0043] Figure 16 is a flow chart of an exemplary method for performing block matching in accordance with at least one embodiment of the invention;
[0044] Figure 17 is a diagram of a systolic array and 64-bit word taken from the search space to be compared using the block matching method according to at least one embodiment of the invention;
[0045] Figure 18 is an exemplary 8x8 target pixel block and exemplary Nx8 search space comprised of (N-7) 64-bit search words that the target pixel block is matched against according to at least one embodiment of the invention.
[0046] Figure 19 is a diagram illustrating the components of a parameterizable clip instruction for either SIS or SIMD processor architectures according to at least one embodiment of the invention;
[0047] Figure 20 illustrates the format of a 32-bit parameter input to the parameterizable clip instruction of Figure 19 according to at least one embodiment of the invention;
[0048] Figure 21 is a table illustrating the ways in which the parameters of the parameterizable clip instruction may be specified;
[0049] Figure 22 is a flow chart of an exemplary method of performing a clip operation with a parameterizable clip instruction according to at least one embodiment of the invention
[0050] Figure 23 is a pair of SIMD instructions that are each pipelined to a single slot cycle with a three cycle latency for implementing the H.264 deblock filter operation on a horizontal line of pixels according to at least one embodiment of the invention;
[0051] Figure 24 is a block diagram illustrating the contents of a 128-bit register containing the first input operand to the deblock instruction of Figure 23 according to at least one embodiment of the invention; [0052] Figure 25 is a block diagram illustrating the contents of a 128-bit register containing the second input operand to the deblock instruction of Figure 23 according to at least one embodiment of the invention;
[0053] Figure 26 is a block diagram illustrating the contents of a 128-bit register containing the output of the first deblock instruction split into eight 16-bit fields which is used as the second input operand to the second deblock instruction according to at least one embodiment of the invention;
[0054] Figure 27 is a pixel diagram illustrating the 4x8 block of pixels for processing with a pair of deblock instructions according to at least one embodiment of the invention;
[0055] Figure 28 is a pair of single-cycle SIMD assembler instructions for implementing the VC-1 deblock filter operation on a horizontal line of pixels according to at least one embodiment of the invention;
[0056] Figure 29 is a block diagram illustrating the contents of a 128-bit register containing the first input operand to the deblock instruction of Figure 28 according to at least one embodiment of the invention;
[0057] Figure 30 is a block diagram illustrating the contents of a 128-bit register containing the second input operand, the VC-1 filter quantization parameter, to the deblock instruction of Figure 28 according to at least one embodiment of the invention;
[0058] Figure 31 is a block diagram illustrating the contents of a 128-bit register containing the output of the first deblock instruction split into eight 16-bit fields with data in the first five fields which is used as the second input operand to the second deblock instruction according to at least one embodiment of the invention;
[0059] Figure 32 is a schematic diagram illustrating a SIMD topology for performing homogeneously parallel mathematical operations;
[0060] Figure 33 is a diagram illustrating an array of pixels for describing an implementation of a filter-based interpolation method for performing sub-pixel interpolation in the H.264 video codec standard according to at least one embodiment of the invention;
[0061] Figure 34 is a schematic diagram illustrating a SIMD topology for performing heterogeneously parallel mathematical operations according to at least one embodiment of the invention; and
[0062] Figure 35 is a flow chart of an exemplary method for accelerating sub-pixel interpolation according to at least one embodiment of the invention.
DETAILED DESCRIPTION
[0063] The following description is intended to convey a thorough understanding of the embodiments described by providing a number of specific embodiments and details involving systems and methods for performing block matching for motion compensation applications. It should be appreciated, however, that the present invention is not limited to these specific embodiments and details, which are exemplary only. It is further understood that one possessing ordinary skill in the art, in light of known systems and methods, would appreciate the use of the invention for its intended purposes and benefits in any number of alternative embodiments, depending upon specific design and other needs.
[0064] Referring now to Figure 1, a functional block diagram illustrating a microprocessor-based system 5 including a main processor core 10 and a SIMD media accelerator 50 according to at least one embodiment of the invention is provided. The diagram illustrates a microprocessor 5 comprising a standard single instruction single data (SISD) processor core 10 having a multistage instruction pipeline 12 and a SIMD media engine 50. In various embodiments, the processor core 10 may be a processor core such as the ARC 700 embedded processor core available from ARC International Ltd. of Elstree, United Kingdom, and as described in provisional patent application number 60/572,238 filed May 19, 2004 entitled "Microprocessor Architecture," which is hereby incorporated by reference in its entirety. Alternatively, in various embodiments, the processor core may be a different processor core.
[0065] In various embodiments, a single instruction issued by the processor pipeline 12 may cause up to sixteen 16-bit elements to be operated on in parallel through the use of the 128-bit data path 55 in the media engine 50. In various embodiments, the SIMD engine 50 utilizes closely coupled memory units. In various embodiments, the SIMD data memory 52 (SDM) is a 128-bit wide data memory that provides low latency access to perform loads to and stores from the 128-bit vector register file 51. The SDM contents are transferable via a DMA unit 54, thereby freeing up the processor core 10 and the SIMD core 50. In various embodiments, the DMA unit 54 comprises a DMA in engine 61 and a DMA out engine 62. In various embodiments, both the DMA in engine 61 and DMA out engine 62 may comprise instruction queues (labeled Q in the Figure) for buffering one or more instructions. In various embodiments, a SIMD code memory 56 (SCM) allows the SIMD unit to fetch instructions from a localized code memory, allowing the SIMD pipeline to dynamically decouple from the processor core 10, resulting in truly parallel operation between the processor core and SIMD media engine, as discussed in commonly assigned U.S. Patent Application No. XX/XXX,XXX, titled "Systems and Methods for Recording Instruction Sequences in a Microprocessor Having a Dynamically Decoupleable Extended Instruction Pipeline," filed concurrently herewith, the disclosure of which is hereby incorporated by reference in its entirety.
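The packed arithmetic described above can be made concrete with a small software model. The following Python sketch is illustrative only and is not the actual hardware or instruction set; the lane semantics shown (a wrapping lane-wise 16-bit add, named after the vaddw mnemonic that appears later in this description) are an assumption for the example. It shows how a single issued instruction can operate on sixteen 16-bit elements packed into a 128-bit register:

```python
# Illustrative model only (not the actual hardware): one SIMD instruction
# operating on sixteen 16-bit lanes packed into a 128-bit integer.
LANES = 16          # 128 bits / 16 bits per element
MASK16 = 0xFFFF     # one 16-bit lane

def unpack(reg128):
    """Split a 128-bit integer into sixteen 16-bit lanes (lane 0 first)."""
    return [(reg128 >> (16 * i)) & MASK16 for i in range(LANES)]

def pack(lanes):
    """Pack sixteen 16-bit lanes back into a 128-bit integer."""
    reg = 0
    for i, v in enumerate(lanes):
        reg |= (v & MASK16) << (16 * i)
    return reg

def vaddw(a128, b128):
    """Lane-wise 16-bit add with wraparound, modeled as one instruction."""
    return pack([(x + y) & MASK16
                 for x, y in zip(unpack(a128), unpack(b128))])
```

A single call to vaddw here stands in for one instruction issued by the pipeline 12 that updates all sixteen lanes of a 128-bit vector register at once.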
[0066] Therefore, the microprocessor architecture according to various embodiments of the invention may permit the processor to operate in both closely coupled and decoupled modes of operation. In the closely coupled mode of operation, the SIMD program code fetch and program stream supply is exclusively handled by the processor core 10. In the decoupled mode of operation, the SIMD pipeline 53 executes code from a local memory 56 independent of the processor core 10. The processor core 10 may control the SIMD pipeline 53 to execute tasks such as audio processing, entropy encoding/decoding, discrete cosine transforms (DCTs) and inverse DCTs, motion compensation and de-block filtering.
[0067] With continued reference to the microprocessor architecture in Figure 1, the main processor pipeline 12 has been extended with a high performance SIMD engine 50 and two direct memory access (DMA) engines 61 and 62, one for moving data into a local memory, the SIMD data memory (SDM), and one for moving data out of local memory. The SIMD engine 50 and DMA engines 61, 62 all execute instructions that are fetched and issued from the main processor pipeline 12. To achieve high performance, these individual engines need to be able to operate in parallel, and hence, as discussed above, instruction queues (Q) are placed between the main processor core 10 and the SIMD engine 50, and between the SIMD engine 50 and the DMA engines 61, 62, so that they can all operate out of step of each other. In addition, in various embodiments, a local SIMD code memory (SCM) is introduced so that macros can be called and executed from these memories. This allows the main processor core, the SIMD engine and the DMA engines to execute out of step of each other.
[0068] Referring now to Figure 2, a block diagram illustrating a conventional multistage microprocessor pipeline having a pair of parallel data paths is depicted. In a microprocessor employing a variable-length pipeline, the data paths required to support different instructions typically have a different number of stages. Data paths supporting specialized extension instructions for performing digital signal processing or other complex but repetitive functions may be used only some of the time during processor execution and remain idle otherwise. Thus, whether or not these instructions are currently needed will affect the number of effective stages in the processor pipeline.
[0069] Extending a general-purpose microprocessor with application specific extension instructions can often add significant length to the instruction pipeline. In the pipeline of Figure 2, pipeline stages F1 to F4 at the front end 100 of the processor pipeline are responsible for functions such as instruction fetch, decode and issue. These pipeline stages are used to handle all instructions issued by the microprocessor. After these stages, the pipeline splits into parallel data paths 110 and 115 incorporating stages E1-E3 and D1-D4 respectively. These parallel sub-paths represent pipeline stages used to support different instructions/data operations. For example, stages E1-E3 may be the primary/default processor pipeline, while stages D1-D4 comprise the extended pipeline designed for processing specific instructions. This type of architecture can be characterized as coupled or tightly coupled to the extent that regardless of whether instructions are destined for default pipeline stages E1-E3 or extended pipeline D1-D4, they all must pass through stages F1-F4 until a decision is made as to which portion of the pipeline will perform the remaining processing steps.
[0070] By using the single pipeline front end to fetch and issue all instructions, the processor pipeline of Figure 2 achieves the advantage that instructions can be freely intermixed, irrespective of whether the instructions are executed by the data path in sub-paths E1-E3 or D1-D4. Thus, all instructions appear as a single thread of program execution. This type of pipeline architecture also has the advantage of greatly simplified program design and debugging, thereby reducing the time to market in product developments. It is admittedly a highly flexible architecture. However, a limitation of this architecture is that the sequential nature of instruction execution significantly limits the exploitable parallelism between the data paths that could otherwise be used to improve overall performance. This negatively affects performance relative to other parallel pipeline architectures.
[0071] Figure 3 is a block diagram illustrating another conventional multiprocessor architecture having a pair of parallel instruction pipelines. The processor pipeline of Figure 3 contains a front end 120 comprised of stages F1-F4 and a rear portion 125 comprised of stages E1-E3. However, the processor also contains a parallel data path having a front end 135 comprised of front end stages G1-G2 and a rear portion 140 comprised of stages D1-D4. Unlike the architecture of Figure 2, this architecture contains truly parallel pipelines to the extent that both front portions 120 and 135 can each fetch instructions separately. This type of parallel architecture may be characterized as loosely coupled or decoupled because the application specific extension data path G1-G2 and D1-D4 is autonomous and can execute instructions in parallel to the main pipeline consisting of F1-F4 and E1-E3. This arrangement enhances exploitable parallelism over the architecture depicted in Figure 2. However, as the two parallel pipelines become independent, mechanisms are required to synchronize their operations, as represented by dashed line 130. These mechanisms, typically implemented using specific instructions and bus structures, are often not a natural part of a program and are inserted as afterthoughts to "fix" the disconnect between the main pipeline and the extended pipeline. As a consequence, the resulting program utilizing both instruction pipelines becomes difficult to design and optimize.
[0072] Referring now to Figure 4, a block diagram illustrating a dynamically decoupleable multi-stage microprocessor pipeline according to at least one embodiment of the invention is provided. The pipeline architecture according to this embodiment ameliorates at least some, and preferably most or all, of the above-noted limitations of conventional parallel pipeline architectures.
This exemplary pipeline depicted in Figure 4 consists of a front end portion 145 comprising stages F1-F4, a rear portion 150 comprising stages E1-E3, and a parallel extendible pipeline having a front portion 160 comprising stages G1-G2 and a rear portion 165 comprising stages D1-D4. In the pipeline depicted in Figure 4, instructions can be issued from the CPU to the extendible pipeline D1 to D4. To decouple the extendible pipeline D1 to D4 from the front portion 145 of the main pipeline F1 to F4, a queue 155 is added between the two pipelines. The queue serves to delay execution of instructions issued by the front end portion 145 of the main pipeline if the extension pipeline is not ready. A tradeoff can be made during system design to decide how many entries should be in the queue 155 to ensure that the extension pipeline is sufficiently decoupled from the main pipeline. Additionally, in various embodiments, the main pipeline can issue a Sequence Run (vrun) instruction to instruct the extension pipeline to use its own front end 160, G1 to G2 in the diagram, to execute instruction sequences stored in a record memory 156, causing the extension pipeline to fetch and execute instructions autonomously. In various embodiments, while the extension pipeline, G1-G2 and D1-D4, is performing operations, the main pipeline can keep issuing extension instructions that accumulate in the queue 155 until the extension pipeline executes a Sequence Record End (vendrec) instruction. After the vendrec instruction is executed, the extension pipeline resumes executing instructions issued to the queue 155.
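The coupled/decoupled switching just described can be sketched in software. The following Python model is a simplification with assumed names and parameters (the queue depth, the textual instruction spellings, and the step-per-cycle framing are all illustrative, not the hardware design): the queue buffers instructions issued by the main pipeline, a vrun instruction switches the extension pipeline to autonomous fetching from the record memory, and vendrec re-couples it to the queue.

```python
from collections import deque

# Illustrative model of the decoupling queue (155) and record memory (156)
# between the main pipeline front end and the extension pipeline.
class ExtensionPipeline:
    def __init__(self, record_memory, queue_depth=32):
        self.queue = deque()                  # queue 155
        self.queue_depth = queue_depth        # design-time tradeoff
        self.record_memory = record_memory    # record memory 156
        self.autonomous = False
        self.macro = iter(())
        self.executed = []

    def issue(self, instr):
        """Main-pipeline front end pushes one instruction into the queue;
        returns False (a stall, in hardware) if the queue is full."""
        if len(self.queue) >= self.queue_depth:
            return False
        self.queue.append(instr)
        return True

    def step(self):
        """One extension-pipeline cycle."""
        if self.autonomous:
            instr = next(self.macro)          # fetch from record memory
            if instr == "vendrec":
                self.autonomous = False       # re-couple to the queue
            else:
                self.executed.append(instr)
        elif self.queue:
            instr = self.queue.popleft()
            if instr.startswith("vrun"):
                start = int(instr.split()[1])
                self.macro = iter(self.record_memory[start])
                self.autonomous = True        # decouple: run macro on its own
            else:
                self.executed.append(instr)
```

In this model the main pipeline may keep issuing instructions (which accumulate in the queue) while the extension runs a macro; once the macro's vendrec is reached, queued instructions are drained in order.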
[0073] Therefore, instead of trying to get what effectively becomes two independent processors to work together as in the pipeline depicted in Figure 3, the pipeline depicted in Figure 4 is designed to switch between being coupled, that is, executing instructions from the main pipeline front end 145, and being decoupled, that is, during autonomous runtime of the extended pipeline. As such, the instructions vrun and vendrec, which dynamically switch the pipeline between the coupling states, can be designed to be lightweight, executing in, for example, a single cycle. These instructions can then be seen as parallel analogs of the conventional call and return instructions. That is, when instructing the extension pipeline to fetch and execute instructions autonomously, the main processor pipeline is issuing a parallel function call that runs concurrently with its own thread of instruction execution to maximize speedup of the application. The two threads of instruction execution eventually join back into one after the extension pipeline executes the vendrec instruction, which is the last instruction of the program thread autonomously executed by the extension pipeline.
[0074] In addition to efficient operation, another advantage of this architecture is that during debugging, such as, for example, instruction stepping, the two parallel threads can be forced to be serialized such that the CPU front portion 145 will not issue any instruction after issuing vrun to the extension pipeline until the latter fetches and executes the vendrec instruction. In various embodiments, this will give the programmer the view of a single program thread that has the same functional behavior as the parallel program when executed normally, and hence will greatly simplify the task of debugging.
[0075] Another advantage of a processor pipeline containing a parallel extendible pipeline that can be dynamically coupled and decoupled is the ability to use two separate clock domains. In low-power applications, it is often necessary to run specific parts of the integrated circuit at varying clock frequencies in order to reduce and/or minimize power consumption. Using dynamic decoupling, the front end portion 145 of the main pipeline can utilize an operating clock frequency different from that of the parallel pipeline 165 of stages D1-D4, with the primary clock partitioning occurring naturally at the queue 155, labeled Q in Figure 4.
[0076] Referring now to Figure 5, a flow chart of an exemplary method for sending instructions from a main processor pipeline to an extended processor pipeline according to at least one embodiment of the invention is depicted. Operation of the method begins in step 200 and proceeds to step 205, where an instruction is fetched by the main processor pipeline. In step 210, because the instruction is determined to be one for processing by the parallel extended pipeline, the instruction is passed from the main pipeline to the parallel extended pipeline via an instruction queue coupling the two pipelines. In various embodiments, if the parallel extended pipeline is currently processing instructions from the queue, that instruction will be processed in turn by the parallel extended pipeline as specified in step 220. Otherwise, the instruction will remain in the queue until the parallel extended pipeline has ceased its autonomous operation. In step 225, while the instruction is either sitting in the queue or being processed by the parallel pipeline, the main pipeline is able to continue processing instructions. The queue provides a mechanism for the main pipeline to offload instructions to the parallel extended pipeline without stalling the main pipeline. Operation of the method stops in step 230.
[0077] Referring now to Figure 6, this Figure is a flow chart of an exemplary method for dynamically decoupling an extended processor pipeline from a main pipeline according to at least one embodiment of the invention. Operation of the method begins in step 300 and proceeds to step 305, where the main processor pipeline sends a run instruction to the parallel extended pipeline via the instruction queue coupling the pipelines. In step 310, the parallel pipeline retrieves the run instruction from the queue. As noted above, this may occur instantly or after the parallel pipeline has retrieved and processed other instructions in front of the run instruction in the queue.
In various embodiments, this run instruction will specify, in a record memory accessible by the parallel extended pipeline, the starting location of a sequence of recorded instructions. Next, in step 315, based on receipt of the run instruction, the parallel extended pipeline begins executing the series of recorded instructions, that is, it begins autonomous operation. In various embodiments, this comprises fetching and executing its own instructions independent of the main pipeline. Also, in various embodiments, the parallel extended pipeline may operate at a clock frequency different from that of the main pipeline, such as, for example, a fraction thereof (e.g., ½, ¼, etc.). Concurrent with the parallel extended pipeline's autonomous execution, the main processor pipeline can continue sending instructions to the parallel extended pipeline as depicted in step 320. Then, in step 325, after the parallel pipeline has processed an end instruction recorded at the end of the sequence of recorded instructions, autonomous operation of that pipeline ceases. In step 330, the parallel pipeline returns to the queue to process any queued instructions received from the main pipeline. In step 335, the parallel extended pipeline continues processing instructions issued by the main pipeline that appear in the queue until an instruction to begin autonomous operation is received.
[0078] As discussed above in the context of Figure 1, general purpose microprocessors, including embedded microprocessors, are sometimes extended with co-processors, additional extension instructions, and/or pipeline extensions, all collectively referred to hereafter as "processor extensions." A processor extension typically supports specialized instructions that greatly accelerate the computation required by the application that the instruction is designed for. For example, SIMD extension instructions can be added to a processor to improve the performance of applications with a high degree of data parallelism. Traditionally, there are two ways by which such specialized instructions are issued. First, the instructions can be issued directly from the CPU or main processor pipeline to the processor extension through a tightly coupled interface, as discussed above in the context of Figure 2. Second, the CPU can preload the instructions into specific memory locations, and the processor extension is then instructed by the CPU to fetch and execute the preloaded instructions from memory so that the processor extension is largely decoupled from the CPU, as discussed in the context of Figure 3.
[0079] In view of the shortcomings of these two traditional methods, various embodiments of this invention propose an innovative alternative in which processor extension instructions are issued by the CPU (main processor pipeline) and dynamically captured into a processor extension memory or processor extension instruction buffer/queue for subsequent retrieval and playback. In various embodiments, processor extension instructions can optionally be executed by the processor extensions as they are captured and recorded.
[0080] By way of example, consider code fragment A of Figure 7. In this code fragment, all instructions from statement L1 to just before statement L3 are to be issued to the extended instruction pipeline. In this case, these extension instructions are intermixed with general-purpose instructions, and the extension instructions are issued to the processor extension by the CPU through retrieval of the instructions from CPU instruction memory.
[0081] One problem with this approach is that intermixing instructions makes execution in the CPU and the processor extension difficult to decouple. Additionally, extension instruction sequences are typically used in several places in an application. However, the way that these instructions are included in code fragment A does not allow for reductions in overall code size. Issuing extension instructions in this manner also adds overhead to standard CPU code execution, both in the cycles consumed transporting processor extension instructions and in the CPU instruction cache occupancy taken up by storing them.
[0082] As an alternative to this approach of loading instructions whenever they are needed, in various embodiments of the invention, an extension instruction sequence can be preloaded into some specific memory location from which the processor extension logic is directed to fetch such instructions, as shown in code fragment B in Figure 8. In code fragment B, the extension instruction sequence is preloaded to location L100, and then a Sequence Run (vrun) instruction is issued in statement L5 to direct the processor extension to fetch and execute the sequence. However, to dynamically preload such a sequence in a CPU with a load/store architecture, each instruction first has to be loaded into a register in the CPU and then stored at the desired location, requiring at least two instructions (a load and a store). Additional overhead is also incurred by the need to track the number of instructions to be loaded and to increment the addresses of the targeted memory locations. Furthermore, if the extension instruction sequence is adaptive, that is, based upon run-time conditions in the CPU, the preloading routine, referred to as the preloader, would need linking functionalities to modify the sequence while preloading. Such functionalities add to the preloading overhead. An example of adaptation is L2 in code fragment A of Figure 7, in which a CPU register r10 is read in addition to the extension register vr01. The cumulative effect of all these overheads can significantly reduce application performance if the extension instruction sequences have to be dynamically reloaded relatively frequently, as is likely in video processing applications.
[0083] Thus, in various embodiments, this invention introduces a scheme by which, instead of preloading, extension instruction sequences can be captured on-the-fly, that is, while such instructions are being issued from the CPU, and recorded to specific memory locations accessible by the extension logic.
The instructions being recorded can also be optionally executed by the processor extension, further reducing the recording overhead.
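The cost asymmetry between preloading and recording described above can be sketched as follows. The function names and the cycle accounting are illustrative assumptions, not measurements: preloading costs at least a load plus a store per instruction (ignoring the address and count bookkeeping also mentioned above), whereas recording captures instructions at the normal issue rate of roughly one per cycle.

```python
# Illustrative sketch (all names assumed) of the overhead comparison:
# preloading an extension sequence vs. recording it as it is issued.
def preload_sequence(cpu_memory, src, extension_memory, dst, count):
    """Preloader on a load/store CPU: each instruction word must be
    loaded into a core register and then stored to extension memory."""
    cycles = 0
    for i in range(count):
        reg = cpu_memory[src + i]        # load instruction word
        cycles += 1
        extension_memory[dst + i] = reg  # store it to extension memory
        cycles += 1
    return cycles                        # >= 2 cycles per instruction

def record_sequence(issued, extension_memory, dst):
    """Recording: instructions are captured as the CPU issues them,
    bounded by the ~1 instruction-per-cycle issue rate."""
    for i, instr in enumerate(issued):
        extension_memory[dst + i] = instr
    return len(issued)                   # ~1 cycle per instruction
```

Under these assumptions, recording halves the per-instruction transfer cost even before counting the preloader's loop, address-increment and linking overheads.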
[0084] Referring now to code fragment C in Figure 9, in this fragment the Sequence Record (vrec) instruction in statement L1A initiates a recording session to record all extension instructions issued by the CPU to the memory locations starting at L100. The Sequence Record End (vendrec) instruction in statement L2C terminates the recording session. This type of recorded instruction sequence is referred to herein as an instruction macro. Once the instruction macro is recorded, the CPU can then direct the processor extension to fetch and execute the instruction macro using only the vrun instruction, for example, in statement L5 of code fragment C. The overhead in recording the macro is now constrained by the rate at which the CPU can issue extension instructions, which is typically one instruction per cycle, and is significantly less than the overhead in instruction preloading. Also, it becomes trivial to adapt the instruction macro based on runtime conditions in the CPU. There are two such examples of adaptation in code fragment C. In the first example, when issuing the vbmulf instruction in statement L2, the CPU can read its own register r10, and its value is issued directly to the processor extension together with the instruction and recorded into the macro. In the second example, the breq instruction in statement L2A is actually a conditional branch instruction of the CPU that depends on the contents of the CPU registers r4 and r5. If this branch is taken, the vaddw instruction in statement L2B will not be issued to the processor extension and hence not recorded. In various embodiments, a mechanism is used to keep track of address locations in the SCM such that during the recording of subsequent additional instruction sequences, previous instruction sequences are not overwritten and such that the different instruction sequence start addresses are maintained by the main processor core.
[0085] A further advantage of instruction recording over preloading is the elimination of the requirement to load the extension instruction sequences into the data cache using the preloader, which would have polluted the data cache and thereby reduced the overall efficiency of the CPU. Furthermore, by replacing the vrec instruction in statement L1A with the Sequence Record And Run (vrecrun) instruction, each instruction being captured and recorded is also executed by the processor extension, and the overhead of instruction recording is thereby reduced or even minimized. Once recorded, an instruction macro can be used in the same way as a preloaded instruction sequence and has the same benefits of code reuse and simplified decoupled execution. In various embodiments, the record mechanism can coexist with the preloading mechanism, that is, the two mechanisms are not necessarily mutually exclusive. As an example, preloading may still be useful for macros that do not require frequent reloading at runtime.
[0086] In various embodiments, in order to increase and ideally maximize flexibility, the processor extension can operate in one of two modes. In various embodiments, after executing the Sequence Run (vrun) instruction, the processor extension may switch to an autonomous mode in which it fetches and executes instructions in a pre-recorded macro on its own. After executing the Sequence Record End (vendrec) instruction that signifies the end of an instruction macro, the processor extension may switch back to the normal operating mode, in which the CPU provides all further processor extension instructions. As a result of this flexibility, this recording scheme combines the benefits of direct instruction issuing and preloading.
[0087] Referring now to Figure 10, this Figure is a flow chart of an exemplary method for recording instructions in an extended instruction pipeline and using such recorded instructions according to at least one embodiment of the invention. The method begins in step 400 and proceeds to step 405, where the main processor pipeline issues a record instruction sequence instruction to the extended instruction pipeline. In various embodiments, as discussed above, this record sequence instruction will specify a starting memory address. In step 410, the extended pipeline begins recording the sequence of instructions following the record instruction in a memory structure accessible by the extended pipeline, at the starting location specified in the record instruction. It should be appreciated that, as discussed herein, in step 410 the extended pipeline may also begin executing the sequence of instructions in addition to recording them.
[0088] In step 415, the main pipeline issues the record end instruction to the extended pipeline causing the latter to stop recording the instruction sequence. In various embodiments, as indicated in step 420, the extended instruction pipeline may record the end record instruction as the last instruction in the current sequence. As discussed above, after the instruction sequence has been recorded, the main processor pipeline can call the instruction sequence with a single run instruction and effectively decouple the extended pipeline from the main pipeline, as exemplified in the remaining method steps of FIG. 10.
[0089] In step 425, the main processor pipeline calls the recorded instruction sequence. In various embodiments, as illustrated in Figures 8-9 and discussed in the corresponding description, this is accomplished by issuing a run instruction that specifies the start address of the instruction sequence. In this manner, different sequences may be called with the same run instruction by specifying different start addresses. By calling this recorded instruction sequence, the main pipeline effectively decouples the extended pipeline so that the latter may begin fetching and executing instructions autonomously, as stated in step 430. As discussed above, in various embodiments, the extended pipeline has its own front end for this purpose. In various embodiments, the extended pipeline will continue operating in the autonomous mode, that is, independent of the main pipeline's fetch-execution cycles, until the "end" or "record end" instruction that was previously recorded at the end of the current instruction sequence is encountered. In various embodiments, this instruction will cause the extended pipeline to cease autonomous execution and, as stated in step 435, to resume executing instructions issued by the main pipeline via the queue.
[0090] As discussed above, in the microprocessor architecture according to the various embodiments of the invention, a main processor pipeline is extended with a dynamically coupled parallel SIMD instruction pipeline. In various embodiments, the main processor pipeline may issue instructions to the extended pipeline through an instruction queue that effectively decouples the extended pipeline. In various embodiments, the extended SIMD pipeline is also able to run prerecorded macros that are stored in a local SIMD instruction memory, so that a single macro instruction sent to the SIMD pipeline via the queue allows many pre-determined instructions to be executed. This architecture, among other things, allows the SIMD media engine (the extended pipeline) to operate in parallel with the primary pipeline (processor core) and allows the processor core to operate far in advance of the parallel SIMD pipeline.
[0091] One consideration of using an instruction queue to decouple the extended SIMD pipeline from the processor core (main pipeline) is that it becomes possible for the processor core to issue too many instructions, causing the queue to become full. When the main processor pipeline can no longer issue instructions to the queue, the pipeline will have to stall until the queue frees up a slot for the instruction that caused the pipeline to stall. Pipeline stalls have a negative effect on overall system performance. In this case in particular, a pipeline stall means that the processor core will stop being able to operate in parallel, thereby negating the gains derived from the dynamically decoupled extended parallel SIMD pipeline.
[0092] Therefore, in order to prevent the main processor pipeline from issuing instructions to the queue when it is full, thereby causing the main pipeline to stall, in various embodiments, the SIMD pipeline queue uses condition codes to notify the processor pipeline of the condition of the queue. In various embodiments, the SIMD queue sets a condition code of QF, for queue nearly full, whenever there are fewer than a predetermined number of empty slots remaining in the queue. In various embodiments, this number may be 16. However, in various embodiments, the number may be different than 16. In various embodiments, the SIMD queue sets a condition code of QNF, as the opposite of QF, when more than the predetermined number of slots remain available.
[0093] In various embodiments, rather than using several instructions to load these status values and test the values before branching on the test result, two conditional branch instructions using these condition codes directly test for such conditions, thereby reducing the number of instructions required to perform this task. In various embodiments, these instructions will only branch when the condition code used is set. In various embodiments, these instructions may have the mnemonic "BQF" for branch when queue is nearly full and "BQNF" for branch when queue is not nearly full. Such condition codes make the queue full status an integral part of the main processor programming model and make it possible for software to make frequent, lightweight, intelligent decisions to maximize overall performance. These condition codes are maintained by the queue itself based on the queue's status. The instructions that check the condition codes are branch instructions specified to test the particular condition codes. In various embodiments of the invention, checking of a condition code is done by placing condition-code-checking branch instructions where necessary, such as before issuing any instructions to the extended pipeline. Thus, the condition codes provide an easy mechanism for preventing main pipeline stalls caused by trying to issue instructions to a full queue.
[0094] These two conditional branch instructions allow the main processor pipeline to regularly check the status of the queue before issuing more instructions into the extended SIMD pipeline queue. The main processor core can use these instructions to avoid stalling the processor when the queue is full or nearly full, and branch to another task that does not involve the SIMD engine until these queue conditions change. Therefore, these instructions provide the processor with an effective and relatively low overhead means of scheduling work load on the available resources while preventing main pipeline stalls.
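The queue-status mechanism described above can be illustrated with a small software model. The class below is a sketch under assumed parameters (a 64-entry queue and the 16-slot threshold mentioned above); it is not the patent's hardware, and the names `SimdQueue` and `issue` are invented for illustration:

```python
# Illustrative model of the SIMD instruction queue's QF / QNF condition
# codes. The 64-entry capacity is an assumption; the 16-slot threshold
# follows the text.

class SimdQueue:
    def __init__(self, capacity=64, threshold=16):
        self.capacity = capacity
        self.threshold = threshold   # the "predetermined number" of empty slots
        self.entries = []

    def empty_slots(self):
        return self.capacity - len(self.entries)

    @property
    def qf(self):
        # Queue nearly full: fewer than `threshold` empty slots remain.
        return self.empty_slots() < self.threshold

    @property
    def qnf(self):
        # Opposite condition: at least `threshold` slots remain available.
        return not self.qf

    def issue(self, insn):
        self.entries.append(insn)

q = SimdQueue(capacity=64, threshold=16)
for i in range(49):          # fill until only 15 empty slots remain
    q.issue(("vadd", i))
print(q.qf)                  # → True: a BQF branch would now be taken
```

In this model a BQF instruction corresponds to branching when `qf` is set, and BQNF to branching when `qnf` is set, before any further SIMD instructions are issued.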
[0095] As discussed above, operating the main pipeline, extended pipeline and DMA engines in parallel introduces the problem of synchronization. For example, a SIMD code segment may have to wait for a DMA operation, kicked off by the instruction just preceding it, to finish transferring data into the SDM. Conversely, the DMA engine cannot start transferring data out of the SDM until the previously issued SIMD code has been executed. This type of synchronization is normally performed by using software to probe status bits toggled by these engines, or by using interrupts and their associated service routines to kick off the dependent processes. Both of these solutions require large overheads, in terms of cycles as well as coding effort, to achieve the desired synchronization.

[0096] In order to reduce these overheads, in various embodiments of the invention, the DMA engines 61, 62 (Figure 1) are placed in the SIMD pipeline 53 itself, but each DMA engine is allowed to buffer one or more instructions issued to it in a queue without stopping the SIMD pipeline execution. When the DMA engine instruction queue is full, the SIMD engine pipeline 53 will be blocked from executing further instructions only when another DMA instruction arrives at the DMA engine. This allows the software to be re-organized so that SIMD code will not have to wait for a DMA operation to complete, or vice versa, as long as a double or more buffering approach is used, that is, two or more buffers are used to allow overlapping of data transfer and data computation.
[0097] With continued reference to the processor architecture of Figure 1, there are two DMA engines 61, 62, one for moving data into a local memory, one for moving data out of local memory. Each DMA channel is allowed to buffer at least one instruction in a queue. Suppose for example, that there are two independent video pixel data blocks to be processed, and that each requires multiple blocks of pixel data to be moved into local memory and to be processed, before moving the results out of local memory.
[0098] Referring to Figure 11, this Figure illustrates an instruction sequence flow diagram 450 and corresponding event time line 455 illustrating a method for synchronizing processing between DMA tasks and SIMD tasks, with one-deep instruction queues in each DMA engine, according to at least one embodiment of the invention. Looking at the instruction sequence flow diagram 450, the DI2 DMA operation is blocked if the buffered DI1 DMA operation is not completed, causing the DI2 DMA instruction to be blocked from entering the DMA instruction queue, which in turn results in the S1 SIMD operation being blocked. Since the S1 operation depends on data from the DI1 operation, the blocking action prevents the S1 SIMD instruction sequence from proceeding until the DI1 operation is completed. The DI3 DMA operation is executed only after S1 is completed. This eliminates any chance of DI3 overwriting the same data region targeted by the DI1 operation before the data is used by the computation S1. By the time DI3 has completed, the DI2 operation would have completed, allowing S2 to start. If, however, the DI2 operation is not completed, the DI3 operation will be blocked, preventing S2 from starting. Likewise, the DO operation is only executed when S4 has completed. It should be appreciated that in the time line 455, DI2 and S1, DI3 and S2, and DI4 and S3 are shown as starting at the same time respectively. In actual operation, S1 will start one clock cycle after DI2, S2 will start one clock cycle after DI3, and S3 will start one clock cycle after DI4. The time line is intended to demonstrate that S1 cannot start before DI1 is complete, S2 cannot start before DI2 is complete, S3 cannot start before DI3 is complete, and S4 cannot start before DI4 is complete.
[0099] This approach avoids the need for the main processor core to intervene continuously in order to achieve synchronization between the DMA unit and the SIMD pipeline. However, the processor core 10 does need to ensure that the instruction sequence sent uses this functionality to achieve the best performance by parallelizing SIMD and DMA operations. Thus, an advantage of this approach is that it facilitates the synchronization of SIMD and DMA operations in a multi-engine video processing core with minimal interaction with the main control processor core. This approach can be extended by increasing the depth of the DMA non-blocking instruction queue so as to allow more DMA instructions to be buffered in the DMA channels, allowing double, triple or more buffering.
[0100] Referring now to Figure 12, this Figure is a flow chart of an exemplary method for synchronizing multiple processing engines in a microprocessor-based system according to at least one embodiment of the invention. Figure 12 demonstrates a method for coding the instruction sequence to allow both the SIMD engine and DMA engines to operate simultaneously as much as possible. The method begins in step 500 and proceeds to step 505 where an instruction requiring the DMA engine is executed by the SIMD pipeline. In step 510, the SIMD pipeline accesses the required DMA engine queue. If in step 510 the DMA engine instruction queue is already full when it is accessed, the SIMD pipeline is paused from further execution, as described in step 515. In step 520, the SIMD waits for a free space in the instruction queue of the targeted DMA engine. In the meantime, the DMA engine corresponding to the target queue performs its current DMA operation instructed by the DMA instruction(s) already in the queue. After this operation is performed, the DMA engine instruction queue opens up a free space so that in step 525, the stalled DMA instruction can be buffered in the queue. The SIMD pipeline then resumes execution in step 530 after the DMA instruction has been buffered. Accordingly, through the various systems and methods disclosed herein, simultaneous operation of the SIMD pipeline and the DMA engines is maximized without the risk of overwrite.
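The flow of Figure 12 can be summarized with a behavioral sketch. The `DmaEngine` class and `issue_dma` helper below are hypothetical stand-ins for the hardware, not the patent's implementation; the comments map each step number from the flow chart:

```python
# Behavioral sketch of the Figure 12 flow: the SIMD pipeline tries to buffer
# a DMA instruction; if the N-deep DMA queue is full it stalls until the
# DMA engine retires an entry.

from collections import deque

class DmaEngine:
    def __init__(self, queue_depth=1):
        self.queue = deque()
        self.queue_depth = queue_depth

    def retire_one(self):
        # Model the engine completing its current transfer (step 520's wait).
        if self.queue:
            return self.queue.popleft()
        return None

def issue_dma(engine, insn):
    """Returns the number of cycles the SIMD pipeline stalled."""
    stalls = 0
    while len(engine.queue) >= engine.queue_depth:   # steps 510/515: queue full
        engine.retire_one()                          # step 520: wait for space
        stalls += 1
    engine.queue.append(insn)                        # step 525: buffer it
    return stalls                                    # step 530: resume

dma_in = DmaEngine(queue_depth=1)
print(issue_dma(dma_in, "DI1"))   # → 0 (queue empty, no stall)
print(issue_dma(dma_in, "DI2"))   # → 1 (DI1 must retire first)
```

Increasing `queue_depth` models the extension described above, in which deeper non-blocking queues allow double or triple buffering.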
[0101] As discussed herein, a primary task in video encoding is block matching. Block matching achieves high video compression ratios by finding a region in a previously encoded frame that is a close match for each macro block in a current video frame. The spatial offset between the current block and reference block is called a motion vector. Block matching algorithms compute the pixel-by-pixel difference between a selected block of a reference frame and current block. Temporal redundancy between blocks in subsequent frames allows the encoder to encode the video without encoding the pixel values of each block.
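The pixel-by-pixel difference computation just described reduces to the following reference function (a plain software formulation; the 2x2 block contents are made up for brevity):

```python
# Direct (non-systolic) reference for the sum-of-absolute-difference metric
# the block-matching hardware computes.

def soad(target, reference):
    """Pixel-by-pixel sum of absolute differences between two equal-size blocks."""
    return sum(abs(t - r)
               for t_row, r_row in zip(target, reference)
               for t, r in zip(t_row, r_row))

target    = [[10, 20], [30, 40]]
reference = [[12, 18], [33, 40]]
print(soad(target, reference))   # → 7  (2 + 2 + 3 + 0)
```

A smaller SOAD indicates a closer match; the motion vector is the offset of the reference block that minimizes it.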
[0102] Accordingly, various embodiments of the invention provide a flexible and efficient systolic array-based block matching algorithm that can be configured to match blocks of size 4x4, 4x8, 8x4 and 8x8 pixels, etc., to provide support for variable block sizes in the H.264 and most other modern video codec standards. Referring now to Figure 13, a block diagram illustrating an architecture for performing the sum of absolute difference (SOAD) calculation for block matching according to at least one embodiment of the invention is depicted. In the embodiment of Figure 13, the architecture 600 consists of four primary components: the sequencer 605, the block matching unit 610, the reference picture buffer 615, and the DMA unit 620. In various embodiments, the sequencer 605 functions as the control unit executing the search sequence. The block matching unit 610 comprises a wide data path, which, in various embodiments, is able to load 8 reference pixels and cycle through an 8x8 block, such that each row of reference pixels is used 8 times. In various embodiments the reference picture buffer 615 is used to store a large number (several blocks) of reference pixels to reduce the required bandwidth between the block matching unit 610 and system memory. As discussed herein, in various embodiments, the DMA unit 620 may comprise a pair of DMA units such as, for example, data input and data output units.

[0103] Referring now to Figure 14, a block diagram depicting an exemplary cell 650 in the systolic array-based block matching algorithm architecture according to at least one embodiment of the invention is provided. In various embodiments, each cell 650 of the systolic array computes an 8-bit absolute difference between target and reference pixels. As the target and reference pixels move across the systolic array, the absolute difference of each such pair of pixels is computed and accumulated. In various embodiments, each row computes the sum of the 8 cell results.
Thus, eight cycles after starting a block calculation, a row will produce a block sum of absolute difference (SOAD) result. Here x(t) represents the pixel of the target block at time t and r(t) represents the pixel of the reference block at time t. In various embodiments the start of each row calculation is staggered by 1 clock cycle so that results are emitted 1 per cycle. This allows efficient re-use of hardware because each of the N rows can share a single adder. Various embodiments will employ an 8x8 array of systolic cells like the cell 650 of Figure 14, such that cell [i,j] is in row j and column i, with numbering increasing from top to bottom and from left to right. The notation x(t-1) shows that the pixel of the target block is propagated down the array each clock cycle. Similarly, r(t-1) represents the pixel of the reference block presented at the previous cycle. This too propagates down the array so that the same row of the target block is being matched to the corresponding row of the reference block.
[0104] As seen in Figure 15, in various embodiments, each cell comprises three 8-bit registers, P[i,j], R[i,j] and A[i,j]. Also, each row j of the array of systolic cells may contain two additional registers, Carry[j] and Sum[j], which need to be at least 14 bits in size. The connectivity of the cells in the systolic array is such that the output of P[i,j] is connected to the input of P[i,k], where k = j + 1, mod 8. Each P[i,j] has a multiplexer at its input to allow values to be loaded externally. The output of R[i,j] is connected to the inputs R[i-1,q] where q = j + 1, mod 8. Thus, the difference of each pixel is output to the row accumulator. After a reference block has been searched, the target block values in the bottom of the array are copied to the top of the array so that the target block keeps cycling through the array with each newly loaded reference block.
[0105] Operation of the various embodiments of the systolic array will now be described as method steps in the context of Figures 16 and 17. The method begins in step 700 and proceeds to step 705 where the pixel block from the target region is loaded. In various embodiments, this comprises loading an 8x8 pixel block from the target region into the systolic array. The block matching algorithm is searching for a match with this target region. In various embodiments, the pixel values are loaded by presenting each row of 8 values from the target block on successive cycles to signals P[i,0] (for i from 0 to 7) at the top of an 8x8 array of cells. At each cycle the already-loaded P[i,j] values move vertically down the 8x8 matrix of cells, to location P[i,j+1], so that after 8 such cycles all 8 rows of the systolic array have been loaded with the target block. Figure 17 illustrates the systolic array 750 and a 64-bit word from the search space.
[0106] Next, in step 710, to begin searching, the first 64-bit word from the top-left hand corner of the search space is fetched from memory and inserted into the triangular input array at the right-hand side of the systolic array, that is, the first 64-bit word of the reference blocks. In step 715, the sum of absolute difference (SOAD) is computed for each row. It should be appreciated that in various embodiments, the systolic array needs to be primed for 8 cycles before it begins the first SOAD calculation. This is to allow for the first input word to propagate through the triangular array of input cells and arrive at the row 0 of the systolic array. Alternatively, the first 8 results can be discarded. This will yield the same results as waiting 8 cycles before beginning to perform SOAD calculations. This SOAD value is stored in step 720 and the block is incremented through another eight clock cycles in step 725. In various embodiments, the loading of 64-bit words from the search space continues at the rate of one word per cycle for the remainder of the search. Then, in step 730, after a SOAD value has been calculated for each row in the search space, the block with the minimum SOAD value is determined. In various embodiments, this block is considered a match.
[0107] The N x M blocks of the search space are scanned sequentially, proceeding horizontally across the search space, such that a new column of pixels is added and an old one dropped from the target block on each sequence until each vector has been tested. Each block is loaded into the systolic array, starting at row 0 of the block and continuing to row 7, before moving on to consider the next block (that is, incrementing the block by one column) in the search space. This defines an address pattern that must be followed when fetching words from the search space. When the last word of the last block in the N x M search space has been loaded, the search terminates.
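The scan just described amounts to an exhaustive search for the offset with the minimum SOAD. The sketch below is a software stand-in for the sequencer and systolic array, not a cycle-accurate model; `best_match` is an invented name and a small 2x2 target is used for brevity:

```python
# Software stand-in for the sequencer's search: slide the target block over
# every position in the search space and keep the offset (motion vector)
# with the minimum SOAD.

def soad(target, reference):
    return sum(abs(t - r) for tr, rr in zip(target, reference)
               for t, r in zip(tr, rr))

def best_match(target, space, bh, bw):
    best = None
    for y in range(len(space) - bh + 1):
        for x in range(len(space[0]) - bw + 1):
            block = [row[x:x + bw] for row in space[y:y + bh]]
            score = soad(target, block)
            if best is None or score < best[0]:
                best = (score, (x, y))     # motion vector = spatial offset
    return best

target = [[5, 6], [7, 8]]
space = [[0, 0, 0, 0],
         [0, 5, 6, 0],
         [0, 7, 8, 0]]
print(best_match(target, space, 2, 2))    # → (0, (1, 1))
```

The hardware produces the same set of SOAD values, but pipelined at one result per cycle rather than one block comparison at a time.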
[0108] Figure 18 illustrates an exemplary target array 760 of 8x8 pixels to be loaded in the systolic array and an exemplary search space in which the matching block in the search space is located at motion vector 12 in the X direction from the top left of the search space. In various embodiments, the systolic array operates such that on each clock pulse, all A[i,j] values are added together using a carry-save adder tree, sometimes referred to in the industry as a Wallace tree. This produces two 14-bit values which are assigned to the Carry[j] and Sum[j] registers on each successive clock pulse.
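The Wallace-tree accumulation relies on the carry-save addition step sketched below: three addends are reduced to a sum word and a shifted carry word whose ordinary sum equals the three-input total, deferring carry propagation until the final add. The function is a generic illustration, not the patent's adder tree:

```python
# One carry-save adder (CSA) stage: reduces three addends to two words
# (sum, carry) such that s + carry == a + b + c, with no carry chain.

def csa(a, b, c):
    s = a ^ b ^ c                               # bitwise sum without carries
    carry = ((a & b) | (a & c) | (b & c)) << 1  # carries, shifted into place
    return s, carry

s, carry = csa(13, 7, 9)
print(s + carry)   # → 29, same as 13 + 7 + 9
```

A tree of such stages compresses the 64 A[i,j] addends down to one Carry/Sum pair per row; only the final Carry + Sum addition needs a conventional full adder.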
[0109] As discussed herein, when the array is matching blocks of 8x8 pixels, each row produces one SOAD value every eight clock pulses. Each result represents the sum of the absolute differences between the target block and a reference block from the search space. Row k produces a SOAD value one cycle after row j, where k = j + 1, mod 8. Thus, results appear one per cycle in a cyclic manner starting with row 0, and continuing to rows 1, 2, ..., and 7. In various embodiments, these results can be stored. In various other embodiments these results are fed into a block of logic to compute the minimum of all computed SOAD values and to associate this minimum value with the position of the corresponding block in the NxM search space. The position of the block in the search space that generates the minimum SOAD value defines a potential motion vector for performing motion compensation in a block-based video encoder such as the H.264 codec.
[0110] Another feature of the various embodiments of the invention is the ability to perform sub-block searches. In various embodiments, the carry-save adder associated with each row of 8 cells can be partitioned into two smaller carry-save adders and two separate pairs of carry/sum registers, Carry[u][j] and Sum[u][j], for u in {0,1}. In various embodiments this partitioning can be controlled by a mode bit so that the full-row computation or the dual half-row computation can be selected at run time. In the half-row mode of operation, each row computes for 4 cycles rather than the 8 cycles described in the context of the 8x8 blocks. At the end of each 4-iteration cycle, each row produces two SOAD values representing the closeness of the match between two adjacent 4x4 blocks from the target block and two adjacent 4x4 blocks from the search space. Four such SOAD values, representing the closeness of match for the four 4x4 quadrants of an 8x8 block, can be added together to generate the SOAD value for an 8x8 block. If, for example, quadrants A, B, C, and D are equal-sized sub-blocks of an 8x8 block, the generated 4x4 SOAD values can be added pair-wise to generate all possible sub-block SOAD values for 8x4 or 4x8 blocks.
[0111] As with 8x8 SOAD values, it is possible to forward 4x4 SOAD values at the rate of two per cycle to a subsequent block of logic to compute the lowest SOAD value for A, B, C, D, (A+B), (C+D), (A+C), (B+D), and (A+B+C+D). For each of these minima, the subsequent logic must store a motion vector (x,y) where x and y are the horizontal and vertical coordinates of the sub-block generating each minimum SOAD value.
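The pair-wise combinations described in the preceding two paragraphs can be sketched as follows, assuming A, B, C and D are the top-left, top-right, bottom-left and bottom-right 4x4 quadrant SOAD values (the quadrant layout is an assumption; the function name is invented):

```python
# Combine four 4x4 quadrant SOAD values into the variable block sizes the
# H.264 standard supports: 4x4, 8x4, 4x8 and 8x8.

def combine(a, b, c, d):
    return {
        "4x4": (a, b, c, d),
        "8x4": (a + b, c + d),   # top and bottom 8-wide, 4-tall halves
        "4x8": (a + c, b + d),   # left and right 4-wide, 8-tall halves
        "8x8": a + b + c + d,
    }

print(combine(3, 1, 4, 2)["8x8"])   # → 10
```

The minimum-tracking logic then simply keeps, for each of these nine sums, the smallest value seen so far together with its (x,y) offset.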
[0112] Various embodiments of the invention may also provide for storing more than one minimum SOAD value for the target block, or each sub-block if performing a sub-block search. This allows the logic to compute not only the "best matching block" but also the "next best matching block." If these two blocks differ by at most 1 in both x and y directions, then it may be possible to find a better match by searching at sub-pixel resolution. This can be performed through conventional techniques by using an up-sampling technique (such as, for example, bilinear or bicubic interpolation) over a smaller search space defined by a perimeter that is the smallest rectangle enclosing the two best matching blocks. In various embodiments, when an upsampled search space has been computed, the same systolic array can be used to search this space using the same method. In this case, the target block does not need to be loaded because it will already be in place within the array.
[0113] Thus, the systolic array-based systems and methods according to the various embodiments discussed herein provide various improvements over existing systems and methods. Through the novel architecture disclosed herein, the block matching array is provided with the ability to load the array with 64-bit values that are always read from 64-bit aligned memory locations. This is an advantage over previous schemes, which load the systolic array with 8 bits per cycle. Also, previous systems and methods have been limited in the range of N and M, i.e., in the definition of the search space. The systems and methods according to the various embodiments of the invention can search an unbounded space. Also, the use of carry-save adders reduces the gate count of the resulting logic because there are fewer temporary SOAD values, thereby simplifying design and reducing costs. Any systolic array that passes partial SOAD values from cell to cell must contain a partial SOAD register in each cell. These are 14 bits in size, which is nearly twice the size of the 8-bit local absolute difference value (A[i,j]) needed in this scheme. In the systolic array-based block matching according to the various embodiments of the invention, only two 14-bit registers are needed per row.
[0114] Another advantage of the systolic array-based block matching scheme according to the various embodiments of the invention is that the carry-save register values need to be added before the final SOAD value is made available. As the final Carry/Sum values resulting from each row are produced at different times, the same full-adder circuit can be shared by all rows. Alternatively, if computing 4x4 sub-block SOAD values, only two full-adders will be required. The ability to select between computing one 8x8 or four 4x4 SOAD values per cycle is another improvement compared to previous schemes. This is particularly useful for encoding video streams using the H.264 standard, which supports motion compensation on variable block sizes, i.e., 4x4, 8x4, 4x8 and 8x8.
[0115] It should be appreciated that the systolic array-based block matching system and method discussed above can be extended to match blocks of dimension PxQ, where P is in {4, 8, 16} and Q is any multiple of 4. For example, to perform block matching on blocks of dimension 16x16 pixels, the systolic array would be 16x16 instead of 8x8 as discussed above. Alternatively, four 8x8 SOAD values can be combined to produce a 16x16 block SOAD using the configuration discussed above in the context of Figures 5 and 6. Such modifications are within the spirit and scope of the invention. Thus, the systolic array-based block matching system and method according to the various embodiments of the invention is highly flexible.
[0116] Referring now to Figure 19, a diagram illustrating the components of a parameterizable clip instruction 800 for either SIS or SIMD processor architectures according to at least one embodiment of the invention is provided. As discussed above, algorithms in numerical computations, such as those common in video encoding/decoding, often require results to be clipped so as to lie within a specified range of values. For example, in video processing, a system will have a maximum pixel depth depending on the system's resolution. If the value of an intermediate calculation result, such as an interpolation or other calculation, lies outside the maximum value, the final result will have to be clipped to a saturation value, for example, the maximum pixel value.
[0117] Conventionally, clipping is implemented in software using a sequence of instructions that first test the intermediate value and then conditionally assign the final value, for example, if value > maximum, then value = maximum. Such a software clipping implementation incurs a high overhead due to the number of calculations required to test each value. The sequential nature of a software implementation makes it very difficult to be optimized in processors designed to exploit instruction level parallelism, such as, for example, SIS reduced instruction set (RISC) machines or very long instruction word (VLIW) machines. Some processors do implement clipping at the hardware level using specialized processor instructions, however, the clipping ranges of these instructions are fixed to some value, typically a power of two. Therefore, various embodiments of this invention provide a parameterizable clip instruction for a microprocessor that enables adjustment of clipping parameters.
[0118] Referring to Figure 19, the instruction 800 labeled "VBCLIP" contains three elements, rd, rb and rc. Rb and rd are the source and destination register addresses respectively. That is, rb is the register address of the value to be clipped and rd is the register address where the clipped value is to be written. Rc is the controlling parameter for the instruction. The value of rc dictates how the value located at address rb will be clipped. This instruction permits eight 16-bit values to be clipped within the range specified by the control parameter rc.
[0119] Figure 20 illustrates the format of controlling parameter rc in the form of a 32-bit operand and FIG. 21 is a table illustrating the ways in which the parameters of the parameterizable clip instruction may be specified. As seen from these Figures, in this example, the input rc is a 32-bit input. However, it should be appreciated that depending upon the native word size of the processor, rc may be 16, 32, 64, 128 or another bit size. In various embodiments, the most significant 16 bits, that is, bits 31 to 16, are unused as seen in the table. In various embodiments, bits 15 and 14 are reserved for the range type, while bits 13-0 are used for the range specifier.
[0120] In the example of Figure 21, four range types are available. Specifically, range types of [0, 2^N-1], [-N, N], [-2^N, 2^N-1] and [0, N] correspond to the 2-bit binary values 00, 01, 10 and 11. The remaining 14 least significant bits, bits 13 to bit 0, are used to represent N, the range specifier. These bits contain a binary number having a maximum value of 11111111111111 (16383). Thus, by using range type 01 or 11, ranges not limited to powers of two may be used.
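The decode of rc described above can be sketched in software. This is an illustrative model, not the patent's logic: it interprets the power-of-two range types as [0, 2^N-1] and [-2^N, 2^N-1] (consistent with the 14-bit range specifier and the note that types 01 and 11 cover non-power-of-two bounds), and the helper names are invented:

```python
# Illustrative decode of the VBCLIP rc operand: bits 15-14 select the range
# type (RT), bits 13-0 hold the range specifier N.

def decode_range(rc):
    rt = (rc >> 14) & 0b11        # range type
    n = rc & 0x3FFF               # 14-bit range specifier, up to 16383
    if rt == 0b00:
        return 0, 2**n - 1        # unsigned power-of-two range
    if rt == 0b01:
        return -n, n              # signed, arbitrary bound
    if rt == 0b10:
        return -2**n, 2**n - 1    # signed power-of-two range
    return 0, n                   # rt == 0b11: unsigned, arbitrary bound

def vbclip(value, rc):
    lo, hi = decode_range(rc)
    return max(lo, min(hi, value))

rc = (0b00 << 14) | 8             # range [0, 255]: clip to 8-bit pixel values
print(vbclip(300, rc))            # → 255
print(vbclip(-5, rc))             # → 0
```

In the SIMD form described below, the same decode would be applied once and the clip repeated across each 16-bit slice of the vector register.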
[0121] In the table 810 of Figure 21, the range specifier N is itself a parameter supplied to the VBCLIP instruction 800. The range type RT specifies one of the four possible ways the clipping range can be defined using the range specifier N. Range types 00 and 10 are designed to work with unsigned and signed clipping ranges respectively, while types 01 and 11 are designed to work with signed and unsigned clipping ranges that are not powers of two. The VBCLIP instruction is therefore a highly flexible processor implementation of clipping. In addition, though the example of FIGS. 20 and 21 describes VBCLIP as an SIS instruction, the instruction syntax can easily be extended to SIMD architectures in which both registers rb and rc are vector registers. In this case, clipping, as specified in rc, is applied to each slice of the vector register rb with the results assigned to the corresponding slice in rd. An additional advantage of a SIMD version of the clipping instruction is bypassing the data-dependent sequential nature of clipping operations that is awkward to implement in parallel machines.
[0122] Referring now to Figure 22, this figure is a flow chart of an exemplary method for performing a clip operation with a parameterizable clip instruction according to at least one embodiment of the invention. The method begins in step 900 and proceeds to step 905 where the clip instruction is fed to the microprocessor pipeline. As discussed above in the context of Figures 19-21, in various embodiments, the instruction comprises an instruction taking the form of a name and three input operands: a destination address, a source address and a controlling parameter. Then, in step 910, the data to be operated on is fetched from the source address specified in the instruction. Also, in step 915, the range type indicated in the instruction is referenced to determine the actual range after decoding the instruction. In various embodiments, the range type is represented by two bits of the input operand's controlling parameter rc. In various embodiments, a table is stored in a memory register of the processor that maintains a list of the range types indexed by the two-bit code. In step 920, the range specifier is extracted from the instruction and, using the range type, a range is determined. In step 925, the value fetched in step 910 is clipped in accordance with the range determined in step 920. In step 930 the result is written to the destination address specified in the destination address input operand rd of the instruction. Operation of the method stops in step 935.
[0123] As noted above, the SIMD architecture is particularly well suited for applications such as media processing, including audio, images and video, due to the fact that a few operations are repeatedly performed on relatively large blocks of data. This makes the SIMD architecture ideal for implementing video compression/decompression (codec) algorithms. Video codecs are at the heart of nearly all modern digital video products including DVD players, cameras, video-enabled communication devices, gaming systems, etc.
[0124] Most image and video compression algorithms take advantage of redundancy in the image, and even in successive frames, to store less than all the information necessary to fully characterize the image. As a result, these algorithms are considered "lossy." That is, the original uncompressed image cannot be perfectly (from a mathematical perspective) reconstructed from the compressed data because some data has been lost in the compression process. Thus, compression is inherently a balancing act between the competing goals of minimizing the number of bits required to represent the image and ensuring that the differences between the original (uncompressed) image and the reconstructed image are minimized, or at least not perceptible or objectionable to the human eye.
[0125] Ideally, lossy image and video compression algorithms discard only perceptually insignificant information so that to the human eye the reconstructed image or video sequence appears identical to the original uncompressed image or video. In practice, some artifacts may be visible. This can be attributed to poor encoder implementation, video content that is particularly challenging to encode, or a selected bit rate that is too low for the video sequence, resolution and frame rate.
[0126] Two types of artifacts, "blocking" and "ringing" are common in block-based lossy video compression applications. Blocking artifacts are due to the fact that compression algorithms divide each frame into blocks of 8x8 pixels, 16x16 pixels, etc. Each block is reconstructed with some small errors, and the errors at the edge of a block often contrast with the errors at the edges of neighboring blocks, making block boundaries visible. Ringing artifacts appear as distortions or blurs around the edges of image features. Ringing artifacts are due to the encoder discarding too much information in quantizing the high-frequency transform coefficients.
[0127] Video compression applications often employ filters following decompression to reduce blocking and ringing artifacts. These filtering steps are known as "deblocking" and "deringing," respectively. Both deblocking and deringing may be performed using low-pass FIR (finite impulse response) filters to hide these visible artifacts.
[0128] Two emerging video codec standards designed to facilitate the high quality video required by today's electronic devices are the H.264 and VC-1 standards. H.264 was jointly developed by the Moving Picture Experts Group (MPEG) and the International Telecommunication Union (ITU). It is also known as MPEG-4 Part 10 Advanced Video Coding (AVC). VC-1 is a video codec specification based on MICROSOFT WINDOWS Media Video (WMV) 9 compression technology that is currently being standardized by the Society of Motion Picture and Television Engineers (SMPTE).
[0129] One key attribute of a video compression application is the bit-rate of the compressed video stream. Codecs that target specific applications are designed to stay within the bit-rate constraints of these applications, while offering acceptable video quality. DVDs use 6-8 Mbps with MPEG-2 encoding. However, emerging digital video standards such as HDTV and HD-DVD can demand up to 20-40 Mbps using MPEG-2. Such high bit-rates translate into huge storage requirements for HD-DVDs and a limited number of HDTV channels. Thus, a key motivation for developing a new codec is to lower the bit-rate while preserving or even improving the video quality relative to MPEG-2. This was the motivation that led to the development of both the H.264 and VC-1 codecs. These codecs achieve significant advances in improving video quality and reducing bandwidth, but at the cost of greatly increased computational complexity at both the encoder and decoder.
[0130] A deblocking filter operation is specified by both the H.264 and VC-1 codecs in order to remove blocking artifacts from each reconstructed frame that are introduced by the lossy, block-based operations. Each video frame is divided into 16x16 pixel macroblocks and each macroblock is further divided into sub-blocks of various sizes for transforms. The deblocking filter is applied to all the edges of such sub-blocks. For each block, vertical edges are filtered from left to right first and then horizontal edges are filtered from top to bottom. The deblocking process is repeated for all macroblocks in a frame.
[0131] When source data is organized as a standard bitmap, such that wide data operations can access several horizontally adjacent pixels in one memory operation, the process of applying the filter to columns of 8 pixels vertically can be done efficiently by standard SIMD methodology, that is, by applying an instruction to more than one column of pixels at one time. However, this type of memory organization is not suitable for performing the same operation on pixel data within the same row. Thus, in various embodiments, a pair of instructions is provided that enables a processor, such as a SIMD media processor with 128-bit wide registers, to perform the same filter operation on luma components of 8 adjacent pixels on a horizontal line without first re-ordering the data into columns in either the H.264 or VC-1 codec implementations.
[0132] An advantage of these block filter instructions over traditional SIMD techniques is that adjacent data elements within a row can be loaded into a vector register as in a typical column-based operation, but instead of performing the same operation on each slice, a dedicated data path is used to compute the entire horizontal computation without the need to first re-arrange the data in memory, which would incur a high overhead.
[0133] Figure 23 depicts a pair of SIMD assembler instructions that are each pipelined to a single slot cycle with a three-cycle latency for implementing the H.264 deblock filter operation on a horizontal line of pixels according to at least one embodiment of the invention. The instructions are formatted by name, output register VRa, first input register VRb and second input register VR0. Figure 24 illustrates the contents of the 128-bit register VRb containing the first input operand to both deblock instructions VH264FT and VH264F of Figure 23. The input comprises 128 bits of data consisting of a horizontal row of eight 16-bit luma values, one per pixel, spanning a block edge between the fourth and fifth pixels of the row. Figure 25 illustrates the contents of a 128-bit register containing the second input operand VR0 to the deblock instruction VH264FT of Figure 23. Only the lower half of the 128-bit register is used. The lower 64 bits contain the H.264 filter threshold parameters alpha and beta, the strong flag and the filter strength C0. These parameters are derived directly from clauses 8.7.2.1 and 8.7.2.2 of the H.264 specification. The H.264 Specification, ITU-T Recommendation H.264 & ISO/IEC 14496-10 (MPEG-4) AVC, "Advanced Video Coding for Generic Audiovisual Services," Version 3:2005, and the VC-1 Specification, SMPTE 421M, "Proposed SMPTE Standard for Television: VC-1 Compressed Video Bitstream Format and Decoding Process," August 23, 2005, are both hereby incorporated by reference in their entirety into this disclosure.
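As a concrete sketch of the operand layout just described, the eight 16-bit luma values can be modelled as lanes of a 128-bit register image. This is a pure-Python illustration; the packing order (element 0 in the least-significant lane) is an assumption, not taken from the specification.

```python
def pack128(vals16):
    # Pack eight 16-bit values into one 128-bit register image.
    # Assumed order: element 0 occupies the least-significant lane.
    assert len(vals16) == 8
    reg = 0
    for n, v in enumerate(vals16):
        reg |= (v & 0xFFFF) << (16 * n)
    return reg

def unpack128(reg):
    # Recover the eight 16-bit lane values from the register image.
    return [(reg >> (16 * n)) & 0xFFFF for n in range(8)]
```

A row of eight luma pixels spanning the block edge would then be passed to the instructions as one such packed value; the exact lane assignment of the four pixels on each side of the edge is left to the figures.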
[0134] Figure 26 illustrates the contents of a 128-bit register VR3 containing the output of the first deblock instruction VH264FT split into eight 16-bit fields: C, beta, C0, Udelta, UpID, UqID and Flags, which are derived from the inputs in accordance with Table 1.1 below:
Table 1.1
Figure imgf000040_0001
Figure imgf000041_0001
[0135] The second instruction VH264F of Figure 23 takes the same eight pixels input to the VH264FT instruction as its first input operand VRb and the output of the first instruction depicted in Figure 26 as the second input operand VR0. The output of the second instruction VH264F, which is stored in destination register VRa, is eight pixels P0, P1, P2, P3, Q0, Q1, Q2 and Q3, calculated based on Tables 1.2, 1.3, 1.4 and 1.5 below, depending on the input conditions as follows:
Table 1.2
Figure imgf000042_0001
Table 1.3
Figure imgf000042_0002
Figure imgf000043_0001
Table 1.4
Figure imgf000044_0001
Table 1.5
Figure imgf000044_0002
It should be appreciated that in the H.264 codec, the VH264FT instruction is performed on each row of a 4x8 block of 16-bit pixels. The result is then applied to the same 4x8 block of pixels by the VH264F instruction. The 4x8 block of pixels comprises eight pixels in each row, input in image order, that span across an edge between two pixel blocks. Figure 27 is a pixel diagram illustrating the 4x8 block of pixels as pixels from two adjacent blocks with a block edge between four pixels of blocks A and B in each row.
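The per-row usage just described can be sketched structurally. The `vh264ft`/`vh264f` callables below are placeholders for the instruction pair, since their internals (Tables 1.1 through 1.5) are reproduced only as images above; this shows just the two-pass, row-by-row control flow.

```python
def deblock_h264_edge(rows, params, vh264ft, vh264f):
    # rows: four rows of eight 16-bit luma pixels, in image order, spanning
    # the edge between blocks A and B (the 4x8 block of Figure 27).
    # vh264ft / vh264f stand in for the instruction pair; their internals
    # follow Tables 1.1-1.5 and are not modelled here.
    out = []
    for row in rows:
        t = vh264ft(row, params)    # first pass: thresholds / flags
        out.append(vh264f(row, t))  # second pass: filtered pixels
    return out
```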
[0136] Figure 28 depicts a pair of single-cycle SIMD assembler instructions for implementing the VC-1 deblock filter operation on a horizontal line of pixels according to at least one embodiment of the invention. It should be appreciated that in contrast to the H.264 codec, where the filter instructions according to the embodiments of the present invention are applied only to the luma components, in the VC-1 codec the filter instructions are applied to both the luma and chroma components. The instructions are formatted by name, output register VRa, first input register VRb and second input register VR0. Figure 29 illustrates the contents of the 128-bit register VRb containing the first input operand to the first deblock instruction VVC1FT of Figure 28. The operand comprises 128 bits of data consisting of a horizontal row of eight 16-bit pixels from two adjacent blocks, that is, pixels P1-P8. Figure 30 illustrates the contents of a 128-bit register containing the second input VR0 to the deblock instruction VVC1FT of Figure 28, in this case just the VC-1 filter quantization parameter. Only one of the 16-bit portions of the register is used to store this value. This parameter is derived directly from section 8.6.4 of the VC-1 specification.
[0137] Figure 31 illustrates the output of the first deblock instruction VVC1FT in register VRa, which in this case comprises five values: PQ, a0, a3, Clip and Aclip, derived from Table 2.1 as follows:
Table 2.1
Figure imgf000045_0001
[0138] The second instruction VVC1F also takes two input operands, VRb and VR0, which contain the same pixel data input to the first instruction VVC1FT and the content of the output register of the first instruction, respectively. The results of the second instruction VVC1F are output to the destination register address specified by input VRa. The VC-1 instructions have a slightly different usage than the H.264 ones. The result is eight pixels P1-P8 calculated according to Table 2.2 as follows:
Figure imgf000046_0001
[0139] The VC-1 test instruction is designed to be used in a specific order on a group of four registers. In the VC-1 codec, the VVC1FT instruction must be executed on the third row first. If, based on this, it turns out that the other rows should not be filtered, the PQ parameter is zeroed. This implies that d will also be zeroed and, therefore, P4 and P5 will not change. However, VVC1FT still needs to be executed for the other rows to produce Clip, a0 and a3, which are row-specific.
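The third-row-first ordering can be sketched as control flow. The `vvc1ft`/`vvc1f` callables are placeholders for the instruction pair (their internals follow Tables 2.1 and 2.2, shown as images above, and are not modelled), and the `filter_edge` flag name is an assumption; only the ordering and PQ-zeroing behaviour described in the text is captured.

```python
def deblock_four_rows(rows, pq, vvc1ft, vvc1f):
    # The test instruction runs on the third row first; if that row decides
    # the edge should not be filtered, PQ is zeroed, which forces d = 0 so
    # P4/P5 pass through unchanged for every row...
    probe = vvc1ft(rows[2], pq)
    if not probe["filter_edge"]:
        pq = 0
    # ...but VVC1FT still runs once per row, because Clip, a0 and a3 are
    # row-specific.
    return [vvc1f(r, vvc1ft(r, pq)) for r in rows]
```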
[0140] Thus, through the use of the H.264 and VC-1 codec deblocking instructions disclosed above, significant performance gains are achieved by performing the horizontal filtering operation on adjacent pixel data in a row without the overhead of transposing the data into columns and then back into rows. Even for vertical filtering (at least for VC-1), it is much faster to transpose the block, apply the deblock instructions and transpose back again than to perform the deblocking without the special instructions.
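The transpose-then-filter strategy for vertical edges can be sketched with a plain nested-list transpose. This is illustrative only; a real implementation would use SIMD shuffle or merge operations rather than scalar code.

```python
def transpose(block):
    # Rows become columns, so a vertical edge becomes a horizontal one and
    # the row-oriented deblock instructions above can be reused unchanged.
    return [list(col) for col in zip(*block)]
```

A vertical-edge deblock would then be: transpose the block, run the horizontal deblock instructions on each row, and transpose back; applying `transpose` twice is the identity.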
[0141] As noted herein, the H.264 and VC-1 codecs are two emerging video codec standards designed to facilitate the high-quality video required by today's electronics. H.264 was jointly developed by the Moving Picture Experts Group (MPEG) and the International Telecommunication Union (ITU). It is also known as MPEG-4 Part 10 Advanced Video Coding (AVC). VC-1 is a video codec specification based on MICROSOFT WINDOWS Media Video (WMV) 9 compression technology that is currently being standardized by the Society of Motion Picture and Television Engineers (SMPTE).
[0142] One key attribute of a video compression algorithm is the bit-rate of the compressed video stream. Codecs that target specific applications are designed to stay within the bit-rate constraints of these applications, while offering acceptable video quality. DVDs use 6-8 Mbps with MPEG-2 encoding. However, emerging digital video standards such as HDTV and HD-DVD can demand up to 20-40 Mbps using MPEG-2. Such high bit-rates translate into huge storage requirements for HD-DVDs and a limited number of HDTV channels. Thus, a key motivation for developing a new codec is to lower the bit-rate while preserving or even improving the video quality relative to MPEG-2. This was the motivation that led to the development of both the H.264 and VC-1 codecs. These codecs achieve significant advances in improving video quality and reducing bandwidth, but at the cost of greatly increased computational complexity at both the encoder and decoder.
[0143] Referring now to Figure 32, a schematic diagram illustrating a SIMD topology for performing homogeneously parallel mathematical operations is provided. In a typical SIMD-based system, the data path is divided into several identical lanes, each performing the same operation on different slices of the wide input data as required by the instruction being executed, such as the parallel data path 1000 in Figure 32. This Figure illustrates an example in which a typical SIMD machine is performing eight 16-bit additions on the N-bit inputs B and C to produce an N-bit output A. In the case of an 8x16 SIMD unit having eight 16-bit data lanes, A, B and C are 128 bits in width. In the Figure, we use the notation Kn to represent the slice K[(16n+15):16n] of the 128-bit data K[127:0]; for example, B6 is the slice B[111:96] of the input B. To implement filter operations using the type of SIMD machine depicted in Figure 32, the filter must be broken down into primitive operations such as additions and multiplications. All these primitive operations must then be performed in parallel in each data lane of the machine, effectively performing several independent filter operations in parallel on different data sets. This type of parallelism may be characterized as homogeneous to the extent that each data lane is performing the same operation. In the example topology 1000 in Figure 32, eight such filter operations would be performed in parallel. Also, all primitive operations would have to be done in the native precision supported by the SIMD machine, that is, the maximum width of each data lane, such as 16 bits. In cases where the filter operation requires intermediate computations to be done in a higher precision (for example, 20-bit) than the native precision of the SIMD machine (for example, 16-bit), either the higher precision intermediate computation has to be emulated using multi-precision arithmetic routines, or the native precision of the machine must be widened. 
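The homogeneous lane model and the Kn slice notation above can be sketched as a small reference model. This is pure Python; the modulo-2^16 wraparound on each lane is an assumption about overflow behaviour, which Figure 32 does not specify.

```python
MASK16 = 0xFFFF

def simd_add_8x16(b, c):
    # A_n = (B_n + C_n) mod 2**16 for n = 0..7, where K_n = K[(16n+15):16n].
    a = 0
    for n in range(8):
        bn = (b >> (16 * n)) & MASK16
        cn = (c >> (16 * n)) & MASK16
        a |= ((bn + cn) & MASK16) << (16 * n)
    return a
```

Every lane runs the same operation on its own slice, which is exactly the homogeneous parallelism the paragraph describes.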
The former solution is less than ideal because multi-precision routines take extra cycles to execute and are often inefficiently supported in SIMD machines. Widening the native precision of the SIMD machine requires all architectural and internal data path elements to be made wider and is wasteful if only a small number of intermediate computations out of the entire application require the extra precision. Accordingly, various embodiments of the invention permit a SIMD machine to perform higher precision computations without the shortcomings of these solutions. Referring now to Figure 33, a diagram illustrating an array of pixels for describing an implementation of a filter-based interpolation method for performing sub-pixel interpolation in the H.264 video codec standard according to at least one embodiment of the invention is provided. The array of pixels 1200 illustrates the problem of inter-pixel interpolation. As noted herein, the H.264 codec specification allows for inter-pixel interpolation down to 1/2 and 1/4 pixel resolutions. In various embodiments of the invention, this process is performed at the SIMD processor level through a six-tap finite impulse response (FIR) filter. [0145] As noted herein, when performing motion estimation, in some cases the previous block has moved a non-integer number of pixels from its previous location, requiring interpolation to determine the pixel values at these non-integer locations. In the array 1200 of Figure 33, pixels A, B, C, D, E, F, G, H, I, J, K, L, M, N, P, Q, R, S, T and U are labeled in a horizontal and vertical configuration at actual pixel locations. Sub-pixels aa, bb, b, cc, dd, h, j, m, k, ee, ff, s, gg, and hh denote sub-pixel locations that are between two adjacent pixels in either the vertical, horizontal or diagonal directions.
[0146] In the case where the sub-pixel is either horizontally or vertically between two integer pixel positions (e.g. b, h, m or s in Figure 33), the value of the interpolated pixel may be given by a six-tap FIR filter applied to integer-position pixels with weights 1/32, -5/32, 5/8, 5/8, -5/32 and 1/32. For example, the value of sub-pixel b can be calculated by applying the filter to adjacent integer-position pixels E, F, G, H, I and J, the six pixels surrounding and in the same row as the sub-pixel b. By applying the filter to these pixels, raw interpolated pixel b' is obtained, where b' = E - 5F + 20G + 20H - 5I + J. The interpolated pixel value b can then be obtained by b = clip((b' + 16) >> 5), where clip(a) means to clip the value to a range of 0-255. In applications where the pixels are represented by 8-bit unsigned values, the value of b can be calculated using 16-bit arithmetic.
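The half-pel arithmetic above can be written out directly as a scalar sketch; this models the equations, not the SIMD implementation.

```python
def clip255(x):
    # clip(a): clamp to the 8-bit pixel range 0..255
    return max(0, min(255, x))

def half_pel(e, f, g, h, i, j):
    # b' = E - 5F + 20G + 20H - 5I + J, then b = clip((b' + 16) >> 5)
    raw = e - 5 * f + 20 * g + 20 * h - 5 * i + j
    return clip255((raw + 16) >> 5)
```

On a flat region every pixel maps to itself, since the six weights sum to 32 and the rounding shift divides by 32.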
[0147] For cases where the sub-pixel is diagonally between two integer pixel positions (e.g., pixel j in Figure 33), the filter is applied to the raw interpolated pixels of the closest sub-pixel positions as required by the H.264 standard. The raw interpolated pixels can be selected either horizontally or vertically, since each produces the same result. If selected horizontally, the raw interpolated pixels used would be cc', dd', h', m', ee' and ff'. These correspond to the pixels cc, dd, h, m, ee and ff in Figure 33 in the same way as b' was related to b above, i.e. the raw interpolated pixels are the filter outputs prior to shifting and clipping. If selected vertically, the raw interpolated pixels are aa', bb', b', s', gg' and hh', which correspond to aa, bb, b, s, gg and hh in Figure 33. Thus, interpolated pixel j is derived by the equation j = clip((cc' - 5dd' + 20h' + 20m' - 5ee' + ff' + 512) >> 10) or, alternatively, j = clip((aa' - 5bb' + 20b' + 20s' - 5gg' + hh' + 512) >> 10). Because the raw interpolated pixels can have values ranging from -2550 to 10710, intermediate results of the filter function computation (that is, the result of cc' - 5dd' + 20h' + 20m' - 5ee' + ff') can have values in the range -214200 to 475320, which can only be represented with a 20-bit number, when the target sub-pixel is located diagonally between integer pixel positions.
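The quoted ranges can be checked mechanically: the extremes of a weighted sum follow from pairing each positive weight with the input maximum and each negative weight with the input minimum. A small sketch verifying the numbers above:

```python
WEIGHTS = (1, -5, 20, 20, -5, 1)  # six-tap weights before the /32 scaling

def weighted_sum_bounds(weights, lo, hi):
    # Worst-case minimum / maximum of sum(w * x) with each x in [lo, hi].
    mn = sum(w * (lo if w > 0 else hi) for w in weights)
    mx = sum(w * (hi if w > 0 else lo) for w in weights)
    return mn, mx
```

Applied to 8-bit pixels (0..255) this gives raw filter outputs in [-2550, 10710]; feeding that range back through the same weights gives [-214200, 475320], which fits a signed 20-bit number ([-524288, 524287]) but not 16-bit arithmetic.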
[0148] For a SIMD processor using 16-bit precision arithmetic, it is inefficient to perform the necessary 20-bit arithmetic to implement the filter operation on pixels positioned diagonally between integer positions, for the reasons discussed above in the context of Figure 32. Therefore, various embodiments of the invention overcome this inefficiency by providing the hardware necessary for a single instruction to perform the entire filter operation. This instruction enables an 8x16-bit SIMD unit to handle the case in the H.264 standard that requires 20-bit math as part of the filter operation on the luma components of pixels. Thus, various embodiments of this invention significantly simplify and accelerate the implementation of the interpolation filter required for the sub-pixel motion vectors specified by the standards. The instruction takes as input eight 16-bit values that are positioned horizontally in a row and are results from vertical six-tap filter operations previously performed. This input contains sufficient data for two adjacent filter operations to be performed. The instruction performs the two adjacent six-tap filter operations concurrently using internal 20-bit arithmetic, and outputs two interpolated luma pixel values.
[0149] Given a vector containing elements [s0, s1, s2, s3, s4, s5, s6, s7], the instruction returns a vector containing two 16-bit values [r0, r1]; r0 and r1 may be computed as follows:
r0 = (s0 - 5s1 + 20s2 + 20s3 - 5s4 + s5 + 512) >> 10
r1 = (s1 - 5s2 + 20s3 + 20s4 - 5s5 + s6 + 512) >> 10
If the elements s0 to s6 contain adjacent raw vertically interpolated pixels, the results r0 and r1, after subsequent clipping, are two adjacent interpolated pixels that are diagonally positioned between integer pixel positions, such as pixels j and k of Figure 33. Thus, a relatively simple instruction is capable of performing the majority of the processing necessary to produce these two interpolated pixels. In the context of Figure 33, if j corresponds to clipped r0, then s0, s1, s2, s3, s4 and s5 are the raw interpolated pixels corresponding to cc, dd, h, m, ee and ff respectively. Result r1 would correspond to the interpolated pixel k between pixel m and pixel ee. The computation of r1 would require a value for pixel s6, which is not shown in Figure 33.
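A reference model of the instruction's two results, written in pure Python: Python's unbounded integers stand in for the 20-bit internal arithmetic, and its `>>` is an arithmetic shift, matching the signed shift required here.

```python
def interp_pair(s):
    # s: eight 16-bit raw vertically-interpolated values s0..s7 (s7 unused).
    # The intermediate sums need 20-bit signed precision, which Python's
    # unbounded ints provide for free.
    assert len(s) == 8
    r0 = (s[0] - 5 * s[1] + 20 * s[2] + 20 * s[3] - 5 * s[4] + s[5] + 512) >> 10
    r1 = (s[1] - 5 * s[2] + 20 * s[3] + 20 * s[4] - 5 * s[5] + s[6] + 512) >> 10
    return r0, r1
```

On a flat field where every raw input equals 320 (i.e. every source pixel equals 10), both results are 10 before clipping, since the weights sum to 32 and the shift divides by 1024.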
[0150] The particular solution provided by the various embodiments of the invention may be illustrated by way of example in Figure 34. Figure 34 is a schematic diagram illustrating a SIMD topology for performing heterogeneously parallel mathematical operations according to at least one embodiment of the invention. In the Figure, two filter operations, such as the aforementioned sub-pixel interpolation filter required in modern video codecs, are performed concurrently. The outputs of these concurrent filter operations correspond to the values of two adjacent interpolated pixels that are at half-pixel diagonal positions, for example, sub-pixels j and k in Figure 33. In the parallel data path 1300 of Figure 34, each lane performs several operations that are specific to the lane. Intermediate results in all data lanes are then summed up and shifted as required to produce two 16-bit output results. These two results are distributed across all even and odd data lanes respectively and are written back to the corresponding slices of the destination register under the control of a mask mechanism.
[0151] Unlike the parallelism of the SIMD architecture of Figure 32, an important aspect of the embodiments of the current invention is the exploitation of heterogeneous parallelism, in which different data lanes perform different operations, in a SIMD machine that typically only exploits homogeneous parallelism, in which all data lanes perform the same operation. Another important aspect is that the dedicated internal data path used to implement the required operations can be adjusted according to the required width of the intermediate computation. Since the input and output data are well within the native precision of the SIMD machine, i.e. 16-bit, and only the intermediate computation requires extra precision, i.e. 20-bit, it is sufficient to widen only the dedicated internal data path, while the rest of the SIMD pipeline still needs to support only the native precision it was designed for. Hence the wasteful widening of the entire SIMD pipeline is avoided.
[0152] Figure 35 is a flow chart of an exemplary method for accelerating sub-pixel interpolation according to at least one embodiment of the invention. The method begins in step 1400 and proceeds to step 1405, where the raw interpolated pixel input values are calculated. In various embodiments, when sub-pixels are located diagonally between integer pixel positions, a filter operation must first be performed to determine the value of the raw interpolated pixels located horizontally or vertically between integer pixel positions, in the same row or column as the target diagonal sub-pixels. Next, in step 1410, the intermediate raw interpolated sub-pixels are input into the SIMD data path for performing the interpolation calculation of two diagonally oriented sub-pixels. In various embodiments this is accomplished with a single instruction that inputs the values of seven raw interpolated pixels, allowing two adjacent diagonally oriented pixels to be determined. Next, in step 1415, the intermediate filter calculations are performed in each data lane of the SIMD data path. In various embodiments, this comprises performing multiplication operations to derive the weighted filter terms, e.g., -5s1, 20s2, 20s3, -5s4, etc. Next, in step 1420, the intermediate results for each of the two sub-pixel calculations are summed across the data path. In step 1425, the summation results are right-shifted by 10 bits. In various embodiments, the results are distributed across odd and even data lanes respectively. Then, in step 1430, the results of this operation are written back to corresponding slices of the destination register.
[0153] In the method detailed in Figure 35, steps 1410 through 1430 are performed with a single instruction, thereby accelerating the sub-pixel interpolation process significantly. Step 1405 is a pre-computation that is required as an input for the instruction according to the various embodiments of the invention. As a post-processing step not shown in the chart, the output of step 1430 has to be clipped to produce the actual interpolated sub-pixel.
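The overall method of Figure 35, including the step-1405 pre-pass and the final clip, can be sketched end-to-end in scalar Python. The column layout and helper names are illustrative only; in hardware the middle stage is the single heterogeneous-lane instruction.

```python
def clip255(x):
    return max(0, min(255, x))

def sub_pixel_diag(cols):
    # Step 1405: vertical six-tap pre-pass on seven columns of six pixels
    # each, producing raw interpolated values (no shift, no clip yet).
    raw = [c[0] - 5 * c[1] + 20 * c[2] + 20 * c[3] - 5 * c[4] + c[5]
           for c in cols]
    # Steps 1410-1430: the single-instruction horizontal pass, with 20-bit
    # intermediate precision, producing two adjacent diagonal sub-pixels.
    r0 = (raw[0] - 5*raw[1] + 20*raw[2] + 20*raw[3] - 5*raw[4] + raw[5] + 512) >> 10
    r1 = (raw[1] - 5*raw[2] + 20*raw[3] + 20*raw[4] - 5*raw[5] + raw[6] + 512) >> 10
    # Post-processing (paragraph [0153]): clip to 8-bit pixel values.
    return clip255(r0), clip255(r1)
```

For a flat gray region (every source pixel 100), both diagonal sub-pixels come out as 100, as expected for an interpolation filter.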
[0154] The embodiments of the present inventions are not to be limited in scope by the specific embodiments described herein. For example, although many of the embodiments disclosed herein have been described with reference to microprocessor-based systems including SIMD functions, systems, and methods, the principles herein are equally applicable to other aspects of microprocessor design and function. Indeed, various modifications of the embodiments of the present inventions, in addition to those described herein, will be apparent to those of ordinary skill in the art from the foregoing description and accompanying drawings. Thus, such modifications are intended to fall within the scope of the following appended claims. Further, although some of the embodiments of the present invention have been described herein in the context of a particular implementation in a particular environment for a particular purpose, those of ordinary skill in the art will recognize that its usefulness is not limited thereto and that the embodiments of the present inventions can be beneficially implemented in any number of environments for any number of purposes. Accordingly, the claims set forth below should be construed in view of the full breadth and spirit of the embodiments of the present inventions as disclosed herein.

Claims

1. A microprocessor architecture comprising: a first processor instruction pipeline, comprising a front end portion and a rear portion; a second processor instruction pipeline, comprising a front end portion and a rear portion; and an instruction queue coupling the first and second instruction pipeline between their respective front end and rear portions.
2. A microprocessor architecture comprising: a main instruction pipeline; and an extended instruction pipeline, wherein the main instruction pipeline is configured to issue a begin record instruction to the extended instruction pipeline, causing the extended instruction pipeline to begin recording a sequence of instructions issued by the main instruction pipeline.
3. A method for synchronizing multiple processing engines of a microprocessor comprising: coupling an extended instruction pipeline to a main instruction pipeline; coupling direct memory access (DMA) engines to the extended instruction pipeline; buffering at least one instruction in the DMA engine, using a queue, without stopping the extended instruction pipeline; and blocking the extended instruction pipeline from further execution when a DMA engine instruction queue is full and a new DMA instruction arrives at the queue, until a current DMA operation is complete.
4. A method of performing block matching with a systolic array comprising: selecting an NxN target pixel block; selecting an NxN reference block from a starting point of an NxM reference block search space; propagating the target and reference blocks through N cycles to completely load the target and reference blocks into an array of systolic cells; computing a sum of absolute difference (SOAD) between a pixel of the target block and the reference block for each N rows of the array; saving the SOAD for the current reference block; incrementing to the next reference block of the NxM reference block search space and selecting a new NxN reference block; repeating the propagating, computing, saving and incrementing steps until all blocks in the reference search space have been tested; and selecting the block from the reference search space having the lowest SOAD as a motion vector for the target block.
5. A method of causing a microprocessor to perform a clip operation comprising: providing an assembly instruction to the microprocessor, the instruction comprising an input address, an output address and a controlling parameter; decoding the instruction with logic in the microprocessor; retrieving a data input from the input address; determining a specific clip operation based on the controlling parameter; performing the clip operation on the data input; and writing the result to the output address.
6. A method of causing a microprocessor to perform a CODEC deblocking operation on a horizontal row of image pixels across a block edge comprising: providing a first instruction to the microprocessor having three 128-bit operands comprising: the 16-bit components of a horizontal row of pixels in a YUV image as a first input operand, wherein the horizontal row of pixels are in image order and include four pixels on either side of a pixel block edge; at least one filter threshold parameter as a second input operand; and a 128-bit destination operand for storing the output of the first instruction as a third operand; calculating an output value of the first instruction; storing the output value of the first instruction in the 128-bit destination register; providing a second instruction to the microprocessor having three 128-bit operands comprising: the first input operand of the first instruction as the first input operand; the output of the first instruction as a second input operand; and a destination operand of a 128-bit register for storing an output of the second instruction as the third operand; calculating an output value of the second instruction; and storing the output value in the 128-bit register specified by the destination operand of the second instruction.
7. A data path for performing heterogeneous arithmetic interpolation operations in a SIMD processor comprising: a set of equal sized parallel data lanes adapted to: receive data inputs in the form of interpolated pixel values; perform simultaneous, lane specific operations on the data inputs, to derive intermediate values; and to sum the intermediate values, perform a right shift on the sum and distribute the results across multiple data lanes.
PCT/IB2006/003358 2005-09-28 2006-09-28 Architecture for microprocessor-based systems including simd processing unit and associated systems and methods WO2007049150A2 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US72110805P 2005-09-28 2005-09-28
US60/721,108 2005-09-28

Publications (2)

Publication Number Publication Date
WO2007049150A2 true WO2007049150A2 (en) 2007-05-03
WO2007049150A3 WO2007049150A3 (en) 2007-12-27

EP4221229A1 (en) * 2018-09-24 2023-08-02 Huawei Technologies Co., Ltd. Image processing device and method for performing quality optimized deblocking
US11099973B2 (en) * 2019-01-28 2021-08-24 Salesforce.Com, Inc. Automated test case management systems and methods
US20220405221A1 (en) * 2019-07-03 2022-12-22 Huaxia General Processor Technologies Inc. System and architecture of pure functional neural network accelerator
KR20220015680A (en) 2020-07-31 2022-02-08 삼성전자주식회사 Method and apparatus for performing deep learning operations
US11880231B2 (en) * 2020-12-14 2024-01-23 Microsoft Technology Licensing, Llc Accurate timestamp or derived counter value generation on a complex CPU
CN113312088B (en) * 2021-06-29 2022-05-17 北京熵核科技有限公司 Method and device for executing program instruction
US11567775B1 (en) * 2021-10-25 2023-01-31 Sap Se Dynamic generation of logic for computing systems
WO2023235004A1 (en) * 2022-06-02 2023-12-07 Micron Technology, Inc. Time-division multiplexed simd function unit

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO1993008526A1 (en) * 1991-10-21 1993-04-29 Intel Corporation Cross coupling mechanisms for microprocessor instructions using pipelining systems
US5884057A (en) * 1994-01-11 1999-03-16 Exponential Technology, Inc. Temporal re-alignment of a floating point pipeline to an integer pipeline for emulation of a load-operate architecture on a load/store processor
GB2365583B (en) * 2000-02-18 2004-08-04 Hewlett Packard Co Pipeline decoupling buffer for handling early data and late data

Family Cites Families (218)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US4594659A (en) * 1982-10-13 1986-06-10 Honeywell Information Systems Inc. Method and apparatus for prefetching instructions for a central execution pipeline unit
JPS63225822A (en) * 1986-08-11 1988-09-20 Toshiba Corp Barrel shifter
US4905178A (en) * 1986-09-19 1990-02-27 Performance Semiconductor Corporation Fast shifter method and structure
JPS6398729A (en) * 1986-10-15 1988-04-30 Fujitsu Ltd Barrel shifter
US4914622A (en) * 1987-04-17 1990-04-03 Advanced Micro Devices, Inc. Array-organized bit map with a barrel shifter
EP0304948B1 (en) 1987-08-28 1994-06-01 Nec Corporation Data processor including testing structure for barrel shifter
KR970005453B1 (en) 1987-12-25 1997-04-16 가부시기가이샤 히다찌세이사꾸쇼 Data processing apparatus for high speed processing
US4926323A (en) * 1988-03-03 1990-05-15 Advanced Micro Devices, Inc. Streamlined instruction processor
JPH01263820A (en) * 1988-04-15 1989-10-20 Hitachi Ltd Microprocessor
EP0344347B1 (en) 1988-06-02 1993-12-29 Deutsche ITT Industries GmbH Digital signal processing unit
GB2229832B (en) 1989-03-30 1993-04-07 Intel Corp Byte swap instruction for memory format conversion within a microprocessor
JPH03185530A (en) 1989-12-14 1991-08-13 Mitsubishi Electric Corp Data processor
EP0436341B1 (en) * 1990-01-02 1997-05-07 Motorola, Inc. Sequential prefetch method for 1, 2 or 3 word instructions
JPH03248226A (en) * 1990-02-26 1991-11-06 Nec Corp Microprocessor
JP2560889B2 (en) 1990-05-22 1996-12-04 日本電気株式会社 Microprocessor
US5155843A (en) 1990-06-29 1992-10-13 Digital Equipment Corporation Error transition mode for multi-processor system
US5778423A (en) 1990-06-29 1998-07-07 Digital Equipment Corporation Prefetch instruction for improving performance in reduced instruction set processor
CA2045790A1 (en) * 1990-06-29 1991-12-30 Richard Lee Sites Branch prediction in high-performance processor
JP2556612B2 (en) 1990-08-29 1996-11-20 日本電気アイシーマイコンシステム株式会社 Barrel shifter circuit
US5636363A (en) * 1991-06-14 1997-06-03 Integrated Device Technology, Inc. Hardware control structure and method for off-chip monitoring entries of an on-chip cache
US5539911A (en) 1991-07-08 1996-07-23 Seiko Epson Corporation High-performance, superscalar-based computer system with out-of-order instruction execution
US5493687A (en) * 1991-07-08 1996-02-20 Seiko Epson Corporation RISC microprocessor architecture implementing multiple typed register sets
US5450586A (en) 1991-08-14 1995-09-12 Hewlett-Packard Company System for analyzing and debugging embedded software through dynamic and interactive use of code markers
CA2073516A1 (en) 1991-11-27 1993-05-28 Peter Michael Kogge Dynamic multi-mode parallel processor array architecture computer system
FR2690299B1 (en) * 1992-04-17 1994-06-17 Telecommunications Sa METHOD AND DEVICE FOR SPATIAL FILTERING OF DIGITAL IMAGES DECODED BY BLOCK TRANSFORMATION.
US5423011A (en) * 1992-06-11 1995-06-06 International Business Machines Corporation Apparatus for initializing branch prediction information
US5542074A (en) 1992-10-22 1996-07-30 Maspar Computer Corporation Parallel processor system with highly flexible local control capability, including selective inversion of instruction signal and control of bit shift amount
US5696958A (en) 1993-01-11 1997-12-09 Silicon Graphics, Inc. Method and apparatus for reducing delays following the execution of a branch instruction in an instruction pipeline
GB2275119B (en) 1993-02-03 1997-05-14 Motorola Inc A cached processor
US5937202A (en) 1993-02-11 1999-08-10 3-D Computing, Inc. High-speed, parallel, processor architecture for front-end electronics, based on a single type of ASIC, and method use thereof
US5454117A (en) 1993-08-25 1995-09-26 Nexgen, Inc. Configurable branch prediction for a processor performing speculative execution
JP2801135B2 (en) * 1993-11-26 1998-09-21 富士通株式会社 Instruction reading method and instruction reading device for pipeline processor
US5590350A (en) * 1993-11-30 1996-12-31 Texas Instruments Incorporated Three input arithmetic logic unit with mask generator
US6116768A (en) 1993-11-30 2000-09-12 Texas Instruments Incorporated Three input arithmetic logic unit with barrel rotator
US5509129A (en) * 1993-11-30 1996-04-16 Guttag; Karl M. Long instruction word controlling plural independent processor operations
US5590351A (en) 1994-01-21 1996-12-31 Advanced Micro Devices, Inc. Superscalar execution unit for sequential instruction pointer updates and segment limit checks
JPH07253922A (en) * 1994-03-14 1995-10-03 Texas Instr Japan Ltd Address generating circuit
US5530825A (en) * 1994-04-15 1996-06-25 Motorola, Inc. Data processor with branch target address cache and method of operation
US5517436A (en) * 1994-06-07 1996-05-14 Andreas; David C. Digital signal processor for audio applications
JP2000511363A (en) 1994-07-14 2000-08-29 ジョンソン、グレイス、カンパニー Method and apparatus for compressing images
US5809293A (en) 1994-07-29 1998-09-15 International Business Machines Corporation System and method for program execution tracing within an integrated processor
US5692168A (en) 1994-10-18 1997-11-25 Cyrix Corporation Prefetch buffer using flow control bit to identify changes of flow within the code stream
US5600674A (en) * 1995-03-02 1997-02-04 Motorola Inc. Method and apparatus of an enhanced digital signal processor
US5655122A (en) 1995-04-05 1997-08-05 Sequent Computer Systems, Inc. Optimizing compiler with static prediction of branch probability, branch frequency and function frequency
US5835753A (en) 1995-04-12 1998-11-10 Advanced Micro Devices, Inc. Microprocessor with dynamically extendable pipeline stages and a classifying circuit
US5920711A (en) 1995-06-02 1999-07-06 Synopsys, Inc. System for frame-based protocol, graphical capture, synthesis, analysis, and simulation
US5842004A (en) 1995-08-04 1998-11-24 Sun Microsystems, Inc. Method and apparatus for decompression of compressed geometric three-dimensional graphics data
US6292879B1 (en) 1995-10-25 2001-09-18 Anthony S. Fong Method and apparatus to specify access control list and cache enabling and cache coherency requirement enabling on individual operands of an instruction of a computer
US5727211A (en) * 1995-11-09 1998-03-10 Chromatic Research, Inc. System and method for fast context switching between tasks
US5996071A (en) 1995-12-15 1999-11-30 Via-Cyrix, Inc. Detecting self-modifying code in a pipelined processor with branch processing by comparing latched store address to subsequent target address
US5896305A (en) * 1996-02-08 1999-04-20 Texas Instruments Incorporated Shifter circuit for an arithmetic logic unit in a microprocessor
US5752014A (en) * 1996-04-29 1998-05-12 International Business Machines Corporation Automatic selection of branch prediction methodology for subsequent branch instruction based on outcome of previous branch prediction
US5784636A (en) 1996-05-28 1998-07-21 National Semiconductor Corporation Reconfigurable computer architecture for use in signal processing applications
US20010025337A1 (en) 1996-06-10 2001-09-27 Frank Worrell Microprocessor including a mode detector for setting compression mode
US5805876A (en) 1996-09-30 1998-09-08 International Business Machines Corporation Method and system for reducing average branch resolution time and effective misprediction penalty in a processor
US5964884A (en) 1996-09-30 1999-10-12 Advanced Micro Devices, Inc. Self-timed pulse control circuit
US5848264A (en) 1996-10-25 1998-12-08 S3 Incorporated Debug and video queue for multi-processor chip
US5909572A (en) 1996-12-02 1999-06-01 Compaq Computer Corp. System and method for conditionally moving an operand from a source register to a destination register
US6061521A (en) * 1996-12-02 2000-05-09 Compaq Computer Corp. Computer having multimedia operations executable as two distinct sets of operations within a single instruction cycle
EP0855645A3 (en) * 1996-12-31 2000-05-24 Texas Instruments Incorporated System and method for speculative execution of instructions with data prefetch
KR100236533B1 (en) * 1997-01-16 2000-01-15 윤종용 Digital signal processor
US6154857A (en) 1997-04-08 2000-11-28 Advanced Micro Devices, Inc. Microprocessor-based device incorporating a cache for capturing software performance profiling data
US6185732B1 (en) * 1997-04-08 2001-02-06 Advanced Micro Devices, Inc. Software debug port for a microprocessor
US6088786A (en) 1997-06-27 2000-07-11 Sun Microsystems, Inc. Method and system for coupling a stack based processor to register based functional unit
US6760833B1 (en) 1997-08-01 2004-07-06 Micron Technology, Inc. Split embedded DRAM processor
US6226738B1 (en) * 1997-08-01 2001-05-01 Micron Technology, Inc. Split embedded DRAM processor
US6026478A (en) * 1997-08-01 2000-02-15 Micron Technology, Inc. Split embedded DRAM processor
US6157988A (en) * 1997-08-01 2000-12-05 Micron Technology, Inc. Method and apparatus for high performance branching in pipelined microsystems
JPH1185515A (en) 1997-09-10 1999-03-30 Ricoh Co Ltd Microprocessor
US5923892A (en) * 1997-10-27 1999-07-13 Levy; Paul S. Host processor and coprocessor arrangement for processing platform-independent code
US5978909A (en) 1997-11-26 1999-11-02 Intel Corporation System for speculative branch target prediction having a dynamic prediction history buffer and a static prediction history buffer
US6044458A (en) * 1997-12-12 2000-03-28 Motorola, Inc. System for monitoring program flow utilizing fixwords stored sequentially to opcodes
US6014743A (en) * 1998-02-05 2000-01-11 Integrated Device Technology, Inc. Apparatus and method for recording a floating point error pointer in zero cycles
US6151672A (en) 1998-02-23 2000-11-21 Hewlett-Packard Company Methods and apparatus for reducing interference in a branch history table of a microprocessor
US6374349B2 (en) 1998-03-19 2002-04-16 Mcfarling Scott Branch predictor with serially connected predictor stages for improving branch prediction accuracy
US6377970B1 (en) * 1998-03-31 2002-04-23 Intel Corporation Method and apparatus for computing a sum of packed data elements using SIMD multiply circuitry
US6584585B1 (en) 1998-05-08 2003-06-24 Gateway, Inc. Virtual device driver and methods employing the same
US6289417B1 (en) 1998-05-18 2001-09-11 Arm Limited Operand supply to an execution unit
US6466333B2 (en) 1998-06-26 2002-10-15 Canon Kabushiki Kaisha Streamlined tetrahedral interpolation
US20020053015A1 (en) * 1998-07-14 2002-05-02 Sony Corporation And Sony Electronics Inc. Digital signal processor particularly suited for decoding digital audio
US6327651B1 (en) 1998-09-08 2001-12-04 International Business Machines Corporation Wide shifting in the vector permute unit
US6253287B1 (en) * 1998-09-09 2001-06-26 Advanced Micro Devices, Inc. Using three-dimensional storage to make variable-length instructions appear uniform in two dimensions
US6339822B1 (en) * 1998-10-02 2002-01-15 Advanced Micro Devices, Inc. Using padded instructions in a block-oriented cache
US6671743B1 (en) 1998-11-13 2003-12-30 Creative Technology, Ltd. Method and system for exposing proprietary APIs in a privileged device driver to an application
US6529930B1 (en) * 1998-11-16 2003-03-04 Hitachi America, Ltd. Methods and apparatus for performing a signed saturation operation
US6189091B1 (en) * 1998-12-02 2001-02-13 Ip First, L.L.C. Apparatus and method for speculatively updating global history and restoring same on branch misprediction detection
US6341348B1 (en) 1998-12-03 2002-01-22 Sun Microsystems, Inc. Software branch prediction filtering for a microprocessor
US6957327B1 (en) * 1998-12-31 2005-10-18 Stmicroelectronics, Inc. Block-based branch target buffer
US6477683B1 (en) 1999-02-05 2002-11-05 Tensilica, Inc. Automated processor generation system for designing a configurable processor and method for the same
US6418530B2 (en) 1999-02-18 2002-07-09 Hewlett-Packard Company Hardware/software system for instruction profiling and trace selection using branch history information for branch predictions
US6757019B1 (en) 1999-03-13 2004-06-29 The Board Of Trustees Of The Leland Stanford Junior University Low-power parallel processor and imager having peripheral control circuitry
US6499101B1 (en) 1999-03-18 2002-12-24 I.P. First L.L.C. Static branch prediction mechanism for conditional branch instructions
US6427206B1 (en) 1999-05-03 2002-07-30 Intel Corporation Optimized branch predictions for strongly predicted compiler branches
US6560754B1 (en) * 1999-05-13 2003-05-06 Arc International Plc Method and apparatus for jump control in a pipelined processor
US6622240B1 (en) 1999-06-18 2003-09-16 Intrinsity, Inc. Method and apparatus for pre-branch instruction
US6518974B2 (en) 1999-07-16 2003-02-11 Intel Corporation Pixel engine
JP2001034504A (en) * 1999-07-19 2001-02-09 Mitsubishi Electric Corp Source level debugger
US6772325B1 (en) 1999-10-01 2004-08-03 Hitachi, Ltd. Processor architecture and operation for exploiting improved branch control instruction
US6546481B1 (en) 1999-11-05 2003-04-08 Ip - First Llc Split history tables for branch prediction
US7072398B2 (en) * 2000-12-06 2006-07-04 Kai-Kuang Ma System and method for motion vector generation and analysis of digital video clips
US6609194B1 (en) 1999-11-12 2003-08-19 Ip-First, Llc Apparatus for performing branch target address calculation based on branch type
US6909744B2 (en) 1999-12-09 2005-06-21 Redrock Semiconductor, Inc. Processor architecture for compression and decompression of video and images
KR100395763B1 (en) 2000-02-01 2003-08-25 삼성전자주식회사 A branch predictor for microprocessor having multiple processes
US6412038B1 (en) 2000-02-14 2002-06-25 Intel Corporation Integral modular cache for a processor
WO2001063434A1 (en) * 2000-02-24 2001-08-30 Bops, Incorporated Methods and apparatus for dual-use coprocessing/debug interface
US6519696B1 (en) * 2000-03-30 2003-02-11 I.P. First, Llc Paired register exchange using renaming register map
US6876703B2 (en) * 2000-05-11 2005-04-05 Ub Video Inc. Method and apparatus for video coding
US7079579B2 (en) * 2000-07-13 2006-07-18 Samsung Electronics Co., Ltd. Block matching processor and method for block matching motion estimation in video compression
US6681295B1 (en) * 2000-08-31 2004-01-20 Hewlett-Packard Development Company, L.P. Fast lane prefetching
US6718460B1 (en) * 2000-09-05 2004-04-06 Sun Microsystems, Inc. Mechanism for error handling in a computer system
US20020065860A1 (en) * 2000-10-04 2002-05-30 Grisenthwaite Richard Roy Data processing apparatus and method for saturating data values
US20030070013A1 (en) * 2000-10-27 2003-04-10 Daniel Hansson Method and apparatus for reducing power consumption in a digital processor
US6948054B2 (en) * 2000-11-29 2005-09-20 Lsi Logic Corporation Simple branch prediction and misprediction recovery method
KR100386639B1 (en) * 2000-12-04 2003-06-02 주식회사 오픈비주얼 Method for decompression of images and video using regularized dequantizer
TW477954B (en) * 2000-12-05 2002-03-01 Faraday Tech Corp Memory data accessing architecture and method for a processor
US20020073301A1 (en) * 2000-12-07 2002-06-13 International Business Machines Corporation Hardware for use with compiler generated branch information
US7139903B2 (en) 2000-12-19 2006-11-21 Hewlett-Packard Development Company, L.P. Conflict free parallel read access to a bank interleaved branch predictor in a processor
US6963554B1 (en) 2000-12-27 2005-11-08 National Semiconductor Corporation Microwire dynamic sequencer pipeline stall
US6877089B2 (en) 2000-12-27 2005-04-05 International Business Machines Corporation Branch prediction apparatus and process for restoring replaced branch history for use in future branch predictions for an executing program
US8285976B2 (en) 2000-12-28 2012-10-09 Micron Technology, Inc. Method and apparatus for predicting branches using a meta predictor
US20020087851A1 (en) 2000-12-28 2002-07-04 Matsushita Electric Industrial Co., Ltd. Microprocessor and an instruction converter
US6925634B2 (en) 2001-01-24 2005-08-02 Texas Instruments Incorporated Method for maintaining cache coherency in software in a shared memory system
US7039901B2 (en) 2001-01-24 2006-05-02 Texas Instruments Incorporated Software shared memory bus
US6823447B2 (en) 2001-03-01 2004-11-23 International Business Machines Corporation Software hint to improve the branch target prediction accuracy
EP1381957A2 (en) 2001-03-02 2004-01-21 Atsana Semiconductor Corp. Data processing apparatus and system and method for controlling memory access
JP3890910B2 (en) 2001-03-21 2007-03-07 株式会社日立製作所 Instruction execution result prediction device
US7010558B2 (en) * 2001-04-19 2006-03-07 Arc International Data processor with enhanced instruction execution and method
US20020194461A1 (en) 2001-05-04 2002-12-19 Ip First Llc Speculative branch target address cache
US7165168B2 (en) 2003-01-14 2007-01-16 Ip-First, Llc Microprocessor with branch target address cache update queue
US6886093B2 (en) 2001-05-04 2005-04-26 Ip-First, Llc Speculative hybrid branch direction predictor
US7200740B2 (en) 2001-05-04 2007-04-03 Ip-First, Llc Apparatus and method for speculatively performing a return instruction in a microprocessor
US7165169B2 (en) 2001-05-04 2007-01-16 Ip-First, Llc Speculative branch target address cache with selective override by secondary predictor based on branch instruction type
US20020194462A1 (en) 2001-05-04 2002-12-19 Ip First Llc Apparatus and method for selecting one of multiple target addresses stored in a speculative branch target address cache per instruction cache line
GB0112275D0 (en) 2001-05-21 2001-07-11 Micron Technology Inc Method and circuit for normalization of floating point significands in a simd array mpp
GB0112269D0 (en) * 2001-05-21 2001-07-11 Micron Technology Inc Method and circuit for alignment of floating point significands in a simd array mpp
US6950929B2 (en) 2001-05-24 2005-09-27 Samsung Electronics Co., Ltd. Loop instruction processing using loop buffer in a data processing device having a coprocessor
EP1405174A1 (en) 2001-06-29 2004-04-07 Koninklijke Philips Electronics N.V. Method, apparatus and compiler for predicting indirect branch target addresses
US6823444B1 (en) 2001-07-03 2004-11-23 Ip-First, Llc Apparatus and method for selectively accessing disparate instruction buffer stages based on branch target address cache hit and instruction stage wrap
US7162619B2 (en) * 2001-07-03 2007-01-09 Ip-First, Llc Apparatus and method for densely packing a branch instruction predicted by a branch target address cache and associated target instructions into a byte-wide instruction buffer
JP4145586B2 (en) 2001-07-24 2008-09-03 セイコーエプソン株式会社 Image processing apparatus, image processing program, and image processing method
US7010675B2 (en) * 2001-07-27 2006-03-07 Stmicroelectronics, Inc. Fetch branch architecture for reducing branch penalty without branch prediction
US7191445B2 (en) * 2001-08-31 2007-03-13 Texas Instruments Incorporated Method using embedded real-time analysis components with corresponding real-time operating system software objects
JP2003131902A (en) 2001-10-24 2003-05-09 Toshiba Corp Software debugger, system-level debugger, debug method and debug program
US20040054877A1 (en) * 2001-10-29 2004-03-18 Macy William W. Method and apparatus for shuffling data
US7272622B2 (en) 2001-10-29 2007-09-18 Intel Corporation Method and apparatus for parallel shift right merge of data
US7685212B2 (en) 2001-10-29 2010-03-23 Intel Corporation Fast full search motion estimation with SIMD merge instruction
US7051239B2 (en) 2001-12-28 2006-05-23 Hewlett-Packard Development Company, L.P. Method and apparatus for efficiently implementing trace and/or logic analysis mechanisms on a processor chip
CN1625731A (en) 2002-01-31 2005-06-08 Arc国际公司 Configurable data processor with multi-length instruction set architecture
US7168067B2 (en) 2002-02-08 2007-01-23 Agere Systems Inc. Multiprocessor system with cache-based software breakpoints
US7529912B2 (en) 2002-02-12 2009-05-05 Via Technologies, Inc. Apparatus and method for instruction-level specification of floating point format
US7181596B2 (en) * 2002-02-12 2007-02-20 Ip-First, Llc Apparatus and method for extending a microprocessor instruction set
US7328328B2 (en) 2002-02-19 2008-02-05 Ip-First, Llc Non-temporal memory reference control mechanism
US7315921B2 (en) 2002-02-19 2008-01-01 Ip-First, Llc Apparatus and method for selective memory attribute control
US7395412B2 (en) 2002-03-08 2008-07-01 Ip-First, Llc Apparatus and method for extending data modes in a microprocessor
US7546446B2 (en) 2002-03-08 2009-06-09 Ip-First, Llc Selective interrupt suppression
US7180943B1 (en) 2002-03-26 2007-02-20 The United States Of America As Represented By The Administrator Of The National Aeronautics And Space Administration Compression of a data stream by selection among a set of compression tools
US7155598B2 (en) 2002-04-02 2006-12-26 Ip-First, Llc Apparatus and method for conditional instruction execution
US7302551B2 (en) 2002-04-02 2007-11-27 Ip-First, Llc Suppression of store checking
US7373483B2 (en) 2002-04-02 2008-05-13 Ip-First, Llc Mechanism for extending the number of registers in a microprocessor
US7380103B2 (en) 2002-04-02 2008-05-27 Ip-First, Llc Apparatus and method for selective control of results write back
US7185180B2 (en) 2002-04-02 2007-02-27 Ip-First, Llc Apparatus and method for selective control of condition code write back
US20030198295A1 (en) * 2002-04-12 2003-10-23 Liang-Gee Chen Global elimination algorithm for motion estimation and the hardware architecture thereof
US7380109B2 (en) 2002-04-15 2008-05-27 Ip-First, Llc Apparatus and method for providing extended address modes in an existing instruction set for a microprocessor
US20030204705A1 (en) 2002-04-30 2003-10-30 Oldfield William H. Prediction of branch instructions in a data processing apparatus
KR100450753B1 (en) 2002-05-17 2004-10-01 한국전자통신연구원 Programmable variable length decoder including interface of CPU processor
US6938151B2 (en) 2002-06-04 2005-08-30 International Business Machines Corporation Hybrid branch prediction using a global selection counter and a prediction method comparison table
US6718504B1 (en) * 2002-06-05 2004-04-06 Arc International Method and apparatus for implementing a data processor adapted for turbo decoding
US7493480B2 (en) * 2002-07-18 2009-02-17 International Business Machines Corporation Method and apparatus for prefetching branch history information
US7392368B2 (en) * 2002-08-09 2008-06-24 Marvell International Ltd. Cross multiply and add instruction and multiply and subtract instruction SIMD execution on real and imaginary components of a plurality of complex data elements
US7000095B2 (en) * 2002-09-06 2006-02-14 Mips Technologies, Inc. Method and apparatus for clearing hazards using jump instructions
WO2004030369A1 (en) * 2002-09-27 2004-04-08 Videosoft, Inc. Real-time video coding/decoding
US20050125634A1 (en) 2002-10-04 2005-06-09 Fujitsu Limited Processor and instruction control method
US6968444B1 (en) 2002-11-04 2005-11-22 Advanced Micro Devices, Inc. Microprocessor employing a fixed position dispatch unit
US7227901B2 (en) * 2002-11-21 2007-06-05 Ub Video Inc. Low-complexity deblocking filter
US8667252B2 (en) * 2002-11-21 2014-03-04 Stmicroelectronics, Inc. Method and apparatus to adapt the clock rate of a programmable coprocessor for optimal performance and power dissipation
US7266676B2 (en) 2003-03-21 2007-09-04 Analog Devices, Inc. Method and apparatus for branch prediction based on branch targets utilizing tag and data arrays
US6774832B1 (en) 2003-03-25 2004-08-10 Raytheon Company Multi-bit output DDS with real time delta sigma modulation look up from memory
US7174444B2 (en) 2003-03-31 2007-02-06 Intel Corporation Preventing a read of a next sequential chunk in branch prediction of a subject chunk
US20040193855A1 (en) 2003-03-31 2004-09-30 Nicolas Kacevas System and method for branch prediction access
US7590829B2 (en) 2003-03-31 2009-09-15 Stretch, Inc. Extension adapter
US20040225870A1 (en) 2003-05-07 2004-11-11 Srinivasan Srikanth T. Method and apparatus for reducing wrong path execution in a speculative multi-threaded processor
US7010676B2 (en) 2003-05-12 2006-03-07 International Business Machines Corporation Last iteration loop branch prediction upon counter threshold and resolution upon counter one
US7079147B2 (en) * 2003-05-14 2006-07-18 Lsi Logic Corporation System and method for cooperative operation of a processor and coprocessor
US20040252766A1 (en) * 2003-06-11 2004-12-16 Daeyang Foundation (Sejong University) Motion vector search method and apparatus
US20040255104A1 (en) 2003-06-12 2004-12-16 Intel Corporation Method and apparatus for recycling candidate branch outcomes after a wrong-path execution in a superscalar processor
US7668897B2 (en) 2003-06-16 2010-02-23 Arm Limited Result partitioning within SIMD data processing systems
US7783871B2 (en) 2003-06-30 2010-08-24 Intel Corporation Method to remove stale branch predictions for an instruction prior to execution within a microprocessor
US7424501B2 (en) * 2003-06-30 2008-09-09 Intel Corporation Nonlinear filtering and deblocking applications utilizing SIMD sign and absolute value operations
US7539714B2 (en) * 2003-06-30 2009-05-26 Intel Corporation Method, apparatus, and instruction for performing a sign operation that multiplies
US7373642B2 (en) 2003-07-29 2008-05-13 Stretch, Inc. Defining instruction extensions in a standard programming language
US20050024486A1 (en) 2003-07-31 2005-02-03 Viresh Ratnakar Video codec system with real-time complexity adaptation
US20050027974A1 (en) * 2003-07-31 2005-02-03 Oded Lempel Method and system for conserving resources in an instruction pipeline
US7133950B2 (en) * 2003-08-19 2006-11-07 Sun Microsystems, Inc. Request arbitration in multi-core processor
JP2005078234A (en) * 2003-08-29 2005-03-24 Renesas Technology Corp Information processor
US7237098B2 (en) * 2003-09-08 2007-06-26 Ip-First, Llc Apparatus and method for selectively overriding return stack prediction in response to detection of non-standard return sequence
US20050066305A1 (en) * 2003-09-22 2005-03-24 Lisanke Robert John Method and machine for efficient simulation of digital hardware within a software development environment
US7277592B1 (en) 2003-10-21 2007-10-02 Redrock Semiconductory Ltd. Spacial deblocking method using limited edge differences only to linearly correct blocking artifact
US7457362B2 (en) * 2003-10-24 2008-11-25 Texas Instruments Incorporated Loop deblock filtering of block coded video in a very long instruction word processor
KR100980076B1 (en) * 2003-10-24 2010-09-06 삼성전자주식회사 System and method for branch prediction with low-power consumption
US7363544B2 (en) * 2003-10-30 2008-04-22 International Business Machines Corporation Program debug method and apparatus
US7219207B2 (en) 2003-12-03 2007-05-15 Intel Corporation Reconfigurable trace cache
US8069336B2 (en) 2003-12-03 2011-11-29 Globalfoundries Inc. Transitioning from instruction cache to trace cache on label boundaries
US7401328B2 (en) 2003-12-18 2008-07-15 Lsi Corporation Software-implemented grouping techniques for use in a superscalar data processing system
US7293164B2 (en) 2004-01-14 2007-11-06 International Business Machines Corporation Autonomic method and apparatus for counting branch instructions to generate branch statistics meant to improve branch predictions
US8607209B2 (en) 2004-02-04 2013-12-10 Bluerisc Inc. Energy-focused compiler-assisted branch prediction
US7613911B2 (en) 2004-03-12 2009-11-03 Arm Limited Prefetching exception vectors by early lookup exception vectors within a cache memory
US20050216713A1 (en) 2004-03-25 2005-09-29 International Business Machines Corporation Instruction text controlled selectively stated branches for prediction via a branch target buffer
US7281120B2 (en) 2004-03-26 2007-10-09 International Business Machines Corporation Apparatus and method for decreasing the latency between an instruction cache and a pipeline processor
US20050223202A1 (en) 2004-03-31 2005-10-06 Intel Corporation Branch prediction in a pipelined processor
US20050278517A1 (en) 2004-05-19 2005-12-15 Kar-Lik Wong Systems and methods for performing branch prediction in a variable length instruction set microprocessor
US20060015706A1 (en) * 2004-06-30 2006-01-19 Chunrong Lai TLB correlated branch predictor and method for use thereof
TWI253024B (en) * 2004-07-20 2006-04-11 Realtek Semiconductor Corp Method and apparatus for block matching
TWI305323B (en) * 2004-08-23 2009-01-11 Faraday Tech Corp Method for verification branch prediction mechanisms and readable recording medium for storing program thereof
US20060047934A1 (en) * 2004-08-31 2006-03-02 Schmisseur Mark A Integrated circuit capable of memory access control
US20060095713A1 (en) * 2004-11-03 2006-05-04 Stexar Corporation Clip-and-pack instruction for processor
WO2006096612A2 (en) 2005-03-04 2006-09-14 The Trustees Of Columbia University In The City Of New York System and method for motion estimation and mode decision for low-complexity H.264 decoder
US7971042B2 (en) * 2005-09-28 2011-06-28 Synopsys, Inc. Microprocessor system and method for instruction-initiated recording and execution of instruction sequences in a dynamically decoupleable extended instruction pipeline
EP2163097A2 (en) 2007-05-25 2010-03-17 Arc International, Plc Adaptive video encoding apparatus and methods

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO1993008526A1 (en) * 1991-10-21 1993-04-29 Intel Corporation Cross coupling mechanisms for microprocessor instructions using pipelining systems
US5884057A (en) * 1994-01-11 1999-03-16 Exponential Technology, Inc. Temporal re-alignment of a floating point pipeline to an integer pipeline for emulation of a load-operate architecture on a load/store processor
GB2365583B (en) * 2000-02-18 2004-08-04 Hewlett Packard Co Pipeline decoupling buffer for handling early data and late data

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Smith, J. E., et al., "The Astronautics ZS-1 Processor", Proceedings of the International Conference on Computer Design: VLSI in Computers and Processors (ICCD), New York, 3-5 October 1988, IEEE Comp. Soc. Press, pages 307-310, XP000093793, ISBN: 0-8186-0872-2 *

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7971042B2 (en) 2005-09-28 2011-06-28 Synopsys, Inc. Microprocessor system and method for instruction-initiated recording and execution of instruction sequences in a dynamically decoupleable extended instruction pipeline
WO2018186918A1 (en) * 2017-04-03 2018-10-11 Google Llc Vector reduction processor
US10108581B1 (en) 2017-04-03 2018-10-23 Google Llc Vector reduction processor
US10706007B2 (en) 2017-04-03 2020-07-07 Google Llc Vector reduction processor
US11061854B2 (en) 2017-04-03 2021-07-13 Google Llc Vector reduction processor
EP4086760A1 (en) * 2017-04-03 2022-11-09 Google LLC Vector reduction processor
US11940946B2 (en) 2017-04-03 2024-03-26 Google Llc Vector reduction processor

Also Published As

Publication number Publication date
US20070071101A1 (en) 2007-03-29
US20070074007A1 (en) 2007-03-29
WO2007049150A3 (en) 2007-12-27
US8218635B2 (en) 2012-07-10
US20070074012A1 (en) 2007-03-29
US7747088B2 (en) 2010-06-29
US20070070080A1 (en) 2007-03-29
US7971042B2 (en) 2011-06-28
US20070073925A1 (en) 2007-03-29
US20070071106A1 (en) 2007-03-29
US8212823B2 (en) 2012-07-03
US20070074004A1 (en) 2007-03-29

Similar Documents

Publication Publication Date Title
WO2007049150A2 (en) Architecture for microprocessor-based systems including simd processing unit and associated systems and methods
US8116379B2 (en) Method and apparatus for parallel processing of in-loop deblocking filter for H.264 video compression standard
US20230336731A1 (en) Video decoding implementations for a graphics processing unit
Zhou et al. Implementation of H.264 decoder on general-purpose processors with media instructions
US8243815B2 (en) Systems and methods of video compression deblocking
US8369420B2 (en) Multimode filter for de-blocking and de-ringing
US8516026B2 (en) SIMD supporting filtering in a video decoding system
US8213511B2 (en) Video encoder software architecture for VLIW cores incorporating inter prediction and intra prediction
US8369419B2 (en) Systems and methods of video compression deblocking
US7457362B2 (en) Loop deblock filtering of block coded video in a very long instruction word processor
US20100321579A1 (en) Front End Processor with Extendable Data Path
US9060169B2 (en) Methods and apparatus for providing a scalable deblocking filtering assist function within an array processor
WO2006063260A2 (en) Digital signal processing structure for decoding multiple video standards
Shafique et al. Optimizing the H.264/AVC video encoder application structure for reconfigurable and application-specific platforms
Koziri et al. Implementation of the AVS video decoder on a heterogeneous dual-core SIMD processor
US20210383504A1 (en) Apparatus and method for efficient motion estimation
WO2008037113A1 (en) Apparatus and method for processing video data
Holliman et al. MPEG decoding workload characterization
US7756351B2 (en) Low power, high performance transform coprocessor for video compression
NO329837B1 (en) Process for processor-efficient deblock filtering
Ramadurai et al. Implementation of H.264 decoder on Sandblaster DSP
Bhatia Optimization of H.264 high profile decoder for Pentium 4 processor
Petrescu Efficient implementation of video post-processing algorithms on the BOPS parallel architecture
Mizosoe et al. Software implementation of MPEG-2 decoder on VLIW media processors
López et al. Toward the implementation of a baseline H.264/AVC decoder onto a reconfigurable architecture

Legal Events

Date Code Title Description
121 Ep: the EPO has been informed by WIPO that EP was designated in this application
NENP Non-entry into the national phase (Ref country code: DE)
122 Ep: PCT application non-entry in European phase (Ref document number: 06820977; Country of ref document: EP; Kind code of ref document: A2)