WO2003038601A1 - Method and apparatus for parallel shift right merge of data
- Publication number
- WO2003038601A1 (PCT/US2002/034404)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- data
- operand
- data elements
- shifted
- shift
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/30—Arrangements for executing machine instructions, e.g. instruction decode
- G06F9/30003—Arrangements for executing specific machine instructions
- G06F9/30007—Arrangements for executing specific machine instructions to perform operations on data operands
- G06F9/3001—Arithmetic instructions
- G06F9/30014—Arithmetic instructions with variable precision
- G06F9/30018—Bit or string instructions
- G06F9/30025—Format conversion instructions, e.g. Floating-Point to Integer, decimal conversion
- G06F9/30032—Movement instructions, e.g. MOVE, SHIFT, ROTATE, SHUFFLE
- G06F9/30036—Instructions to perform operations on packed data, e.g. vector, tile or matrix operations
- G06F9/30098—Register arrangements
- G06F9/30105—Register structure
- G06F9/30109—Register structure having multiple operands in a single register
- G06F9/38—Concurrent instruction execution, e.g. pipeline, look ahead
- G06F9/3885—Concurrent instruction execution, e.g. pipeline, look ahead using a plurality of independent parallel functional units
- G06F17/00—Digital computing or data processing equipment or methods, specially adapted for specific functions
- G06F17/10—Complex mathematical operations
- G06F17/14—Fourier, Walsh or analogous domain transformations, e.g. Laplace, Hilbert, Karhunen-Loeve, transforms
- G06F17/147—Discrete orthonormal transforms, e.g. discrete cosine transform, discrete sine transform, and variations therefrom, e.g. modified discrete cosine transform, integer transforms approximating the discrete cosine transform
- G06F17/15—Correlation function computation including computation of convolution operations
Definitions
- the present invention relates generally to the field of microprocessors and computer systems. More particularly, the present invention relates to a method and apparatus for parallel shift right merge of data.
- BACKGROUND OF THE INVENTION
- filtering as well as convolution operations are vital to computing devices which offer playback of content, including image, audio and video data.
- current methods and instructions target the general needs of filtering and are not comprehensive.
- many architectures do not support a means for efficient filter calculations for a range of filter lengths and data types.
- data ordering within data storage devices such as SIMD registers, as well as a capability of adding adjacent values in a register and for partial data transfers between registers, are generally not supported.
- current architectures require unnecessary data type changes, which reduce the number of operations per instruction and significantly increase the number of clock cycles required to order data for arithmetic operations.
- FIG. 1 depicts a block diagram illustrating a computer system capable of implementing one embodiment of the present invention.
- FIG. 2 depicts a block diagram illustrating an embodiment of the processor as depicted in FIG. 1 in accordance with a further embodiment of the present invention.
- FIG. 3 depicts a block diagram illustrating packed data types according to a further embodiment of the present invention.
- FIG. 4A illustrates in-register packed byte representations according to one embodiment of the present invention.
- FIG. 4B illustrates an in-register packed word representation according to one embodiment of the present invention.
- FIG. 4C illustrates in-register packed doubleword representations according to one embodiment of the present invention.
- FIG. 5 depicts a block diagram illustrating operation of a byte shuffle instruction in accordance with an embodiment of the present invention.
- FIG. 6 depicts a block diagram illustrating a byte multiply-accumulate instruction in accordance with an embodiment of the present invention.
- FIGS. 7A-7C depict block diagrams illustrating the byte shuffle instruction of FIG. 5 combined with the byte multiply-accumulate instruction as depicted in FIG. 6 to generate a plurality of summed-product pairs in accordance with a further embodiment of the present invention.
- FIGS. 8A-8D depict block diagrams illustrating an adjacent-add instruction in accordance with a further embodiment of the present invention.
- FIGS. 9A and 9B depict a register merge instruction in accordance with a further embodiment of the present invention.
- FIG. 10 depicts a block diagram illustrating a flowchart for efficient data processing of content data in accordance with one embodiment of the present invention.
- FIG. 11 depicts a block diagram illustrating an additional method for processing content data according to a data processing operation in accordance with a further embodiment of the present invention.
- FIG. 12 depicts a block diagram illustrating a flowchart for continued processing of content data in accordance with a further embodiment of the present invention.
- FIG. 13 depicts a block diagram illustrating a flowchart illustrating a register merge operation in accordance with a further embodiment of the present invention.
- FIG. 14 depicts a flowchart illustrating an additional method for selecting unprocessed data elements from a source data storage device in accordance with an exemplary embodiment of the present invention.
- Figure 15 is a block diagram of the micro-architecture for a processor of one embodiment that includes logic circuits to perform parallel shift right merge operations in accordance with the present invention.
- Figure 16A is a block diagram of one embodiment of logic to perform a parallel shift right merge operation on data operands in accordance with the present invention.
- Figure 16B is a block diagram of another embodiment of logic to perform a shift right merge operation.
- Figure 17A illustrates the operation of a parallel shift right merge instruction in accordance with a first embodiment of the present invention.
- Figure 17B illustrates the operation of a shift right merge instruction in accordance with a second embodiment.
- Figure 18A is a flow chart illustrating one embodiment of a method to shift right and merge data operands in parallel.
- Figure 18B is a flow chart illustrating another embodiment of a method to shift right and merge data.
- Figures 19A-B illustrate examples of motion estimation.
- Figure 20 illustrates an example application of motion estimation and a resulting prediction.
- Figures 21A-B illustrate example current and previous frames that are processed during motion estimation.
- Figures 22A-D illustrate the operations of motion estimation on frames in accordance with one embodiment of the present invention.
- Figures 23A-B are a flow chart illustrating one embodiment of a method to predict and estimate motion.
- a method and apparatus for performing a parallel shift right merge on data is disclosed.
- a method and apparatus for efficient filtering and convolution of content data are also described.
- a method and apparatus for a fast full search motion estimation with SIMD merge operations is also disclosed.
- the embodiments described herein are described in the context of a microprocessor, but are not so limited. Although the following embodiments are described with reference to a processor, other embodiments are applicable to other types of integrated circuits and logic devices. The same techniques and teachings of the present invention can easily be applied to other types of circuits or semiconductor devices that can benefit from higher pipeline throughput and improved performance.
- the teachings of the present invention are applicable to any processor or machine that performs data manipulations. However, the present invention is not limited to processors or machines that perform 256 bit, 128 bit, 64 bit, 32 bit, or 16 bit data operations and can be applied to any processor and machine in which shift right merge of data is needed.
- the instructions can be used to cause a general-purpose or special-purpose processor that is programmed with the instructions to perform the steps of the present invention.
- the steps of the present invention might be performed by specific hardware components that contain hardwired logic for performing the steps, or by any combination of programmed computer components and custom hardware components.
- the present invention may be provided as a computer program product or software which may include a machine or computer-readable medium having stored thereon instructions which may be used to program a computer (or other electronic devices) to perform a process according to the present invention.
- Such software can be stored within a memory in the system.
- the code can be distributed via a network or by way of other computer readable media.
- the computer-readable medium may include, but is not limited to, floppy diskettes, optical disks, Compact Disc Read-Only Memory (CD-ROMs), magneto-optical disks, and Read-Only Memory (ROMs).
- the computer-readable medium includes any type of media/machine-readable medium suitable for storing or transmitting electronic instructions or information in a form readable by a machine (e.g., a computer).
- the present invention may also be downloaded as a computer program product. As such, the program may be transferred from a remote computer (e.g., a server) to a requesting computer (e.g., a client).
- the transfer of the program may be by way of electrical, optical, acoustical, or other forms of data signals embodied in a carrier wave or other propagation medium via a communication link (e.g., a modem, network connection or the like).
- SIMD Single Instruction, Multiple Data
- SSE Streaming SIMD Extensions
- Embodiments of the present invention provide a way to implement a parallel shift right instruction as an algorithm that makes use of SIMD related hardware.
- the algorithm is based on the concept of right shifting a desired number of data segments from one operand into the most significant side of a second operand as the same number of data segments are shifted out the least significant side of the second operand.
- the right shift merge operation can be viewed as merging two blocks of data together as one block and shifting the joined block to align the data segments at the desired location to form a new pattern of data.
- embodiments of a shift right merge algorithm in accordance with the present invention can be implemented in a processor to support SIMD operations efficiently without seriously compromising overall performance.
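The shift right merge described above can be sketched in a few lines of Python. This is only an illustrative model, not the claimed hardware logic: the function name `shift_right_merge`, the list representation (index 0 as the least significant data segment), and the bounds on `count` are assumptions made for the example.

```python
def shift_right_merge(first, second, count):
    """Shift `count` segments of `first` into the top of `second`.

    Both operands are lists of equal length; index 0 is assumed to be
    the least significant data segment.
    """
    width = len(second)
    if len(first) != width:
        raise ValueError("operands must hold the same number of segments")
    if not 0 <= count <= width:
        raise ValueError("count must be between 0 and the operand width")
    # Join the two operands into one block: `second` forms the low half,
    # `first` forms the high half.
    joined = second + first
    # Shifting the joined block right by `count` drops the lowest segments
    # of `second` and brings the lowest segments of `first` in at the top.
    return joined[count:count + width]


low = list(range(0, 16))     # second operand: segments 0..15
high = list(range(16, 32))   # first operand: segments 16..31
merged = shift_right_merge(high, low, 3)
print(merged)
```

With a shift count of three, the three lowest segments of the second operand are shifted out and the three lowest segments of the first operand appear at the most significant side of the result.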
- FIG. 1 shows a computer system 100 upon which one embodiment of the present invention can be implemented.
- Computer system 100 comprises a bus 101 for communicating information, and processor 109 coupled to bus 101 for processing information.
- the computer system 100 also includes a memory subsystem 104-107 coupled to bus 101 for storing information and instructions for processor 109.
- Processor 109 includes an execution unit 130, a register file 200, a cache memory 160, a decoder 165, and an internal bus 170.
- Cache memory 160 is coupled to execution unit 130 and stores frequently and/or recently used information for processor 109.
- Register file 200 stores information in processor 109 and is coupled to execution unit 130 via internal bus 170.
- register file 200 includes multimedia registers, for example, SIMD registers for storing multimedia information.
- multimedia registers each store up to one hundred twenty-eight bits of packed data.
- Multimedia registers may be dedicated multimedia registers or registers which are used for storing multimedia information and other information.
- multimedia registers store multimedia data when performing multimedia operations and store floating point data when performing floating point operations.
- Execution unit 130 operates on packed data according to the instructions received by processor 109 that are included in packed instruction set 140. Execution unit 130 also operates on scalar data according to instructions implemented in general-purpose processors.
- Processor 109 is capable of supporting the Pentium® microprocessor instruction set and the packed instruction set 140.
- By including packed instruction set 140 in a standard microprocessor instruction set, such as the Pentium® microprocessor instruction set, packed data instructions can be easily incorporated into existing software (previously written for the standard microprocessor instruction set).
- Other standard instruction sets such as the PowerPC™ and the Alpha™ processor instruction sets may also be used in accordance with the described invention.
- Pentium® is a registered trademark of Intel Corporation.
- PowerPC™ is a trademark of IBM, APPLE COMPUTER and MOTOROLA.
- the packed instruction set 140 includes instructions (as described in further detail below) for a move data (MOVD) operation 143, and a data shuffle operation (PSHUFD) 145 for organizing data within a data storage device.
- the packed instruction set 140 also includes a packed multiply-accumulate operation (PMADDUSBW operation 147) for an unsigned first source register and a signed second source register.
- a packed multiply-accumulate operation (PMADDUUBW operation 149) for performing a multiply and accumulate for an unsigned first source register and an unsigned second source register.
- the packed instruction set includes adjacent-add instructions for adding adjacent bytes (PAADDNB operation 155), words (PAADDNWD operation 157), doublewords (PAADDNDWD 159), two word values (PAADDWD 161), two words to produce a 16-bit result (PAADDNWW operation 163), and two quadwords to produce a quadword result (PAADDNDD operation 165), as well as a register merge operation 167.
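The idea of adding adjacent values in a register can be illustrated with a small Python model. The exact semantics of the adjacent-add instructions (result widths, overflow behaviour, which neighbours are paired) are not given in this passage, so the sketch below simply assumes that non-overlapping neighbouring pairs are summed and that each sum wraps to a caller-chosen width. The name `adjacent_add` and the `result_bits` parameter are hypothetical.

```python
def adjacent_add(elements, result_bits):
    """Sum non-overlapping pairs of neighbouring elements.

    Assumption: each sum wraps around to `result_bits` bits. The actual
    instructions may widen or saturate instead.
    """
    if len(elements) % 2:
        raise ValueError("need an even number of elements")
    mask = (1 << result_bits) - 1
    return [(elements[i] + elements[i + 1]) & mask
            for i in range(0, len(elements), 2)]


words = [1, 2, 3, 4, 5, 6, 7, 8]          # eight packed words
pair_sums = adjacent_add(words, 16)       # four sums, each kept to 16 bits
print(pair_sums)
```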
- the operations used by many existing multimedia applications may be performed using packed data in a general-purpose processor.
- many multimedia applications may be accelerated and executed more efficiently by using the full width of a processor's data bus for performing operations on packed data. This eliminates the need to transfer smaller units of data across the processor's data bus to perform one or more operations one data element at a time.
- the computer system 100 of the present invention may include a display device 121 such as a monitor.
- the display device 121 may include an intermediate device such as a frame buffer.
- the computer system 100 also includes an input device 122 such as a keyboard, and a cursor control 123 such as a mouse, or trackball, or trackpad.
- the display device 121, the input device 122, and the cursor control 123 are coupled to bus 101.
- Computer system 100 may also include a network connector 124 such that computer system 100 is part of a local area network (LAN) or a wide area network (WAN).
- computer system 100 can be coupled to a device for sound recording, and/or playback 125, such as an audio digitizer coupled to a microphone for recording voice input for speech recognition.
- Computer system 100 may also include a video digitizing device 126 that can be used to capture video images, a hard copy device 127 such as a printer, and a CD-ROM device 128.
- the devices 124-128 are also coupled to bus 101.
- FIG. 2 illustrates a detailed diagram of processor 109.
- Processor 109 can be implemented on one or more substrates using any of a number of process technologies, such as, BiCMOS, CMOS, and NMOS.
- Processor 109 comprises a decoder 202 for decoding control signals and data used by processor 109. Data can then be stored in register file 200 via internal bus 205.
- the registers of an embodiment should not be limited in meaning to a particular type of circuit. Rather, a register of an embodiment need only be capable of storing and providing data, and performing the functions described herein.
- the data may be stored in integer registers 201, registers 209, status registers 208, or instruction pointer register 211.
- Other registers can be included in the register file 204, for example, floating point registers.
- integer registers 201 store thirty-two bit integer data.
- registers 209 contain eight multimedia registers, R0 212a through R7 212h, for example, SIMD registers containing packed data. Each register in registers 209 is one hundred twenty-eight bits in length.
- R1 212a, R2 212b and R3 212c are examples of individual registers in registers 209.
- Thirty-two bits of a register in registers 209 can be moved into an integer register in integer registers 201. Similarly, a value in an integer register can be moved into thirty-two bits of a register in registers 209. Status registers 208 indicate the status of processor 109. Instruction pointer register 211 stores the address of the next instruction to be executed. Integer registers 201, registers 209, status registers 208, and instruction pointer register 211 all connect to internal bus 205. Any additional registers would also connect to the internal bus 205.
- registers 209 and integer registers 201 can be combined where each register can store either integer data or packed data.
- registers 209 can be used as floating point registers.
- packed data or floating point data can be stored in registers 209.
- the combined registers are one hundred twenty-eight bits in length and integers are represented as one hundred twenty-eight bits. In this embodiment, in storing packed data and integer data, the registers do not need to differentiate between the two data types.
- Functional unit 203 performs the operations carried out by processor 109. Such operations may include shifts, addition, subtraction and multiplication, etc.
- FIG. 3 illustrates three packed data-types: packed byte 221, packed word 222, and packed doubleword (dword) 223.
- Packed byte 221 is one hundred twenty-eight bits long containing sixteen packed byte data elements.
- a data element is an individual piece of data that is stored in a single register (or memory location) with other data elements of the same length.
- the number of data elements stored in a register is one hundred twenty-eight bits divided by the length in bits of a data element.
- Packed word 222 is one hundred twenty-eight bits long and contains eight packed word data elements. Each packed word contains sixteen bits of information.
- Packed doubleword 223 is one hundred twenty-eight bits long and contains four packed doubleword data elements. Each packed doubleword data element contains thirty-two bits of information.
- a packed quadword is one hundred twenty-eight bits long and contains two packed quad-word data elements.
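The relationship between register width, element width, and element count can be checked with a short Python sketch. It models a 128-bit register as a plain integer; the helper names `element_count`, `pack`, and `unpack` are illustrative and do not come from the patent.

```python
REGISTER_BITS = 128  # register width used throughout this description


def element_count(element_bits):
    """Number of data elements of the given width that fit in one register."""
    return REGISTER_BITS // element_bits


def pack(elements, element_bits):
    """Pack elements into one integer, element 0 in the lowest bits."""
    mask = (1 << element_bits) - 1
    value = 0
    for position, element in enumerate(elements):
        value |= (element & mask) << (position * element_bits)
    return value


def unpack(register_value, element_bits):
    """Split a register value into its packed data elements."""
    mask = (1 << element_bits) - 1
    return [(register_value >> (position * element_bits)) & mask
            for position in range(element_count(element_bits))]


register = pack(list(range(16)), 8)   # sixteen packed bytes 0..15
as_words = unpack(register, 16)       # the same bits viewed as eight words
print(element_count(8), element_count(16), element_count(32), element_count(64))
```

The same 128 bits can be read as sixteen bytes, eight words, four doublewords, or two quadwords; only the interpretation changes.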
- FIGS. 4A-4C illustrate the in-register packed data storage representation according to one embodiment of the invention.
- Unsigned packed byte in-register representation 310 illustrates the storage of an unsigned packed byte 221 in one of the multimedia registers 209, as shown in FIG. 4A.
- Information for each byte data element is stored in bit seven through bit zero for byte zero, bit fifteen through bit eight for byte one, bit twenty-three through bit sixteen for byte two, and finally bit one hundred twenty-seven through bit one hundred twenty for byte fifteen.
- All available bits are used in the register. This storage arrangement increases the storage efficiency of the processor. As well, with sixteen data elements accessed, one operation can now be performed on sixteen data elements simultaneously.
- Signed packed byte in-register representation 311 illustrates the storage of a signed packed byte 221. Note that the eighth bit of every byte data element is the sign indicator.
- Unsigned packed word in-register representation 312 illustrates how word seven through word zero are stored in a register of multimedia registers 209, as illustrated in FIG. 4B. Signed packed word in-register representation 313 is similar to the unsigned packed word in-register representation 312. Note that the sixteenth bit of each word data element is the sign indicator.
- Unsigned packed doubleword in-register representation 314 shows how multimedia registers 209 store four doubleword data elements, as illustrated in FIG. 4C. Signed packed doubleword in-register representation 315 is similar to unsigned packed doubleword in-register representation 314. Note that the necessary sign bit is the thirty-second bit of the doubleword data element.
- Efficient filtering and convolution of content data begins with loading of data source devices, such as for example a single instruction multiple data (SIMD) register, with data and filter/convolution co-efficients.
- efficient filter calculations and convolution require not only appropriate arithmetic instructions, but also efficient methods for organizing the data required to make the calculations.
- images are filtered by replacing the value of, for example, pixel I with a filtered value given by S[I]. Values of pixels on either side of pixel I are used in the filter calculation of S[I]. Similarly, pixels on either side of pixel I + 1 are required to compute S[I+1]. Consequently, to compute filter results for more than one pixel in an SIMD register, data is duplicated and arranged in the SIMD register for the calculation.
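A scalar Python sketch makes the data reuse visible. It assumes a symmetric window centred on pixel I and an odd number of co-efficients; the function name `filter_pixels`, the sample values, and the choice to skip border pixels are assumptions for illustration only.

```python
def filter_pixels(pixels, coefficients):
    """Compute S[I] for every pixel that has a full window of neighbours.

    Assumption: the window is centred on pixel I and the number of
    co-efficients is odd. Border pixels are skipped for simplicity.
    """
    half = len(coefficients) // 2
    results = []
    for i in range(half, len(pixels) - half):
        window = pixels[i - half:i + half + 1]
        results.append(sum(p * f for p, f in zip(window, coefficients)))
    return results


pixels = [10, 20, 30, 40, 50]
coefficients = [1, 2, 1]          # hypothetical three-tap filter
filtered = filter_pixels(pixels, coefficients)
print(filtered)
```

Pixels 20 and 30 each contribute to more than one result, which is why a SIMD implementation has to duplicate and rearrange data before the multiply step.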
- the present invention includes a byte shuffle instruction (PSHUFB) 145 that efficiently orders data of any size.
- the byte shuffle operation 145 orders data sizes, which are larger than bytes, by maintaining the relative position of bytes within the larger data during the shuffle operation.
- the byte shuffle operation 145 can change the relative position of data in an SIMD register and can also duplicate data.
- FIG. 5 depicts an example of a byte shuffle operation 145 for a filter with three co-efficients.
- the present invention describes a new instruction for the data arrangement. Accordingly, as depicted in FIG. 5, the data 404 is organized within a destination data storage device 406, which in one embodiment is the source data storage device 404, utilizing a mask 402 to specify the address wherein respective data elements are stored in the destination register 406. In one embodiment, the arrangement of the mask is based on the desired data processing operation, which may include for example, a filtering operation, a convolution operation or the like.
- the source data storage device 404 is a 128-bit SIMD register, which initially stores sixteen 8-bit pixels. As such, when utilizing a pixel filter with three co-efficients, the fourth co-efficient is set to zero.
- the source register 404 can be utilized as the destination data storage device or register, thereby reducing the number of registers that are needed. As such, overwritten data within the source data storage device 404 may be reloaded from memory or from another register.
- multiple registers may be used as the source data storage device 404, with their respective data organized within the destination data storage device 406 as desired.
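The mask-driven arrangement can be modelled as a simple gather: each destination byte takes the source byte that the mask names. The sketch below is an assumption-laden illustration rather than the instruction's definition. The name `byte_shuffle`, the sample pixel values, and the particular mask (groups of four pixels, so that a three-co-efficient filter can treat the fourth co-efficient as zero) are all hypothetical, and any special mask encodings of a real shuffle instruction are not modelled.

```python
def byte_shuffle(source, mask):
    """Build the destination by picking source[mask[i]] for each position i."""
    if len(source) != len(mask):
        raise ValueError("mask must have one entry per destination byte")
    return [source[index] for index in mask]


source = list(range(100, 116))    # sixteen 8-bit pixels, byte 0 first

# Hypothetical mask: each group of four bytes is a sliding window that
# starts one pixel later than the previous group, so pixels are duplicated.
mask = [0, 1, 2, 3,
        1, 2, 3, 4,
        2, 3, 4, 5,
        3, 4, 5, 6]

arranged = byte_shuffle(source, mask)
print(arranged)
```

One shuffle both reorders and duplicates bytes, so the windows for four neighbouring filter results end up side by side in a single register.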
- first and second source registers are N-bit long SIMD registers, such as for example 128-bit Intel® SSE2 XMM registers.
- the multiply-accumulate instruction (PMADDUSBW operation 147, FIG. 1) implemented on such a register would give the following results for two pixel vectors 452 and 454, which are stored within the destination register 456.
- the U and the S in the instruction mnemonically refer to unsigned and signed bytes. Bytes in one of the source registers are signed and in the other they are unsigned.
- the register with the unsigned data is the destination and holds the 16 multiply-accumulate results.
- the reason for this choice is that in most implementations, data is unsigned and co-efficients are signed. Accordingly, it is preferable to overwrite the data because the data is less likely to be needed in future calculations.
- Additional byte multiply-accumulate instructions as depicted in FIG. 1 are PMADDUUBW operation 149 for unsigned bytes in both registers and PMADDSSBW operation 151 for signed bytes in both source registers.
- the multiply-accumulate instructions are completed by a PMADDWD instruction 153 that applies to pairs of 16-bit signed words to produce a 32-bit signed product.
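The byte multiply-accumulate step can be sketched in Python as follows. This is a minimal model under stated assumptions: unsigned data bytes are multiplied by signed co-efficient bytes and each adjacent pair of products is summed into one wider result. Saturation and overflow behavior are not modeled, and the function name `pmadd_unsigned_signed` is hypothetical.

```python
def pmadd_unsigned_signed(data_u8, coef_s8):
    """Multiply unsigned bytes by signed bytes and add adjacent product pairs."""
    if len(data_u8) != len(coef_s8) or len(data_u8) % 2:
        raise ValueError("operands must have equal, even lengths")
    if any(not 0 <= d <= 255 for d in data_u8):
        raise ValueError("data bytes must be unsigned (0..255)")
    if any(not -128 <= c <= 127 for c in coef_s8):
        raise ValueError("co-efficient bytes must be signed (-128..127)")
    return [
        data_u8[i] * coef_s8[i] + data_u8[i + 1] * coef_s8[i + 1]
        for i in range(0, len(data_u8), 2)
    ]


# Two pairs of bytes produce two summed-product results.
pair_sums = pmadd_unsigned_signed([10, 20, 30, 0], [1, -2, 3, 0])
```

Each result covers only two products, which is why a further adjacent-add step is needed when a filter has more than two taps.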
- the second vector generally contains the filter co-efficients.
- the co-efficients can be loaded within a portion of the register and copied to the rest of the register using the shuffle instruction 145.
- a co-efficient data storage device 502, such as for example an XMM 128 bit register, is initially loaded with three co-efficients in response to execution of a data load instruction.
- filter co-efficients may be organized in memory prior to data processing. As such, the co-efficients may be initially loaded as depicted in FIG. 7B based on their organization within memory, prior to filtering.
- the co-efficient register 502 includes filter co-efficients F3, F2 and F1, which can be coded as signed or unsigned bytes.
- the existing instruction PSHUFD can be used to copy the filter co-efficients within the remaining portions of the co-efficient register to obtain the following result as depicted in FIG. 7B.
- the co-efficient register 504 now includes shuffled co-efficients as required to perform a data processing operation in parallel.
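The replication of co-efficients across the register can be pictured with a short Python sketch. It assumes that the three co-efficients plus a zero fourth co-efficient occupy the lowest four bytes (one doubleword) and that this doubleword is copied into every other doubleword position. The co-efficient values and the name `replicate_low_dword` are hypothetical, and the byte order within the doubleword is an assumption rather than a reading of FIGS. 7A and 7B.

```python
def replicate_low_dword(reg_bytes):
    """Copy the lowest four bytes into every four-byte group of the register."""
    if len(reg_bytes) % 4:
        raise ValueError("register length must be a multiple of four bytes")
    low = list(reg_bytes[:4])
    return low * (len(reg_bytes) // 4)


# Hypothetical co-efficients F1, F2, F3 followed by a zero fourth co-efficient,
# loaded into the low doubleword of a 16-byte register.
F1, F2, F3 = 1, -2, 1
loaded = [F1, F2, F3, 0] + [0] * 12

replicated = replicate_low_dword(loaded)
```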
- filters including three co-efficients are very common in image processing algorithms. However, those skilled in the art will recognize that certain filtering operations, such as JPEG 2000 utilize nine and seven 16-bit coefficients.
- processing of such co-efficients exceeds the capacity of co-efficient registers, resulting in a partially filtered result. Consequently, processing continues until a final result is obtained using each co-efficient.
- FIG. 7C illustrates the arrangement of pixel data within a source register 506 that was initially contained within the source register 404 as depicted in FIG. 5 and shuffled within the destination register 406. Accordingly, in response to execution of a data processing operation, the PMADDUSBW instruction can be used to compute the sum of the two multiplications with the result stored in the destination register 510. Unfortunately, in order to complete calculation and generate data processing results for the selected data processing operation, adjacent summed-product pairs within the destination register 510 must be added.
- the present invention utilizes adjacent-add instructions, the results of which are depicted in FIGS. 8A-8D.
- FIG. 8A depicts a destination register 552 following adding of two adjacent 16 bit values (PAADD2WD operation 157) to give a 32 bit sum.
- FIG. 8A depicts two adjacent 16 bit results of a multiply-accumulate instruction, which are added to give a 32 bit sum of 4 byte products.
- FIG. 8B depicts an adjacent-add instruction (PAADD4WD operation 157), which adds 4 adjacent 16-bit values to give a 32-bit sum.
- 4 adjacent 16-bit results of a byte multiply-accumulate instruction are added to give a 32-bit sum of 8 byte products.
- FIG. 8C illustrates an adjacent-add instruction (PAADD8WD operation 157), which adds 8 adjacent 16-bit values to give a 32-bit sum.
- the example illustrates 8 adjacent 16-bit results of a byte multiply-accumulate operation, which are added to give a 32-bit sum of 16 byte products.
- the selection of the instruction to perform an adjacent-add operation is based on the number of terms in a sum (N). For example, utilizing a three tap filter as depicted in FIGS. 7A-7C, a first instruction (PAADD2WD operation 157) will obtain the following result as depicted in FIG. 8D.
- the last instruction (PAADD8WD operation 157), as depicted in FIG. 8C, is utilized.
- Such an operation is becoming increasingly important for an efficient implementation as SIMD registers increase in size. Without such an operation, many additional instructions are required.
- the set of adjacent-add instructions, as described by the present invention, supports a wide range of numbers of adjacent values which can be added and a full range of common data types.
- the data size of the sum of 16 bit adjacent-additions is 32 bits.
- adjacent 16 bit values (PAADDWD operation 161) are added to yield a 32 bit sum.
- no other instruction with the 16 bit data size is included because adjacent-add instructions with a 32 bit input are used to add the sum produced by the instruction with a 16 bit input.
- the data size of the sum of 32 bit adjacent-additions is 32 bits.
- the results do not fill the register.
- as depicted in FIGS. 8A, 8B and 8C, three different adjacent-adds yield 4, 2 and 1 32-bit results, respectively.
- the results are stored in the lower, least significant parts of the destination data storage device. Accordingly, when there are two 32-bit results, as depicted in FIG.
- the results are stored in the lower 64 bits. In the case of one 32-bit result, as illustrated in FIG. 8C, the results are stored in the lower 32 bits.
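The relationship between the group size of an adjacent-add and the number of results it yields can be sketched in Python. This is an illustrative model only: eight 16-bit values are summed in groups of 2, 4 or 8, giving 4, 2 or 1 sums. The function name `adjacent_add` and the sample values are hypothetical, and index 0 is assumed to be the least significant element.

```python
def adjacent_add(values, n):
    """Sum each run of n adjacent values; results are listed lowest first."""
    if n <= 0 or len(values) % n:
        raise ValueError("n must be positive and divide the number of values")
    return [sum(values[i:i + n]) for i in range(0, len(values), n)]


# Eight 16-bit intermediate results, for example from a byte multiply-accumulate.
words = [1, 2, 3, 4, 5, 6, 7, 8]

sums_of_2 = adjacent_add(words, 2)  # four 32-bit sums
sums_of_4 = adjacent_add(words, 4)  # two 32-bit sums
sums_of_8 = adjacent_add(words, 8)  # one 32-bit sum
```

With fewer results than doubleword positions, only the lowest positions of the destination carry meaningful sums, which matches the description of results occupying the least significant part of the destination.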
- some applications utilize the sum of adjacent bytes.
- the present invention supports adjacent-addition of bytes with an instruction (PAADDNB operation 155) that adds two adjacent signed bytes giving a 16-bit word and an instruction that adds two adjacent unsigned bytes giving a 16-bit word result.
- Applications that require addition of more than two adjacent bytes add the 16-bit sum of two bytes with an appropriate 16 bit adjacent-add operation.
- results can be coded with a 32-bit precision. Therefore, results can be written back to memory using simple move operations acting on doublewords, for example, the MOVD operation 143 described above as well as Shift Right logical operations acting on the whole register (PSRLDQ), shift double quad-word right logical. As such, writing all results back to memory would need four MOVD and three PSRLDQ in the first case (FIG. 8A), two MOVD and one PSRLDQ in the second case (FIG. 8B) and finally, just one MOVD in the final case, as depicted in FIG. 8C. Unfortunately, although the adjacent-add operations, as depicted in FIG.
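The instruction counts quoted above follow from a simple loop, sketched here in Python. The sketch treats the register as a 128-bit integer, models a doubleword move as reading the low 32 bits and models the whole-register right shift as a shift by four bytes. The function name `store_results` and the sample values are hypothetical.

```python
def store_results(reg, count):
    """Write `count` 32-bit results out of a 128-bit register value.

    Returns the stored doublewords together with the number of doubleword
    moves and whole-register right shifts that were needed.
    """
    stored, moves, shifts = [], 0, 0
    for i in range(count):
        stored.append(reg & 0xFFFFFFFF)  # doubleword move of the low 32 bits
        moves += 1
        if i < count - 1:
            reg >>= 32                   # shift the register right by 4 bytes
            shifts += 1
    return stored, moves, shifts


# Pack four hypothetical 32-bit results, lowest doubleword first.
values = [11, 22, 33, 44]
register = sum(v << (32 * i) for i, v in enumerate(values))

four_results = store_results(register, 4)
two_results = store_results(register, 2)
one_result = store_results(register, 1)
```

For k results the loop performs k moves and k - 1 shifts, giving the 4 and 3, 2 and 1, and 1 and 0 counts of the three cases.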
- the present invention describes a register merge operation 163, as depicted in FIG. 9A.
- the register merge operation utilizes the number of bytes to select registers, which is provided by an input argument.
- FIG. 9B depicts an alternate embodiment for performance of the register merge operation. Initially, eight pixels are loaded into a first source register 608 (MM0). Next, a subsequent eight pixels are loaded in a second source register (MM1) 610. Next, a permute operation is performed on the second source register 610. Once performed, register 610 is copied to a third source register (MM2) 612. Next, the first source register 608 is right-shifted by eight bits. In addition, the second source register 610 and a mask register 614 are combined in accordance with a packed logical AND instruction and stored within the first source register 608.
- a logical OR operation is performed between the second source register 610 and the first source register 608 to produce the following result within the destination register 620, resulting in the register merge operation.
- the process continues as illustrated by shifting the first source register 608.
- the second source register 610 is shifted to yield the register 612.
- a logical AND operation is performed between the mask register 614 and the second source register 612, with the results stored in a destination register 622.
- a packed OR operation is performed between the second source register 612 and the first source register 608 to yield a subsequent register merge operation within the destination register 624.
- FIG. 10 depicts a block diagram illustrating a method 700 for efficient filtering and convolution of content data within, for example, the computer system 100 as depicted in FIGS. 1 and 2.
- content data refers to image, audio, video and speech data.
- the present invention refers to data storage devices, which as recognized by those skilled in the art, include various devices capable of storing digital data including, for example, data registers such as 128-bit Intel® architecture SSE2 XMM registers.
- the method begins at process block 702, wherein it is determined whether a data processing operation is executed.
- the data processing operation includes, but is not limited to, convolution and filtering operations performed on pixel data.
- a data load instruction is executed.
- input data stream data is loaded within a source data storage device 212A and a secondary data storage device 212B, for example as depicted in FIG. 2.
- a selected portion of data from, for example, a source data storage device 212B is organized within a destination data storage device or according to an arrangement of co-efficients within a co-efficient data storage device (see FIG. 5).
- Coefficients within a co-efficient data storage device are organized according to the desired data processing operation calculations (for example, as illustrated in FIGS. 7A and 7B).
- co-efficients are organized within memory prior to any filtering operations. Accordingly, co-efficients may be loaded in a co-efficient data storage without the need for shuffling (see FIG. 7B).
- FIG. 11 depicts a block diagram illustrating a method 722 for processing data according to the data processing operation.
- at process block 724, it is determined whether the data processing operation has executed a multiply-accumulate instruction.
- a plurality of summed-product pairs of data within the destination storage device and co-efficients within the co-efficient data storage device are generated, as depicted in FIG. 7C.
- adjacent summed-product pairs within the destination data storage device 510 are added in response to execution of the adjacent-add instruction to form one or more data processing operation results (see FIG. 8D).
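The sequence of arranging data, forming summed-product pairs and adding adjacent pairs can be traced end to end in Python for a three tap filter. This is a sketch under stated assumptions, not the patented implementation: the helper names, the pixel values, the taps and the shuffle mask are all hypothetical, and the fourth co-efficient is zero as described for a three co-efficient filter.

```python
def byte_shuffle(src, mask):
    """Arrange data: dest[i] = src[mask[i]] (assumed shuffle semantics)."""
    return [src[m] for m in mask]


def multiply_accumulate(data, coefs):
    """Form one summed-product pair for each two adjacent elements."""
    return [
        data[i] * coefs[i] + data[i + 1] * coefs[i + 1]
        for i in range(0, len(data), 2)
    ]


def adjacent_add(values, n):
    """Add each run of n adjacent values."""
    return [sum(values[i:i + n]) for i in range(0, len(values), n)]


pixels = [3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5, 8, 9, 7, 9, 3]
taps = [1, -2, 1, 0]                 # three co-efficients, fourth set to zero
mask = [0, 1, 2, 3,  1, 2, 3, 4,  2, 3, 4, 5,  3, 4, 5, 6]
coefs = taps * 4                     # co-efficients replicated across the register

arranged = byte_shuffle(pixels, mask)          # 16 arranged bytes
pairs = multiply_accumulate(arranged, coefs)   # 8 summed-product pairs
results = adjacent_add(pairs, 2)               # 4 filter results

# Scalar reference: S[i] = sum over k of taps[k] * pixels[i + k].
reference = [sum(taps[k] * pixels[i + k] for k in range(3)) for i in range(4)]
```

The three-step result agrees with the scalar reference, which shows why two summed-product pairs per output must be added before a four-term sum is complete.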
- at process block 732, where the number of co-efficients exceeds a capacity of the co-efficient register, partial data processing results are obtained. Consequently, processing and organizing of co-efficients (process block 734) and data (process block 736) continues until final data processing operation results are obtained, as indicated in optional process blocks 732-736. Otherwise, at process block 738, the one or more data processing operation results are stored. Finally, at process block 790, it is determined whether processing of input data stream data is complete. As such, process blocks 724-732 are repeated until processing of input data stream data is complete. Once processing is complete, control flow returns to process block 720, wherein the method 700 terminates. Referring now to FIG. 12, FIG. 12, FIG.
- at process block 742, it is determined whether there is any unaccessed data within the source data storage device 212A.
- unaccessed data refers to data within the source data storage device 212A that has not been shuffled within the data storage device in order to perform a multiply-accumulate instruction.
- at process block 744, a portion of data is selected from the source data storage device as the selected data.
- process block 786 is performed.
- at process block 746, one or more unprocessed data elements are selected from the source data storage device, as well as one or more data elements from a secondary data storage device.
- unprocessed data elements refer to data elements for which a data processing operation result has not yet been calculated.
- a register merge instruction (see FIGS. 9A and 9B) is performed which concatenates the unprocessed data elements of the source data storage device with the data elements selected from the secondary data storage device to form the selected data.
- data from the secondary data storage device is moved to the source data storage device.
- the source data storage device data is no longer required, since it has all been accessed. Accordingly, the secondary storage of data, which contains unaccessed data, can be used to overwrite data within the source data storage device.
- the secondary data storage device is loaded with input data stream data from a memory device, which requires additional data processing, such as filtering or convolution.
- the selected data is organized within a destination data storage device or according to the arrangement of co-efficients within the co-efficient data storage device (see FIG. 5). Once performed, control flow returns to process block 790, as depicted in FIG. 11 for continued processing of the selected data.
- at process block 750, it is determined whether the source data storage device contains unprocessed data. When each portion of data within the source data storage device has been processed, process block 770 is performed. At process block 770, a portion of data is selected from the secondary data storage device, which functions as the selected data, which is then processed in accordance with the data processing operation. Otherwise, at process block 752, one or more unprocessed data elements are selected from the source data storage device. Finally, at process block 766, additional data elements are selected from the secondary data storage device according to a count of the unprocessed data elements to form the selected data.
- FIG. 14 depicts an additional method 754 for selecting unprocessed data elements of process block 752, as depicted in FIG. 13.
- a data element is selected from the source data storage device.
- the selected data element is discarded. Otherwise, at process block 760, the selected data element is an unprocessed data element and is stored. Next, at process block 762, an unprocessed data element count is incremented. Finally, at process block 764, process blocks 756-762 are repeated until each data element within the source data storage device is processed. As such, utilizing the teachings of the present invention, unnecessary data type changes are avoided, resulting in a maximization of the number of SIMD operations per instruction. In addition, a significant reduction in the number of clock cycles required to order data for arithmetic operations is also achieved. Accordingly, Table 1 gives estimated speed-up values for several filtering applications using the teachings and instructions described by the present invention.
- Embodiments of the present invention provide many advantages over known techniques.
- the present invention includes the ability to efficiently implement operations for filtering/convolution for multiple array lengths and data sizes and co-efficient signs. These operations are accomplished by using a few instructions that are a part of a small group of single instruction multiple data (SIMD) instructions. Accordingly, the present invention avoids unnecessary data type changes. As a result, by avoiding unnecessary data type changes, the present invention maximizes the number of SIMD operations per instruction, while significantly reducing the number of clock cycles required to order data for arithmetic operations such as multiply-accumulate operations.
- Figure 15 is a block diagram of the micro-architecture for a processor of one embodiment that includes logic circuits to perform parallel shift right merge operations in accordance with the present invention.
- the shift right merge operation may also be referred to as a register merge operation and register merge instruction as in the discussion above.
- the instruction produces the same results as the register merge operation 167 of Figs. 1, 9A and 9B.
- the in-order front end 1001 is the part of the processor 1000 that fetches the macro-instructions to be executed and prepares them to be used later in the processor pipeline.
- the front end of this embodiment includes several units.
- the instruction prefetcher 1026 fetches macro-instructions from memory and feeds them to an instruction decoder 1028 which in turn decodes them into primitives called micro-instructions or micro-operations (also called micro-ops or uops) that the machine knows how to execute.
- the trace cache 1030 takes decoded uops and assembles them into program ordered sequences or traces in the uop queue 1034 for execution.
- the microcode ROM 1032 provides the uops needed to complete the operation.
- the trace cache 1030 refers to an entry point programmable logic array (PLA) to determine a correct microinstruction pointer for reading the micro-code sequences for the divide algorithms in the micro-code ROM 1032.
- Some SIMD and other multimedia types of instructions are considered complex instructions. Most floating point related instructions are also complex instructions. As such, when the instruction decoder 1028 encounters a complex macro-instruction, the microcode ROM 1032 is accessed at the appropriate location to retrieve the microcode sequence for that macro-instruction. The various micro-ops needed for performing that macro-instruction are communicated to the out-of-order execution engine 1003 for execution at the appropriate integer and floating point execution units. The out-of-order execution engine 1003 is where the micro-instructions are prepared for execution. The out-of-order execution logic has a number of buffers to smooth out and re-order the flow of micro-instructions to optimize performance as they go down the pipeline and get scheduled for execution.
- the allocator logic allocates the machine buffers and resources that each uop needs in order to execute.
- the register renaming logic renames logic registers onto entries in a register file.
- the allocator also allocates an entry for each uop in one of the two uop queues, one for memory operations and one for non-memory operations, in front of the instruction schedulers: memory scheduler, fast scheduler 1002, slow/general floating point scheduler 1004, and simple floating point scheduler 1006.
- the uop schedulers 1002, 1004, 1006, determine when a uop is ready to execute based on the readiness of their dependent input register operand sources and the availability of the execution resources the uops need to complete their operation.
- the fast scheduler 1002 of this embodiment can schedule on each half of the main clock cycle while the other schedulers can only schedule once per main processor clock cycle.
- the schedulers arbitrate for the dispatch ports to schedule uops for execution.
- Register files 1008, 1010 sit between the schedulers 1002, 1004, 1006, and the execution units 1012, 1014, 1016, 1018, 1020, 1022, 1024 in the execution block 1011.
- there is a separate register file 1008, 1010 for integer and floating point operations, respectively.
- Each register file 1008, 1010, of this embodiment also includes a bypass network that can bypass or forward just completed results that have not yet been written into the register file to new dependent uops.
- the integer register file 1008 and the floating point register file 1010 are also capable of communicating data with each other.
- the integer register file 1008 is split into two separate register files, one register file for the low order 32 bits of data and a second register file for the high order 32 bits of data.
- the floating point register file 1010 of one embodiment has 128 bit wide entries because floating point instructions typically have operands from 64 to 128 bits in width.
- the execution block 1011 contains the execution units 1012, 1014, 1016, 1018, 1020, 1022, 1024, where the instructions are actually executed.
- This section includes the register files 1008, 1010, that store the integer and floating point data operand values that the micro-instructions need to execute.
- the processor 1000 of this embodiment is comprised of a number of execution units: address generation unit (AGU) 1012, AGU
- the floating point execution blocks 1022, 1024, execute floating point, MMX, SIMD, and SSE operations.
- the floating point ALU 322 of this embodiment includes a 64 bit by 64 bit floating point divider to execute divide, square root, and remainder micro-ops.
- any act involving a floating point value occurs with the floating point hardware.
- conversions between integer format and floating point format involve a floating point register file.
- a floating point divide operation happens at a floating point divider.
- non-floating point numbers and integer type are handled with integer hardware resources.
- the simple, very frequent ALU operations go to the high-speed ALU execution units 1016, 1018.
- the fast ALUs 1016, 1018, of this embodiment can execute fast operations with an effective latency of half a clock cycle.
- most complex integer operations go to the slow ALU 1020 as the slow ALU 1020 includes integer execution hardware for long latency type of operations, such as a multiplier, shifts, flag logic, and branch processing.
- Memory load/store operations are executed by the AGUs 1012, 1014.
- the integer ALUs 1016, 1018, 1020 are described in the context of performing integer operations on 64 bit data operands.
- the ALUs 1016, 1018, 1020 can be implemented to support a variety of data bits including 16, 32, 128, 256, etc.
- the floating point units 1022, 1024 can be implemented to support a range of operands having bits of various widths.
- the floating point units 1022, 1024 can operate on 128 bits wide packed data operands in conjunction with SIMD and multimedia instructions.
- the term "registers" is used herein to refer to the on-board processor storage locations that are used as part of macro-instructions to identify operands. In other words, the registers referred to herein are those that are visible from the outside of the processor.
- the registers described herein can be implemented by circuitry within a processor using any number of different techniques, such as dedicated physical registers, dynamically allocated physical registers using register renaming, combinations of dedicated and dynamically allocated physical registers, etc.
- the registers are understood to be data registers designed to hold packed data, such as 64 bits wide MMX™ registers (mm registers) in microprocessors enabled with MMX technology from Intel Corporation of Santa Clara, California. These MMX registers, available in both integer and floating point forms, can operate with packed data elements that accompany SIMD and SSE instructions.
- 128 bits wide XMM registers relating to SSE2 technology can also be used to hold such packed data operands.
- FIG. 16A is a block diagram of one embodiment of logic to perform a parallel shift right merge operation on data operands in accordance with the present invention.
- the instruction (PSRMRG) for a shift right merge (also, a register shift) operation of this embodiment begins with three pieces of information: a first data operand
- the shift PSRMRG instruction is decoded into one micro-operation.
- the instruction may be decoded into a varying number of micro-ops to perform the shift merge operation on the data operands.
- the data operands 1102, 1104, are 64 bit wide pieces of data stored in a register/memory and the shift count 1106 is an 8 bit wide immediate value.
- the data operands and shift count can be other widths such as 128/256 bits and 16 bits, respectively.
- the first operand 1102 in this example is comprised of eight data segments: P, O, N, M, L, K, J, and I.
- the second operand 1104 is also comprised of eight data segments: H, G, F, E, D, C, B, and A.
- the data segments here are of equal length and each comprise a single byte (8 bits) of data.
- another embodiment of the present invention operates with longer 128 bit operands wherein the data segments are comprised of a single byte (8 bits) each and the 128 bit wide operand would have sixteen byte wide data segments.
- if each data segment were a double word (32 bits) or a quad word (64 bits), the 128 bit operand would have four double word wide or two quad word wide data segments, respectively.
- embodiments of the present invention are not restricted to particular length data operands, data segments, or shift counts, and can be sized appropriately for each implementation.
- the operands 1102, 1104 can reside either in a register or a memory location or a register file or a mix.
- the data operands 1102, 1104, and the count 1106 are sent to an execution unit 1110 in the processor along with a shift right merge instruction.
- the shift right merge instruction can be in the form of a micro operation (uop) or some other decoded format.
- the two data operands 1102, 1104 are received at concatenate logic and a temporary register.
- the concatenate logic merges/joins the data segments for the two operands and places the new block of data in a temporary register.
- the new data block is comprised of sixteen data segments: P, O, N, M, L, K, J, I, H, G, F, E, D, C, B, A.
- the temporary register needed to hold the combined data is 128 bits wide. For 128 bits wide data operands, a 256 bits wide temporary register is needed.
- Right shift logic 1114 in the execution unit 1110 takes the contents of the temporary register and performs a logical shift right of the data block by n data segments as requested by the count 1106.
- the count 1106 indicates the number of bytes to right shift.
- the count 1106 can also be used to indicate the number of bits, nibbles, words, double words, quad words, etc. to shift, depending on the granularity of the data segments. For this example, n is equal to 3, so the temporary register contents are shifted by three bytes. If each data segment was a word or double word wide, then the count can indicate the number of words or double words to shift, respectively.
- 0's are shifted in from the left side of the temporary register to fill up the vacated spaces as the data in the register is shifted right.
- where the shift count 1106 is greater than the number of data segments in a data operand (eight in this case), one or more 0's can appear in the resultant 1108.
- where the shift count 1106 is equal to or exceeds the total number of data segments for both operands, the resultant will comprise all 0's, as all the data segments will have been shifted away.
- the right shift logic 1114 outputs the appropriate number of data segments from the temporary register as the resultant 1108.
- an output multiplexer or latch can be included after the right shift logic to output the resultant.
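The concatenate-then-shift embodiment can be modeled with Python integers. This is a minimal sketch, assuming byte-wide data segments and 64 bits wide operands, with the first operand placed in the upper half of the temporary value. The function name `shift_right_merge` is hypothetical, and the letters P through A stand in for the data segments of the example.

```python
SEGMENT_BITS = 8  # byte-wide data segments, as in the example


def shift_right_merge(first, second, count, segments=8):
    """Concatenate first:second, shift right by `count` segments, keep the low half."""
    if count < 0:
        raise ValueError("count must be non-negative")
    width = segments * SEGMENT_BITS
    temp = (first << width) | second   # temporary register, twice the operand width
    temp >>= count * SEGMENT_BITS      # logical shift right; zeros enter from the left
    return temp & ((1 << width) - 1)   # output one operand's worth of segments


# Operands written most significant segment first, as in the text.
first = int.from_bytes(b"PONMLKJI", "big")
second = int.from_bytes(b"HGFEDCBA", "big")

merged = shift_right_merge(first, second, 3).to_bytes(8, "big")
```

With a count of 3 the resultant holds K, J, I, H, G, F, E, D. Counts above eight bring zeros into the upper segments, and a count of sixteen or more leaves only zeros.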
- Figure 16B is a block diagram of another embodiment of logic to perform a shift right merge operation.
- the shift right merge operation of this embodiment begins with three pieces of information: a first 64 bits wide data operand 1102, a second 64 bits wide data operand 1104, and an 8 bits wide shift count 1106.
- the shift count 1106 indicates how many places to shift the data segments. For this embodiment, the count 1106 is stated in number of bytes.
- the count may indicate the number of bits, nibbles, words, double words, or quad words to shift the data.
- the first operand 1102 in this example is comprised of eight equal length, byte size data segments (P, O, N, M, L, K, J, I) and the second operand 1104 is comprised of eight data segments (H, G, F, E, D, C, B, A).
- the count n is equal to 3.
- Another embodiment of the invention can operate with alternative length operands and data segments, such as 128/256/512 bits wide operands and bit/byte/word/double word/quad word sized data segments and 8/16/32 bits wide shift counts.
- embodiments of the present invention are not restricted to particular length data operands, data segments, or shift counts, and can be sized appropriately for each implementation.
- the data operands 1102, 1104, and the count 1106 are sent to an execution unit 1120 in the processor along with a shift right merge instruction.
- the first data operand 1102 and the second data operand 1104 are received at shift left logic 1122 and shift right logic 1124, respectively.
- the count 1106 is also sent to the shift logic
- the shift left logic 1122 shifts data segments for the first operand 1102 left by the "number of data segments in the first operand - n" number of segments. As the data segments are shifted left, 0's are shifted in from the right side to fill up the vacated spaces. In this case, there are eight data segments, so the first operand 1102 is shifted left by eight minus three, or five, places. The first operand 1102 is shifted by this different value to achieve the correct data alignment for merging at the logic OR gate 1126. After the left shift here, the first data operand becomes: K, J, I, 0, 0, 0, 0, 0.
- the shift left calculation can yield a negative number, indicating a negative left shift.
- a logical left shift with a negative count is interpreted as a shift in the negative direction and is essentially a logical right shift.
- a negative left shift will bring in 0's from the left side of the first operand 1102.
- the shift right logic 1124 shifts data segments for the second operand right by n number of segments. As the data segments are shifted right, 0's are shifted in from the left side to fill up the vacated spaces. The second data operand becomes: 0, 0, 0, H, G, F, E, D. The shifted operands are outputted from the shift left/right logic 1122, 1124, and merged together at the logic OR gate 1126.
- the OR gate performs a logical OR-ing of the data segments and provides a 64 bits wide resultant 1108 of this embodiment.
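The shift-left, shift-right and OR embodiment can be sketched in Python in the same style. The sketch assumes byte-wide segments and treats a negative left-shift amount as a right shift, as the description states. The function names are hypothetical, and the concatenate-and-shift helper is included only as a reference to compare against.

```python
def shift_right_merge_or(first, second, count, segments=8):
    """Merge by shifting the first operand left, the second right, then OR-ing."""
    if count < 0:
        raise ValueError("count must be non-negative")
    width = segments * 8
    mask = (1 << width) - 1
    left_amount = segments - count            # may be negative when count > segments
    if left_amount >= 0:
        shifted_first = (first << (left_amount * 8)) & mask
    else:
        shifted_first = first >> (-left_amount * 8)  # negative left shift = right shift
    shifted_second = second >> (count * 8)    # zeros enter from the left
    return shifted_first | shifted_second


def concat_reference(first, second, count, segments=8):
    """Reference model: concatenate, shift right, keep the low operand width."""
    width = segments * 8
    return (((first << width) | second) >> (count * 8)) & ((1 << width) - 1)


first = int.from_bytes(b"PONMLKJI", "big")
second = int.from_bytes(b"HGFEDCBA", "big")

merged_or = shift_right_merge_or(first, second, 3).to_bytes(8, "big")
```

For a count of 3 the two shifted operands are K, J, I, 0, 0, 0, 0, 0 and 0, 0, 0, H, G, F, E, D, so the OR never combines two non-zero segments in the same position.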
- Figure 17A illustrates the operation of a parallel shift right merge instruction in accordance with a first embodiment of the present invention.
- MM1 1204, MM2 1206, TEMP 1232, and DEST 1242 are generally referred to as operands or data blocks, but are not restricted as such and also include registers, register files, and memory locations.
- MM1 1204 and MM2 1206 are 64 bits wide MMX registers (also referred to as 'mm' in some instances).
- a shift count imm[y] 1202, a first operand MM1[x] 1204, and a second operand MM2[x] 1206 are sent with the parallel shift right merge instruction.
- the count 1202 is an immediate value of y bits width.
- the first 1204 and second 1206 operands are data blocks including x data segments and having total widths of 8x bits each if each data segment is a byte (8 bits).
- the first 1204 and second 1206 operands are each packed with a number of smaller data segments.
- the first data operand MM1 1204 is comprised of eight equal length data segments: P 1211, O 1212, N 1213, M 1214, L 1215, K 1216, J 1217, I 1218.
- the second data operand MM2 1206 is comprised of eight equal length data segments: H 1221, G 1222, F 1223, E 1224, D 1225, C 1226, B 1227, A 1228.
- each of these data segments are 'x * 8' bits wide.
- each operand is 8 bytes or 64 bits wide.
- a data element can be a nibble (4 bits), word (16 bits), double word (32 bits), quad word (64 bits), etc.
- x can be 16, 32, 64, etc. data elements wide.
- y is equal to 8 for this embodiment and the immediate can be represented as a byte.
- y can be 4, 16, 32, etc. bits wide.
- the count 1202 is not limited to an immediate value and can also be stored in a register or memory location.
- the operands MM1 1204 and MM2 1206 are merged together at state II 1230 to form a temporary data block TEMP[2x] 1232 of 2x data elements (or bytes in this case) wide.
- the merged data 1232 of this example is comprised of sixteen data segments arranged as: P, O, N, M, L, K, J, I, H, G, F, E, D, C, B, and A.
- An eight byte wide window 1234 frames eight data segments of the temporary data block 1232, starting from the rightmost edge. Thus the right edge of the window 1234 would line up with the right edge of the data block 1232 such that the window 1234 frames data segments: H, G, F, E, D, C, B, and A.
- the shift count n 1202 indicates the desired amount to right shift the merged data.
- the count value can be implemented to state the shift amount in terms of bits, nibbles, bytes, words, double words, quad words, etc., or particular number of data segments.
- the data block 1232 is shifted right 1236 by n data segments here.
- n is equal to 3 and the data block 1232 is slid three places to the right.
- Another way of looking at this is to shift the window 1234 in the opposite direction.
- the window 1234 can be conceptually viewed as shifting three places to the left from the right edge of the temporary data block 1232.
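The concatenate, shift, and window view described above can be sketched in a few lines. This is a minimal Python model rather than the hardware of the embodiment: the function name `shift_right_merge` and the use of Python lists (most significant segment first) are assumptions made for illustration.

```python
def shift_right_merge(mm1, mm2, n):
    """Model the concatenate-then-shift view of the operation.

    mm1 and mm2 are equal-length lists of data segments, most significant
    segment first. Returns the x segments framed by the window after the
    2x-segment block is shifted right by n segments with 0 fill.
    """
    if len(mm1) != len(mm2):
        raise ValueError("operands must hold the same number of segments")
    if n < 0:
        raise ValueError("shift count must be non-negative")
    x = len(mm1)
    temp = list(mm1) + list(mm2)          # TEMP[2x]: MM1 on the left, MM2 on the right
    n = min(n, 2 * x)                     # counts of 2x or more clear every segment
    shifted = [0] * n + temp[:2 * x - n]  # 0's enter from the left
    return shifted[x:]                    # window on the rightmost x segments


mm1 = list("PONMLKJI")
mm2 = list("HGFEDCBA")
result = shift_right_merge(mm1, mm2, 3)
print(result)  # ['K', 'J', 'I', 'H', 'G', 'F', 'E', 'D']
```

With a count of 0 the window returns the second operand unchanged, and with a count of 8 it returns the first operand.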
- the shift right merge instruction is accompanied at state I 1250 by a count imm[y] of y bits, a first data operand MM1[x] of x data segments, and a second data operand MM2[x] of x data segments.
- y is equal to 8 and x is equal to 8, with MM1 and MM2 each being 64 bits or 8 bytes wide.
- the first 1204 and second 1206 of this embodiment are packed with a number of equally sized data segments, each a byte wide in this case, "P 1211, O 1212,
- the shift count n 1202 is used to shift the first 1204 and second 1206 operands.
- the count of this embodiment indicates the number of data segments to right shift the merged data.
- the shifting occurs before the merging of the first 1204 and second 1206 operands.
- the first operand 1204 is shifted differently.
- the first operand 1204 is shifted left by x minus n data segments.
- the "x - n" calculation allows for proper data alignment at later data merging.
- the first operand 1204 is shifted to the left by five data segments or five bytes. There are 0's shifted in from the right side to fill the vacated spaces.
- shift left calculation of "x - n" can yield a negative number, which in essence indicates a negative left shift.
- a logical left shift with a negative count is interpreted as a left shift in the negative direction and is essentially a logical right shift.
- a negative left shift will bring in 0's from the left side of the first operand 1204.
- the second operand 1206 is shifted right by the shift count of 3 and 0's are shifted in from the left side to fill the vacancies.
- the shifted results for the first 1204 and second 1206 operands are stored in x data segment wide registers TEMP1 1266 and TEMP2 1268, respectively.
- the shifted results from TEMP1 1266 and TEMP2 1268 are merged together 1272 to generate the desired shift merged data at register DEST 1242 at state III 1270.
- if the shift count n 1202 is greater than x, the resultant can contain one or more 0's from the left side.
- if the shift count n 1202 is equal to 2x or greater, the resultant in DEST 1242 will comprise all 0's.
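The shift-then-merge variant, including the negative left shift that arises when n exceeds x, can be sketched as follows. The helper names and the use of byte values as data segments are illustrative assumptions, and the count is assumed to be non-negative.

```python
def shift_right(segments, count):
    """Logical right shift by whole segments; 0's enter from the left."""
    x = len(segments)
    count = min(count, x)
    return [0] * count + list(segments[:x - count])


def shift_left(segments, count):
    """Logical left shift by whole segments; a negative count shifts right."""
    if count < 0:
        return shift_right(segments, -count)
    count = min(count, len(segments))
    return list(segments[count:]) + [0] * count


def shift_then_merge(mm1, mm2, n):
    """Shift MM1 left by x - n, shift MM2 right by n, then OR the results."""
    x = len(mm1)
    temp1 = shift_left(mm1, x - n)   # TEMP1
    temp2 = shift_right(mm2, n)      # TEMP2
    return [a | b for a, b in zip(temp1, temp2)]


mm1 = list(b"PONMLKJI")   # byte values, most significant segment first
mm2 = list(b"HGFEDCBA")
dest = shift_then_merge(mm1, mm2, 3)
print(bytes(dest))        # b'KJIHGFED'
```

For n = 11 the left shift count becomes -3, so the first operand moves three segments to the right and the second operand contributes only 0's.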
- MM2 can be 64 bits data registers in a processor enabled with MMX/SSE technology or
- MM1 and MM2 are source operands to a shift right merge instruction (PSRMRG) as described above.
- the shift count IMM is also an immediate to such a PSRMRG instruction.
- the destination for the resultant, DEST is also a MMX or XMM data register.
- DEST may be the same register as one of the source operands.
- a PSRMRG instruction has a first source operand MM1 and a second source operand MM2.
- FIG. 18A is a flow chart illustrating one embodiment of a method to shift right and merge data operands in parallel.
- the length value of L is generally used here to represent the width of the operands and data blocks.
- L can be used to designate the width in terms of number of data segments, bits, bytes, words, etc.
- a first length L data operand is received for use with the execution of a shift merge operation.
- a second length L data operand for the shift merge operation is also received at block 1304.
- Execution logic at block 1308 concatenates the first operand and the second operand together.
- a temporary length 2L register holds the concatenated data block.
- the merged data is held in a memory location.
- the concatenated data block is shifted right by the shift count. If the count is expressed as a data segment count, then the data block is shifted right by that many data segments and 0's are shifted in from the left along the most significant end of the data block to fill the vacancies. If the count is expressed in bits or bytes, for example, the data block is similarly right shifted by that distance.
- at block 1312, a length L resultant is generated from the right hand side or least significant end of the shifted data block.
- the length L amount of data segments are muxed from the shifted data block to a destination register or memory location.
- Figure 18B is a flow chart illustrating another embodiment of a method to shift right and merge data.
- a first length L data operand is received for processing with a shift right and merge operation at block 1352.
- a second length L data operand is received at block 1354.
- a shift count to indicate the desired right shift distance is received.
- the first data operand is shifted left at block 1358 based on a calculation with the shift count.
- the calculation of one embodiment comprises subtracting the shift count from L.
- the first operand is shifted left by "L - shift count" segments, with 0's shifting in from the least significant end of the operand.
- if L is expressed in bits and the count is in bytes, the first operand would be shifted left by "L - shift count * 8" bits.
- the second data operand is shifted right at block 1360 by the shift count and 0's shifted in from the most significant end of the second operand to fill vacancies.
- the shifted first operand and the shifted second operand are merged together to generate a length L resultant.
- the merging yields a result comprising the desired data segments from both the first and second operands.
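When L is expressed in bits and the count in bytes, the same flow can be modeled on plain integers. The sketch below assumes L = 64 and operands that fit in 64 bits; the names `L_BITS`, `MASK`, and `shift_right_merge_bits` are hypothetical.

```python
L_BITS = 64                      # operand width L, expressed in bits
MASK = (1 << L_BITS) - 1


def shift_right_merge_bits(first, second, count_bytes):
    """Shift-then-merge on two L-bit integers with the count given in bytes."""
    if count_bytes < 0:
        raise ValueError("shift count must be non-negative")
    right = count_bytes * 8
    left = L_BITS - right                        # "L - shift count * 8" bits
    if left >= 0:
        shifted_first = (first << left) & MASK   # 0's enter at the least significant end
    else:
        shifted_first = first >> -left           # negative left shift acts as a right shift
    shifted_second = second >> right             # 0's enter at the most significant end
    return shifted_first | shifted_second        # merge into a length L resultant


first = int.from_bytes(b"PONMLKJI", "big")
second = int.from_bytes(b"HGFEDCBA", "big")
result = shift_right_merge_bits(first, second, 3)
print(result.to_bytes(8, "big"))  # b'KJIHGFED'
```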
- MPEG Motion Picture Expert Group
- the second layer down is a group of pictures composed of one or more groups of intra and/or non-intra frames.
- the third layer down is the picture layer itself and the next layer underneath that is a slice layer.
- Each slice is a contiguous sequence of raster ordered macroblocks, most often on a row basis in typical video applications, but not limited as such.
- Each slice consists of macroblocks, which are 16 x 16 arrays of luminance pixels, or picture data elements, with two 8 x 8 arrays of associated chrominance pixels.
- the macroblocks can be further divided into distinct 8 x 8 blocks for further processing, such as transform coding.
- the macroblock is a fundamental unit for motion compensation and motion estimation, and can have motion vectors associated with it.
- macroblocks can be 16 rows by 16 columns or a variety of dimensions.
- Motion estimation is based on the premise that consecutive video frames will generally be similar except for changes induced by objects moving within the frames. If there is zero motion between frames, an encoder can easily and efficiently predict the current frame as a duplicate of the previous or prediction frame.
- the previous frame may also be called the reference frame.
- the reference frame can be the next frame or even some other frame in the sequence. Embodiments of the motion estimation are not required to compare a current frame against a previous frame. Thus any other frame can be used in the comparison.
- the information necessary to transmit to the encoder becomes the syntactic overhead needed to reconstruct the picture from the original reference frame.
- the differences between a best matching macroblock and the current macroblock would ideally be a lot of 0 values.
- the differences between the best match and the current macroblock are transformed and quantized.
- the quantized values are communicated to a variable length coding for compression. As 0's can compress very well, a best match having many 0 differences values is desirable. Motion vectors can also be derived from the differences values.
- Figure 19A illustrates a first example of motion estimation.
- the left frame 1402 is a sample of a previous video frame including a stick figure and a signpost.
- the right frame 1404 is a sample of a current video frame including a similar stick figure and signpost.
- panning has resulted in the signpost moving towards the right and down from its original position in the previous frame 1402.
- the stick figure with the now raised arms in the current frame has also shifted downwards to the right side from the center of the previous frame 1402.
- Motion estimation algorithms can be used to adequately represent the changes between the two video frames 1402, 1404.
- the motion estimation algorithm performs a comprehensive two dimensional (2D) spatial search for each luminance macroblock.
- motion estimation may not be directly applied to the chrominance in MPEG video as the color motion may be adequately represented by the same motion information as the luminance.
- Many different ways are possible for implementing motion estimation and the particular scheme for conducting motion estimation is somewhat dependent upon complexity versus quality issues for that specific application. A full, exhaustive search over a wide 2D area can generally yield the best matching results. However, this performance comes at an extreme computational cost, as motion estimation is often the most computationally expensive portion of video encoding.
- Figure 19B illustrates an example of a macroblock search.
- Frames 1410, 1420 each include various macroblocks.
- the target macroblock 1430 of a current frame is the current macroblock to be matched with previous macroblocks from the previous frames 1410, 1420.
- a bad match macroblock 1412 contains a portion of a signpost and is a bad match with the current macroblock.
- a good match macroblock 1422 contains bits of a signpost and a head from the stick figure, like in the current macroblock 1430 to be coded.
- the two macroblocks 1422, 1430 have some commonality and only a slight error is visible. Because a relatively good match is found, the encoder assigns motion vectors to the macroblock. These vectors indicate how far the macroblock has to be moved horizontally and vertically so that a match is made.
- Figure 20 illustrates an example application of motion estimation and a resulting prediction in generating a second frame.
- the previous frame 1510 comes before the current frame 1520 in time.
- the current frame 1520 is subtracted from the previous frame 1510 to obtain a less complicated residual error picture 1530 that can be encoded and transmitted.
- the previous frame 1510 of this example comprises a signpost 1511 and a stick figure 1513.
- the current frame 1520 comprises a signpost 1521 and two stick figures 1522, 1523, on a board 1524.
- Macroblock prediction can help to reduce the search window size.
- Coding efficiency can be accomplished by taking advantage of the fact that motion vectors tend to be highly correlated between macroblocks.
- the horizontal component may be compared with the previously valid horizontal motion vector and the difference coded.
- a difference for the vertical component can be calculated before coding.
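The differential coding of motion vectors described above can be illustrated with a short sketch. It assumes every vector in the list is valid and that the first vector is predicted from (0, 0); both are simplifying assumptions, and `motion_vector_deltas` is a hypothetical helper name.

```python
def motion_vector_deltas(vectors, initial=(0, 0)):
    """Return each (dx, dy) vector as a difference from the previous valid vector."""
    prev_x, prev_y = initial
    deltas = []
    for dx, dy in vectors:
        deltas.append((dx - prev_x, dy - prev_y))  # horizontal and vertical differences
        prev_x, prev_y = dx, dy
    return deltas


vectors = [(4, 2), (5, 2), (5, 3), (5, 3)]   # made-up vectors of neighbouring macroblocks
print(motion_vector_deltas(vectors))         # [(4, 2), (1, 0), (0, 1), (0, 0)]
```

Because neighbouring vectors are highly correlated, most differences are small or zero, which is what makes them cheap to code.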
- the subtraction of the current frame 1520 from the previous frame 1510 yields a residual picture 1530 including the second stick figure 1532 with upraised arms and the board 1534.
- This residual picture 1530 is compressed and transmitted.
- this residual picture 1530 is less complex to code and takes less memory than compressing and transmitting the entire current frame 1520.
- not every macroblock search will result in an acceptable match. If the encoder determines that no acceptable match exists, the particular macroblock can be encoded.
- Figures 21A-B illustrate example current 1601 and previous 1650 frames that are processed during motion estimation.
- the previous frame 1650 precedes the current frame 1601 in chronological order for the video frame series.
- Each frame is comprised of a very large number of pixels that extend across the frame in horizontal and vertical directions.
- the current frame 1601 comprises a number of macroblocks 1610, 1621-1627, that are arranged horizontally and vertically.
- the current frame 1601 is divided into equally sized, non-overlapping macroblocks 1610, 1621-1627.
- Each of these square macroblocks is further subdivided into an equal number of rows and columns. For the sample macroblock 1610, a matrix of eight rows and eight columns is visible.
- Each square of a macroblock 1610 corresponds to a single pixel.
- this sample macroblock 1610 includes 64 pixels.
- macroblocks have dimensions of sixteen rows by sixteen columns (16 x 16).
- data for each pixel comprises eight data bits or a single byte.
- pixel data can comprise other sizes, including nibbles, words, double words, quad words, etc.
- the previous frame 1650 includes a search window 1651 in which a portion of the frame is enclosed by the search window 1651.
- the search window 1651 comprises the area in which a current macroblock from the current frame 1601 is attempted to be matched.
- the search window is divided into a number of equally sized macroblocks.
- An example macroblock 1660 having eight rows and eight columns is illustrated here, but macroblocks can have various other dimensions, including sixteen rows and sixteen columns.
- individual macroblocks from the search window 1651 are compared in sequence with a current macroblock from the current frame to find an acceptable match.
- the upper left corner of the first previous macroblock in the search window 1651 is lined up with the upper left corner of the search window 1651.
- the direction of macroblock processing proceeds from the left side of the search window towards the right edge, pixel by pixel.
- the leftmost edge of the second macroblock is one pixel over from the left edge of the search window, and so on.
- the algorithm returns to the left edge of the search window and proceeds from the first pixel of the next line. This process repeats until macroblocks for each of the pixels in the search window 1651 have been compared against the current macroblock.
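The pixel by pixel scan order can be sketched as a generator of candidate block positions. This is only an illustration: it assumes that every candidate macroblock must lie wholly inside the search window, and the name `candidate_positions` is hypothetical.

```python
def candidate_positions(window_width, window_height, block_size):
    """Yield (row, column) of the upper left pixel of every candidate block.

    Positions advance one pixel at a time from left to right, then return to
    the left edge on the next pixel line.
    """
    for row in range(window_height - block_size + 1):
        for col in range(window_width - block_size + 1):
            yield row, col


positions = list(candidate_positions(window_width=10, window_height=9, block_size=8))
print(positions)  # [(0, 0), (0, 1), (0, 2), (1, 0), (1, 1), (1, 2)]
```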
- Figures 22A-D illustrate the operations of motion estimation on frames in accordance with one embodiment of the present invention.
- Embodiments of the present invention as discussed herein involve full search motion estimation algorithms. With a full search, macroblocks for all pixel positions in a search window of a previous (reference) frame are attempted to be matched with a macroblock from the current frame.
- the fast full search motion estimation algorithm employs SIMD shift right merge operations to quickly process packed data from frames.
- the SIMD shift right merge operations of one embodiment can also improve processor performance by reducing the number of data loads, especially unaligned memory loads, and other data manipulation instructions.
- the motion estimation procedure of one embodiment can be described in pseudo code as:
    for each current block in both x and y direction {
        for all mod 1 position in the y axis of the search window {
            for all mod 4 positions in the x axis of the search window {
                load pixel data from memory to registers;
                attempt block match for 4 adjacent previous macroblocks;
                keep track of minimum value and index location for that previous macroblock;
            }
        }
    }
- previous macroblocks for each pixel location in the search window are evaluated against a current macroblock.
- this embodiment evaluates four adjacent previous macroblocks per loop.
- Pixel data is loaded from memory with memory aligned loads into registers. Through the use of shift right merge operations, this pixel data can be manipulated to form various combinations of shifted data segments appropriate to adjacent macroblocks.
- the first, second, third, and fourth pixels on the first line of a first previous macroblock can start at memory addresses 0, 1, 2, and 3, respectively. For the first pixel of the first line of a second previous macroblock, that pixel begins at memory address 1.
- a right shift merge operation on the register data can produce that necessary pixel line data for the second previous macroblock by reusing data already loaded from memory for the first previous macroblock, resulting in time and resource savings.
- Similar shift merge operations can generate the line data for other adjacent previous macroblocks like the third, fourth, and so on.
- the block matching procedure for the motion estimation algorithm of one embodiment can be described in pseudo code as:
    block match for four adjacent previous macroblocks {
        for each line of 1 to m {
            load pixel data for one line of current macroblock;
            aligned memory loads of two consecutive chunks of pixel data for one line of search window from memory to registers;
            generate proper pixel data lines for each of the four adjacent previous macroblocks from loaded data through shift right merge operations;
            calculate sum of absolute differences between a line from a previous macroblock and corresponding line from current macroblock for each of four adjacent previous macroblocks;
            accumulate four individual sum of absolute differences values for each of four adjacent previous macroblocks;
        }
    }
- Another embodiment having at least 8 packed data registers to hold 4 different combinations of pixel data generated from shift right merge operations on two 8 data segment wide data chunks could be able to operate on 4 adjacent previous macroblocks with simply two aligned 8 data segment wide memory loads.
- Four of the 8 packed data registers are used for computation overhead: holding the first 8 data segments from the previous frame, the next 8 data segments of the previous frame, 8 data segments for the current frame, and 8 data segments from shift right merge operations.
- the other four packed data registers are used for accumulating totals for the sum of absolute differences (SAD) values for each of the four macroblocks. More packed data registers may be added for the SAD calculations and accumulations though to increase the number of reference macroblocks that are processed together. Thus if four additional packed data registers are available, four additional previous macroblocks can be processed also.
- the number of packed data registers available to hold accumulated sum of absolute differences in one embodiment can limit how many macroblocks can be processed at a time.
- memory accesses have specific granularities and are aligned with certain boundaries. For instance, one processor can make memory accesses based on 16 or 32 byte blocks. In that case, accessing data not aligned at a 16 or 32 byte boundary could require an unaligned memory access, which is costly in execution time and resources. Even worse, a desired piece of data may cross a boundary and overlap multiple memory blocks. Cache line splits that would require unaligned loads in order to access data located on two separate cache lines, can be costly. Data lines that cross a memory page boundary are even worse.
- Embodiments of the present invention employ shift right merge operations to efficiently process the data.
- two consecutive memory blocks are loaded at aligned memory boundaries and held in registers for multiple uses.
- Shift right merge operations can take these memory blocks and shift the data segments in them by the necessary distances to obtain the correct data line.
- a shift right merge instruction can take the two already loaded memory blocks and shift one data byte out of the second block and shift one data byte into the second block from the first to generate the data for the first line of the second macroblock, without having to perform an unaligned load.
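How two aligned loads plus a shift right merge can stand in for an unaligned load may be easier to see in a small model. The sketch below assumes 8-byte chunks and a little-endian register image (lowest address in the least significant byte); `load_aligned` and `line_at_offset` are hypothetical names, not instructions of any real processor.

```python
CHUNK = 8  # bytes per aligned load (assumed register width)


def load_aligned(memory, address):
    """Model an aligned load: the address must be a multiple of CHUNK."""
    if address % CHUNK != 0:
        raise ValueError("unaligned address")
    return int.from_bytes(memory[address:address + CHUNK], "little")


def line_at_offset(low_chunk, high_chunk, offset):
    """Shift right merge of two loaded chunks by `offset` bytes (0..CHUNK)."""
    merged = (high_chunk << (8 * CHUNK)) | low_chunk
    return (merged >> (8 * offset)) & ((1 << (8 * CHUNK)) - 1)


memory = bytes(range(100, 116))        # 16 bytes of made-up pixel data
low = load_aligned(memory, 0)          # addresses 0..7
high = load_aligned(memory, 8)         # addresses 8..15
line = line_at_offset(low, high, 1)    # data that starts at address 1
print(list(line.to_bytes(CHUNK, "little")))  # [101, 102, 103, 104, 105, 106, 107, 108]
```

Both chunks stay in registers, so every offset from 0 to 8 can be produced without touching memory again.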
- Embodiments of the motion estimation can also break dependency chains based on how the algorithm is implemented.
- FIG. 22A illustrates the progression of the current macroblocks across the current frame 1701.
- each current macroblock 1710 is divided into 16 rows and 16 columns, thus comprising 256 individual pixels.
- the pixels in each macroblock 1710 are processed an individual row 1711 at a time.
- the next current macroblock is processed.
- the macroblocks of this embodiment are processed in a horizontal direction 1720 from the left side to the right side of the current frame 1701 at macroblock sized steps.
- the current macroblocks do not overlap in this embodiment and the current macroblocks are arranged such that each macroblock sits adjacent to the next.
- the first macroblock can extend from pixel column 1 to pixel column 16.
- the second macroblock would extend from column 17 to column 32, and so on.
- the process returns 1722 to the left edge and drops down by one macroblock height, sixteen rows in this example.
- the macroblocks one macroblock sized step down are then processed horizontally 1724 from left to right until attempted matches for the entire frame 1701 are completed.
- Figure 22B illustrates the progression of the macroblocks across the search window 1751 of a previous (reference) frame.
- the search window 1751 can be focused on a certain area and thus be smaller than the entire previous frame.
- the search window can overlap the previous frame completely.
- each previous macroblock 1760, 1765, 1770, 1775 is divided into 16 rows and 16 columns, for a total of 256 pixels in each macroblock.
- four previous macroblocks 1760, 1765, 1770, 1775, of the search window 1751 are processed in parallel against a single current block in search of a match.
- each previous macroblock 1760, 1765, 1770, 1775, in a search window 1751 can and do overlap as in this example.
- each previous macroblock is shifted by one pixel column.
- the leftmost pixel on the first row of BLK 1 is pixel 1761, for BLK 2 it is pixel 1766, for BLK 3 it is pixel 1771, and for BLK 4 it is pixel 1776.
- each row of a previous macroblock 1760, 1765, 1770, 1775 is compared against a corresponding row of a current block. For example, row 1 of BLK 1, BLK 2 1765, BLK 3 1770, and BLK 4 1775 is each processed with a current block row 1.
- the row by row comparison for the four overlapping, adjacent macroblocks continues until all 16 rows of the macroblocks are done.
- the algorithm of this embodiment shifts over by four pixel columns to operate on the next four macroblocks.
- the leftmost first pixel column for the next four macroblocks would be pixel 1796, pixel 1797, pixel 1798, and pixel 1799, respectively.
- the previous macroblock processing continues rightward 1780 across the search window 1751, wrapping around 1782 to restart down one pixel row at the leftmost pixel of the search window 1751, until the search window is completed.
- the current macroblocks of a current frame of this embodiment do not overlap and next individual macroblocks are a macroblock height or width
- the previous macroblocks of a previous or reference frame do overlap and next macroblocks are incremented by a single pixel row or column.
- the four reference macroblocks 1760, 1765, 1770, 1775, of this example are adjacent and differ by a single pixel column over
- any macroblock in the search window 1751 that overlaps a specified region around a chosen pixel location can be processed together with the macroblock at that pixel location.
- the macroblock 1760 at pixel 1796 is being processed. Any macroblock within a 16 x 16 window around pixel 1796 can be handled together with macroblock 1760.
- the 16 x 16 window of this example is due to the dimensions of a macroblock and the line width of a row. In this case, one row or data line has 16 data elements. Because the block matching function for this embodiment of a motion estimation algorithm can load two data lines of 16 data elements and perform shift right merges to generate various data lines having shifted/merged versions of the two data lines, other macroblocks that overlap the 16 x 16 window for which data will be loaded for this macroblock will be able to at least partially reuse that loaded data. Thus any macroblock overlapping the macroblock 1760, such as macroblocks 1765, 1770, 1775, or a macroblock starting at the bottom right pixel position of macroblock 1760, can be processed together with macroblock 1760. The difference in the amount of overlap influences the amount of data that can be reused from previous data loads.
- the macroblock analysis comprises a comparison between a previous (reference) macroblock and a current macroblock on a row by row basis to obtain a sum of absolute differences value between two macroblocks.
- the sum of absolute differences value can indicate how different the macroblocks are and how close of a match exists.
- Each previous macroblock for one embodiment can be represented by a value obtained by accumulating the sum of absolute differences for all sixteen rows in the macroblock. For the current macroblock that is being analyzed, a notation of the closest matching macroblock is maintained. For instance, the minimum accumulated sum of absolute differences value and the location index for that corresponding previous macroblock is tracked.
- the accumulated sum of each previous macroblock is compared against the minimum value. If the more recent previous macroblock has a smaller accumulated differences value than that of the tracked minimum value, thus indicating a closer match than the existing closest match, then the accumulated differences value and the index information for that recent previous macroblock becomes the new minimum differences value and index.
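The minimum tracking described above amounts to a running comparison. A minimal sketch follows; the index format and the SAD values are made up for illustration, and `closest_match` is a hypothetical name.

```python
def closest_match(accumulated_sads):
    """Return (min_value, index) over (index, accumulated SAD) pairs.

    A later block replaces the tracked minimum only if its value is
    strictly smaller, so the earliest of equal matches is kept.
    """
    min_value, min_index = None, None
    for index, value in accumulated_sads:
        if min_value is None or value < min_value:
            min_value, min_index = value, index
    return min_value, min_index


# (row, column) index of each previous macroblock with its accumulated SAD
sads = [((0, 0), 912), ((0, 1), 340), ((0, 2), 57), ((0, 3), 340)]
print(closest_match(sads))  # (57, (0, 2))
```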
- the indexed macroblock with the minimum differences value can be used to help obtain a residual picture for compression of that current frame.
- Figure 22C illustrates the parallel processing of four reference macroblocks
- the data for the pixels in the search window are ordered as "A, B, C, D, E, F, G, H, I, J, K, L, M, N, O, P" 1860, wherein "A” is at the lowest address position (0) in the data set and "P" is at the highest address position (15).
- This set of pixels 1860 comprises two sections 1861, 1862, having eight (m) data segments each.
- a right shift merge operation allows embodiments of the present invention to manipulate operands with the two data sections 1861, 1862, and to generate properly aligned row data 1830 for the different previous macroblocks 1810, 1815, 1820, 1825.
- Each macroblock, previous 1810, 1815, 1820, 1825, and current 1840 has a size of m rows by m columns.
- m is equal to eight in this example.
- Alternative embodiments can have different sized macroblocks wherein m is equal to 4, 16, 32, 64, 128, 256, etc., for example.
- the motion estimation algorithm is applied to the first row of these four previous blocks 1810, 1815, 1820, 1825, with that of the current block 1840.
- the pixel data including the two data sections 1861, 1862, for two macroblocks width (2m) is loaded from memory with two aligned memory load operations and held in temporary registers. Shift right merge operations on these two data sections 1861, 1862, allow for the generation of nine possible combinations of row data 1830 without numerous memory accesses. Furthermore, unaligned memory loads, which are costly in execution time and resources, can be avoided.
- the two data sections 1861, 1862 are aligned with byte boundaries.
- ROW 1 1811 comprises "A, B, C, D, E, F, G, H".
- ROW 1 1816 of BLOCK 2 1815 comprises "B, C, D, E, F, G, H, I".
- BLOCK 2 1815 begins with pixel data B whereas BLOCK 1 1810 begins with pixel data A and the second pixel data is B.
- BLOCK 3 1820 is one more pixel over to the right and ROW 1 1821 of BLOCK 3 1820 begins with pixel data C and comprises "C, D, E, F, G, H, I, J".
- a shift right merge operation on operands of the two data sections 1861, 1862, with a shift count of two produces the BLOCK 3 ROW 1 data.
- ROW 1 1826 of BLOCK 4 1825 is comprised of "D, E, F, G, H, I, J, K". This data can be produced with a shift right merge operation with a shift count of three on the same data operands.
- each row of a previous macroblock 1810, 1815, 1820, 1825 is compared with the corresponding row of the current block 1840 to obtain a sum of absolute differences value.
- ROW 1 1811 of BLOCK 1 1810 is compared with ROW 1 1841 of the current block
- a running total for the sum of absolute differences is accumulated in a temporary register.
- four registers accumulate the sum of absolute differences until all m rows of that macroblock are done.
- the accumulated value for each block is compared with the existing minimum differences value as part of a best macroblock match search.
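The per-line processing of four adjacent, overlapping previous macroblocks can be sketched as one loop with four accumulators. In this simplified model each search window line is given as two m-pixel chunks in address order, and slicing the concatenated chunks stands in for a shift right merge with counts 0 to 3; `block_match_four` is a hypothetical name and m must be at least 3 for this sketch.

```python
def block_match_four(search_lines, current_block):
    """Accumulate SAD totals for four adjacent, overlapping previous blocks.

    search_lines holds one (low, high) pair of m-pixel chunks per line;
    current_block holds m rows of m pixels.
    """
    totals = [0, 0, 0, 0]                        # one accumulator per previous block
    for (low, high), current_row in zip(search_lines, current_block):
        data = list(low) + list(high)            # lowest address first
        m = len(low)
        for count in range(4):
            candidate = data[count:count + m]    # models a shift right merge by `count`
            totals[count] += sum(abs(p - c) for p, c in zip(candidate, current_row))
    return totals


lines = [list(range(base, base + 8)) for base in (0, 10, 20, 30)]  # made-up pixel lines
search_lines = [(line[:4], line[4:]) for line in lines]
current_block = [line[2:6] for line in lines]   # identical to the block at offset 2
totals = block_match_four(search_lines, current_block)
print(totals)  # [32, 16, 0, 16]
```

The third accumulator ends at 0 because the current block was copied from that position, which is the kind of result the minimum search is looking for.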
- although this example describes the processing of four adjacent, overlapping previous macroblocks, other macroblocks that overlap the first block BLK 1810 in the search window can also be processed together with the data loads for BLK 1810 if the data lines are relevant.
- a macroblock within a 16 x 16 window around the present macroblock being processed can be processed too.
- FIG. 22D illustrates the sum of absolute differences (SAD) operations 1940 and summation of those SAD values.
- SAD absolute differences
- each of the rows from ROW A to ROW P for the reference macroblock BLOCK 1 1900 and its counterpart for the current macroblock 1920 undergo a SAD operation 1940.
- the SAD operation 1940 compares the data representing the pixels in each row and calculates a value representing the absolute differences between the two rows, one from the previous macroblock 1900 and one from the current macroblock 1920.
- the values from these SAD operations 1940 for all the rows, A through P, are summed together as a block sum 1942.
- This block sum 1942 provides an accumulated value of the sum of absolute differences for the entire previous macroblock 1900 and the current macroblock 1920.
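The row SAD and the block sum reduce to two small functions. The sketch below uses plain Python lists and made-up pixel values; `row_sad` and `block_sum` are hypothetical names for the SAD operation and the block sum.

```python
def row_sad(previous_row, current_row):
    """Sum of absolute differences between two rows of pixel values."""
    return sum(abs(p - c) for p, c in zip(previous_row, current_row))


def block_sum(previous_block, current_block):
    """Accumulate the row SAD values over every row of the two macroblocks."""
    return sum(row_sad(p, c) for p, c in zip(previous_block, current_block))


previous_block = [[10, 12], [8, 9]]    # tiny 2 x 2 blocks keep the arithmetic visible
current_block = [[11, 12], [5, 9]]
print(row_sad(previous_block[0], current_block[0]))  # 1
print(block_sum(previous_block, current_block))      # 4
```

A block sum of 0 means the two macroblocks are identical; larger values indicate a poorer match.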
- the motion estimation algorithm can determine how similar or close of a match the previous macroblock 1900 is with respect to this current macroblock 1920.
- this embodiment operates on four reference macroblocks at a time
- alternative embodiments can work on a different number of macroblocks depending on the amount of pixel data loaded and the number of available registers.
- a variety of registers can be used during a motion estimation process.
- extended registers such as mm registers of MMX technology or XMM registers of SSE2 technology can be used to hold packed data like the pixel data.
- a 64 bits wide MMX register can hold eight bytes, or eight individual pixels if each pixel has eight bits of data.
- a 128 bits wide XMM register can hold sixteen bytes, or sixteen individual pixels if each pixel has eight bits of data.
- registers of other sizes such as 32/128/256/512 bits wide, that can hold packed data can also be used with embodiments of the present invention.
- calculations that do not require a packed data register such as regular integer operations, can use integer registers and integer hardware.
- FIG. 23A is a flow chart illustrating one embodiment of a method to predict and estimate motion.
- the tracked minimum (min) value and the index location for that minimum value are initialized.
- the tracked min value and index indicate which of the processed previous (reference) macroblocks from the search window is the closest match to the current macroblock.
- a check is made at block 2004 as to whether all the desired macroblocks in the current frame have been completed. If so, this portion of the motion estimation algorithm is done. If not all the desired current macroblocks have been processed, an unprocessed current macroblock is selected for the current frame at block 2006.
- the block matching proceeds from the first pixel position in the search window of the previous (reference) frame at block 2008.
- a check of whether the search window has been completed is made. With the first pass, none of the search window has been processed. But with a subsequent pass, if the entire search window has been processed, the flow returns to block 2004 to determine if other current macroblocks are available.
- If the entire search window has not been analyzed, a check at block 2012 is made to determine if all the pixels along this X axis row have been processed. If this row has been done, the row count increments to the next row and the flow returns to block 2010 to see if more macroblocks on this new row are available in the search window.
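The control flow of FIG. 23A for a single current macroblock can be approximated by the loop below. It is a plain scalar sketch, assuming a search window given as a top-left corner plus a count of candidate rows and columns; the names `full_search`, `sad_at`, and the window parameters are hypothetical and do not come from the patent text.

```python
def sad_at(ref_frame, top, left, cur_block):
    """SAD between cur_block and the same-sized block of ref_frame at (top, left)."""
    size = len(cur_block)
    return sum(
        abs(ref_frame[top + y][left + x] - cur_block[y][x])
        for y in range(size)
        for x in range(size)
    )


def full_search(ref_frame, cur_block, win_top, win_left, win_rows, win_cols):
    """Visit every candidate position in the window, tracking the min SAD and its index."""
    min_value, min_index = None, None                     # initialize tracked min and index
    for row in range(win_top, win_top + win_rows):        # advance to the next row when one is done
        for col in range(win_left, win_left + win_cols):  # step along the X axis of this row
            value = sad_at(ref_frame, row, col, cur_block)
            if min_value is None or value < min_value:    # closer match than any seen so far
                min_value, min_index = value, (row, col)
    return min_value, min_index


# Synthetic 12 x 12 reference frame; the current 4 x 4 block is cut from (row 3, col 5).
ref_frame = [[7 * x + 13 * y for x in range(12)] for y in range(12)]
cur_block = [row[5:9] for row in ref_frame[3:7]]

print(full_search(ref_frame, cur_block, 0, 0, 9, 9))  # (0, (3, 5))
```

The example uses a 4 x 4 block to keep the data small; the same loop applies unchanged to 16 x 16 macroblocks.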
- FIG. 23B is a flow chart further describing the block matching of FIG. 23A.
- the data for the reference macroblock and the current macroblock are loaded.
- the reference macroblock data is loaded as two packed data chunks that include data for a number of consecutive pixels.
- each packed data chunk comprises eight data elements.
- shift right merge operations are performed as needed on the data chunks to obtain the correct data chunk.
- shift right merge operations can generate the data chunks corresponding to the lines located in each macroblock.
- the data chunk for each adjacent macroblock one pixel position over is also shifted one over, wherein the macroblocks appear to slide across a search window one pixel at a time for each pixel row in the search window.
- the operations at blocks 2226, 2228, 2230, and 2232, are applied to each of the four previous macroblocks being processed together.
- all four macroblocks undergo the same operation before the next operation occurs.
- a single previous macroblock may complete all the operations before the next previous macroblock with a data chunk including the appropriately shifted data segments is processed.
- the sum of absolute differences between the corresponding lines of the previous macroblock and the current macroblock is calculated for each row of these macroblocks at block 2226.
- the sums of absolute differences for all the lines in the previous macroblock are accumulated together.
- the accumulated differences value for the previous macroblock is compared against the present minimum value. If the differences value for this previous macroblock is less than the present min value at block 2232, the min value is updated with this new differences value.
- the index is also updated to reflect the location of this previous macroblock to indicate that this previous macroblock is the closest match so far. But if the new differences value is greater than the present min value at block 2232, then this previous block is not a closer match than what has been matched so far.
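The way shift right merge lets adjacent macroblocks slide across the search window one pixel at a time can be modeled as follows. This is a simplified software model, not the instruction's definition: it assumes the operation behaves like concatenating two equal-width chunks, shifting right by a number of elements, and keeping one chunk's worth of elements, with index 0 as the lowest element. The function names are hypothetical, and rows are eight elements wide here for brevity.

```python
def shift_right_merge(low_chunk, high_chunk, count):
    """Simplified model of a shift right merge on two packed chunks.

    Assumption: the chunks are concatenated, shifted right by `count`
    elements, and the lowest `width` elements are kept.
    """
    width = len(low_chunk)
    if len(high_chunk) != width or not 0 <= count <= width:
        raise ValueError("chunks must be equal width and 0 <= count <= width")
    return (low_chunk + high_chunk)[count:count + width]


def row_sads_for_adjacent_blocks(low_chunk, high_chunk, cur_row, positions=4):
    """Row SAD for `positions` reference blocks, each one pixel further over."""
    return [
        sum(abs(r - c) for r, c in
            zip(shift_right_merge(low_chunk, high_chunk, shift), cur_row))
        for shift in range(positions)
    ]


# Sixteen consecutive reference pixels loaded as two eight-element chunks.
ref_pixels = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, 120, 130, 140, 150, 160]
low, high = ref_pixels[:8], ref_pixels[8:]
cur_row = ref_pixels[2:10]  # the current row matches the reference at offset 2

print(row_sads_for_adjacent_blocks(low, high, cur_row))  # [160, 80, 0, 80]
```

Loading the two chunks once and deriving each shifted row from them avoids a separate unaligned load per candidate position, which is the benefit the four-macroblocks-at-a-time scheme relies on.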
- Embodiments of motion estimation algorithms in accordance with the present invention can also improve processor and system performance with present hardware resources. But as technology continues to improve, embodiments of the present invention when combined with greater amounts of hardware resources and faster, more efficient logic circuits, can have an even more profound impact on improving performance. Thus, one efficient embodiment of the motion estimation can have different and greater impact across processor generations. Simply adding more resources in modern processor architectures alone does not guarantee better performance improvement. By also maintaining the efficiency of applications like one embodiment of the motion estimation and the shift right merge instruction (PSRMRG), larger performance improvements can be possible.
- lines/rows can be sixteen pixels wide or sixteen data elements wide, and macroblocks can be sixteen rows by sixteen columns.
Priority Applications (4)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
DE10297000T DE10297000B4 (en) | 2001-10-29 | 2002-10-28 | Method and apparatus for parallel data shifting to the right with data merging |
JP2003540797A JP4623963B2 (en) | 2001-10-29 | 2002-10-28 | Method and apparatus for efficiently filtering and convolving content data |
KR1020037017220A KR100602532B1 (en) | 2001-10-29 | 2002-10-28 | Method and apparatus for parallel shift right merge of data |
HK05101290A HK1068985A1 (en) | 2001-10-29 | 2005-02-16 | Method and apparatus for parallel shift right merge of data |
Applications Claiming Priority (4)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US09/952,891 US7085795B2 (en) | 2001-10-29 | 2001-10-29 | Apparatus and method for efficient filtering and convolution of content data |
US10/280,612 US7685212B2 (en) | 2001-10-29 | 2002-10-25 | Fast full search motion estimation with SIMD merge instruction |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2003038601A1 (en) | 2003-05-08 |
Family
ID=26960393
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/US2002/034404 WO2003038601A1 (en) | 2001-10-29 | 2002-10-28 | Method and apparatus for parallel shift right merge of data |
Country Status (8)
Country | Link |
---|---|
US (1) | US7685212B2 (en) |
JP (2) | JP4623963B2 (en) |
KR (1) | KR100602532B1 (en) |
CN (1) | CN1269027C (en) |
DE (1) | DE10297000B4 (en) |
HK (1) | HK1068985A1 (en) |
RU (1) | RU2273044C2 (en) |
WO (1) | WO2003038601A1 (en) |
Cited By (13)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2004040439A2 (en) * | 2002-10-25 | 2004-05-13 | Intel Corporation | Method and apparatus for parallel shift right merge of data |
CN1297887C (en) * | 2003-11-28 | 2007-01-31 | 凌阳科技股份有限公司 | Processor and method for trans-boundary aligned multiple transient memory data |
JP2007528545A (en) * | 2004-03-10 | 2007-10-11 | アーム・リミテッド | Apparatus and method for inserting bits into a data word |
US7340495B2 (en) | 2001-10-29 | 2008-03-04 | Intel Corporation | Superior misaligned memory load and copy using merge hardware |
US7685212B2 (en) | 2001-10-29 | 2010-03-23 | Intel Corporation | Fast full search motion estimation with SIMD merge instruction |
US7818356B2 (en) | 2001-10-29 | 2010-10-19 | Intel Corporation | Bitstream buffer manipulation with a SIMD merge instruction |
US8214626B2 (en) | 2001-10-29 | 2012-07-03 | Intel Corporation | Method and apparatus for shuffling data |
JP2015133132A (en) * | 2004-11-03 | 2015-07-23 | インテル コーポレイション | programmable data processing circuit that supports SIMD instruction |
US11360744B2 (en) * | 2017-06-29 | 2022-06-14 | Beijing Qingying Machine Visual Technology Co., Ltd. | Two-dimensional data matching method, device and logic circuit |
EP4202651A1 (en) * | 2021-12-23 | 2023-06-28 | INTEL Corporation | Apparatus and method for vector packed concatenate and shift of specific portions of quadwords |
US11720800B2 (en) | 2016-10-04 | 2023-08-08 | Magic Leap, Inc. | Efficient data layouts for convolutional neural networks |
US11803377B2 (en) | 2017-09-08 | 2023-10-31 | Oracle International Corporation | Efficient direct convolution using SIMD instructions |
WO2024003526A1 (en) * | 2022-06-30 | 2024-01-04 | Arm Limited | Vector extract and merge instruction |
Families Citing this family (45)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US7739319B2 (en) * | 2001-10-29 | 2010-06-15 | Intel Corporation | Method and apparatus for parallel table lookup using SIMD instructions |
US7624138B2 (en) * | 2001-10-29 | 2009-11-24 | Intel Corporation | Method and apparatus for efficient integer transform |
US7725521B2 (en) * | 2001-10-29 | 2010-05-25 | Intel Corporation | Method and apparatus for computing matrix transformations |
US7631025B2 (en) * | 2001-10-29 | 2009-12-08 | Intel Corporation | Method and apparatus for rearranging data between multiple registers |
US7630585B2 (en) * | 2004-06-25 | 2009-12-08 | Intel Corporation | Image processing using unaligned memory load instructions |
JP4488805B2 (en) * | 2004-06-25 | 2010-06-23 | パナソニック株式会社 | Motion vector detection apparatus and method |
KR100774068B1 (en) * | 2004-09-14 | 2007-11-06 | 마츠시타 덴끼 산교 가부시키가이샤 | Barrel shift device |
JP4453518B2 (en) * | 2004-10-29 | 2010-04-21 | ソニー株式会社 | Encoding and decoding apparatus and encoding and decoding method |
US20070073925A1 (en) * | 2005-09-28 | 2007-03-29 | Arc International (Uk) Limited | Systems and methods for synchronizing multiple processing engines of a microprocessor |
US9432679B2 (en) * | 2005-11-01 | 2016-08-30 | Entropic Communications, Llc | Data processing system |
JP5145659B2 (en) * | 2006-06-19 | 2013-02-20 | 日本電気株式会社 | Vector renaming method and vector computer |
US8250618B2 (en) * | 2006-09-18 | 2012-08-21 | Elemental Technologies, Inc. | Real-time network adaptive digital video encoding/decoding |
JP4686435B2 (en) * | 2006-10-27 | 2011-05-25 | 株式会社東芝 | Arithmetic unit |
WO2008079041A1 (en) * | 2006-12-27 | 2008-07-03 | Intel Corporation | Methods and apparatus to decode and encode video information |
KR101520027B1 (en) * | 2007-06-21 | 2015-05-14 | 삼성전자주식회사 | Method and apparatus for motion estimation |
US8184715B1 (en) | 2007-08-09 | 2012-05-22 | Elemental Technologies, Inc. | Method for efficiently executing video encoding operations on stream processor architectures |
JP2009055291A (en) * | 2007-08-27 | 2009-03-12 | Oki Electric Ind Co Ltd | Motion detecting circuit |
US8121197B2 (en) * | 2007-11-13 | 2012-02-21 | Elemental Technologies, Inc. | Video encoding and decoding using parallel processors |
US20120027262A1 (en) * | 2007-12-12 | 2012-02-02 | Trident Microsystems, Inc. | Block Matching In Motion Estimation |
US8078836B2 (en) | 2007-12-30 | 2011-12-13 | Intel Corporation | Vector shuffle instructions operating on multiple lanes each having a plurality of data elements using a common set of per-lane control bits |
US8755515B1 (en) | 2008-09-29 | 2014-06-17 | Wai Wu | Parallel signal processing system and method |
EP2424243B1 (en) * | 2010-08-31 | 2017-04-05 | OCT Circuit Technologies International Limited | Motion estimation using integral projection |
US20120254589A1 (en) * | 2011-04-01 | 2012-10-04 | Jesus Corbal San Adrian | System, apparatus, and method for aligning registers |
DE102011075261A1 (en) * | 2011-05-04 | 2012-11-08 | Robert Bosch Gmbh | Method for editing video data and an arrangement for carrying out the method |
US9823928B2 (en) * | 2011-09-30 | 2017-11-21 | Qualcomm Incorporated | FIFO load instruction |
US9715383B2 (en) * | 2012-03-15 | 2017-07-25 | International Business Machines Corporation | Vector find element equal instruction |
JP5730812B2 (en) * | 2012-05-02 | 2015-06-10 | 日本電信電話株式会社 | Arithmetic apparatus, method and program |
US8767252B2 (en) | 2012-06-07 | 2014-07-01 | Xerox Corporation | System and method for merged image alignment in raster image data |
CN103489427B (en) * | 2012-06-14 | 2015-12-02 | 深圳深讯和科技有限公司 | YUV converts the method and system that RGB and RGB converts YUV to |
US9195521B2 (en) * | 2012-07-05 | 2015-11-24 | Tencent Technology (Shenzhen) Co., Ltd. | Methods for software systems and software systems using the same |
US9395988B2 (en) * | 2013-03-08 | 2016-07-19 | Samsung Electronics Co., Ltd. | Micro-ops including packed source and destination fields |
CN104243085A (en) * | 2013-06-08 | 2014-12-24 | 阿尔卡特朗讯 | Method and device used for coding and recombining bit data and base station controller |
US9880845B2 (en) * | 2013-11-15 | 2018-01-30 | Qualcomm Incorporated | Vector processing engines (VPEs) employing format conversion circuitry in data flow paths between vector data memory and execution units to provide in-flight format-converting of input vector data to execution units for vector processing operations, and related vector processor systems and methods |
EP3001307B1 (en) * | 2014-09-25 | 2019-11-13 | Intel Corporation | Bit shuffle processors, methods, systems, and instructions |
US20160125263A1 (en) | 2014-11-03 | 2016-05-05 | Texas Instruments Incorporated | Method to compute sliding window block sum using instruction based selective horizontal addition in vector processor |
GB2540939B (en) * | 2015-07-31 | 2019-01-23 | Advanced Risc Mach Ltd | An apparatus and method for performing a splice operation |
EP3336692B1 (en) * | 2016-12-13 | 2020-04-29 | Arm Ltd | Replicate partition instruction |
JP6733569B2 (en) * | 2017-02-06 | 2020-08-05 | 富士通株式会社 | Shift operation circuit and shift operation method |
US10481870B2 (en) | 2017-05-12 | 2019-11-19 | Google Llc | Circuit to perform dual input value absolute value and sum operation |
CN108540799B (en) * | 2018-05-16 | 2022-06-03 | 重庆堂堂网络科技有限公司 | Compression method capable of accurately representing difference between two frames of images of video file |
CN110221807B (en) * | 2019-06-06 | 2021-08-03 | 龙芯中科(合肥)技术有限公司 | Data shifting method, device, equipment and computer readable storage medium |
JP6979987B2 (en) * | 2019-07-31 | 2021-12-15 | 株式会社ソニー・インタラクティブエンタテインメント | Information processing equipment |
US11601656B2 (en) * | 2021-06-16 | 2023-03-07 | Western Digital Technologies, Inc. | Video processing in a data storage device |
CN114816531B (en) * | 2022-04-18 | 2023-05-02 | 海飞科(南京)信息技术有限公司 | Method for implementing large bit width addition operand fetch and add operation using narrow addition data channel |
US11941397B1 (en) * | 2022-05-31 | 2024-03-26 | Amazon Technologies, Inc. | Machine instructions for decoding acceleration including fuse input instructions to fuse multiple JPEG data blocks together to take advantage of a full SIMD width of a processor |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
EP0130380A2 (en) * | 1983-06-30 | 1985-01-09 | International Business Machines Corporation | Mechanism for implementing one machine cycle executable mask and rotate instructions in a primitive instruction set computing system |
EP0363176A2 (en) * | 1988-10-07 | 1990-04-11 | International Business Machines Corporation | Word organised data processors |
US5909572A (en) * | 1996-12-02 | 1999-06-01 | Compaq Computer Corp. | System and method for conditionally moving an operand from a source register to a destination register |
US5933650A (en) * | 1997-10-09 | 1999-08-03 | Mips Technologies, Inc. | Alignment and ordering of vector elements for single instruction multiple data processing |
US6115812A (en) * | 1998-04-01 | 2000-09-05 | Intel Corporation | Method and apparatus for efficient vertical SIMD computations |
Family Cites Families (70)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US3711692A (en) | 1971-03-15 | 1973-01-16 | Goodyear Aerospace Corp | Determination of number of ones in a data field by addition |
US3723715A (en) | 1971-08-25 | 1973-03-27 | Ibm | Fast modulo threshold operator binary adder for multi-number additions |
US4139899A (en) | 1976-10-18 | 1979-02-13 | Burroughs Corporation | Shift network having a mask generator and a rotator |
US4161784A (en) | 1978-01-05 | 1979-07-17 | Honeywell Information Systems, Inc. | Microprogrammable floating point arithmetic unit capable of performing arithmetic operations on long and short operands |
US4418383A (en) | 1980-06-30 | 1983-11-29 | International Business Machines Corporation | Data flow component for processor and microprocessor systems |
US4393468A (en) | 1981-03-26 | 1983-07-12 | Advanced Micro Devices, Inc. | Bit slice microprogrammable processor for signal processing applications |
JPS57209570A (en) | 1981-06-19 | 1982-12-22 | Fujitsu Ltd | Vector processing device |
US4498177A (en) | 1982-08-30 | 1985-02-05 | Sperry Corporation | M Out of N code checker circuit |
US4707800A (en) | 1985-03-04 | 1987-11-17 | Raytheon Company | Adder/substractor for variable length numbers |
JPS6297060A (en) | 1985-10-23 | 1987-05-06 | Mitsubishi Electric Corp | Digital signal processor |
US4989168A (en) | 1987-11-30 | 1991-01-29 | Fujitsu Limited | Multiplying unit in a computer system, capable of population counting |
US5019968A (en) | 1988-03-29 | 1991-05-28 | Yulan Wang | Three-dimensional vector processor |
US4903228A (en) | 1988-11-09 | 1990-02-20 | International Business Machines Corporation | Single cycle merge/logic unit |
KR920007505B1 (en) | 1989-02-02 | 1992-09-04 | 정호선 | Multiplier by using neural network |
US5081698A (en) | 1989-02-14 | 1992-01-14 | Intel Corporation | Method and apparatus for graphics display data manipulation |
US5497497A (en) | 1989-11-03 | 1996-03-05 | Compaq Computer Corp. | Method and apparatus for resetting multiple processors using a common ROM |
US5168571A (en) | 1990-01-24 | 1992-12-01 | International Business Machines Corporation | System for aligning bytes of variable multi-bytes length operand based on alu byte length and a number of unprocessed byte data |
US5268995A (en) | 1990-11-21 | 1993-12-07 | Motorola, Inc. | Method for executing graphics Z-compare and pixel merge instructions in a data processor |
US5680161A (en) | 1991-04-03 | 1997-10-21 | Radius Inc. | Method and apparatus for high speed graphics data compression |
US5187679A (en) | 1991-06-05 | 1993-02-16 | International Business Machines Corporation | Generalized 7/3 counters |
US5321810A (en) | 1991-08-21 | 1994-06-14 | Digital Equipment Corporation | Address method for computer graphics system |
US5423010A (en) | 1992-01-24 | 1995-06-06 | C-Cube Microsystems | Structure and method for packing and unpacking a stream of N-bit data to and from a stream of N-bit data words |
US5426783A (en) | 1992-11-02 | 1995-06-20 | Amdahl Corporation | System for processing eight bytes or less by the move, pack and unpack instruction of the ESA/390 instruction set |
US5408670A (en) | 1992-12-18 | 1995-04-18 | Xerox Corporation | Performing arithmetic in parallel on composite operands with packed multi-bit components |
US5465374A (en) | 1993-01-12 | 1995-11-07 | International Business Machines Corporation | Processor for processing data string by byte-by-byte |
US5524256A (en) | 1993-05-07 | 1996-06-04 | Apple Computer, Inc. | Method and system for reordering bytes in a data stream |
JPH0721034A (en) * | 1993-06-28 | 1995-01-24 | Fujitsu Ltd | Character string copying processing method |
US5625374A (en) | 1993-09-07 | 1997-04-29 | Apple Computer, Inc. | Method for parallel interpolation of images |
US5390135A (en) | 1993-11-29 | 1995-02-14 | Hewlett-Packard | Parallel shift and add circuit and method |
US5487159A (en) | 1993-12-23 | 1996-01-23 | Unisys Corporation | System for processing shift, mask, and merge operations in one instruction |
US5781457A (en) | 1994-03-08 | 1998-07-14 | Exponential Technology, Inc. | Merge/mask, rotate/shift, and boolean operations from two instruction sets executed in a vectored mux on a dual-ALU |
US5594437A (en) | 1994-08-01 | 1997-01-14 | Motorola, Inc. | Circuit and method of unpacking a serial bitstream |
US5579253A (en) | 1994-09-02 | 1996-11-26 | Lee; Ruby B. | Computer multiply instruction with a subresult selection option |
US6275834B1 (en) | 1994-12-01 | 2001-08-14 | Intel Corporation | Apparatus for performing packed shift operations |
ZA9510127B (en) | 1994-12-01 | 1996-06-06 | Intel Corp | Novel processor having shift operations |
US6738793B2 (en) | 1994-12-01 | 2004-05-18 | Intel Corporation | Processor capable of executing packed shift operations |
US5636352A (en) * | 1994-12-16 | 1997-06-03 | International Business Machines Corporation | Method and apparatus for utilizing condensed instructions |
GB9509989D0 (en) | 1995-05-17 | 1995-07-12 | Sgs Thomson Microelectronics | Manipulation of data |
US6381690B1 (en) | 1995-08-01 | 2002-04-30 | Hewlett-Packard Company | Processor for performing subword permutations and combinations |
US6643765B1 (en) | 1995-08-16 | 2003-11-04 | Microunity Systems Engineering, Inc. | Programmable processor with group floating point operations |
US6385634B1 (en) | 1995-08-31 | 2002-05-07 | Intel Corporation | Method for performing multiply-add operations on packed data |
US7085795B2 (en) | 2001-10-29 | 2006-08-01 | Intel Corporation | Apparatus and method for efficient filtering and convolution of content data |
CN101794212B (en) | 1995-08-31 | 2015-01-07 | 英特尔公司 | Apparatus for controlling site adjustment of shift grouped data |
US5819117A (en) | 1995-10-10 | 1998-10-06 | Microunity Systems Engineering, Inc. | Method and system for facilitating byte ordering interfacing of a computer system |
US5880979A (en) | 1995-12-21 | 1999-03-09 | Intel Corporation | System for providing the absolute difference of unsigned values |
US5793661A (en) | 1995-12-26 | 1998-08-11 | Intel Corporation | Method and apparatus for performing multiply and accumulate operations on packed data |
US5838984A (en) | 1996-08-19 | 1998-11-17 | Samsung Electronics Co., Ltd. | Single-instruction-multiple-data processing using multiple banks of vector registers |
US6223277B1 (en) | 1997-11-21 | 2001-04-24 | Texas Instruments Incorporated | Data processing circuit with packed data structure capability |
US6192467B1 (en) | 1998-03-31 | 2001-02-20 | Intel Corporation | Executing partial-width packed data instructions |
US6041404A (en) | 1998-03-31 | 2000-03-21 | Intel Corporation | Dual function system and method for shuffling packed data elements |
US6211892B1 (en) | 1998-03-31 | 2001-04-03 | Intel Corporation | System and method for performing an intra-add operation |
US6418529B1 (en) | 1998-03-31 | 2002-07-09 | Intel Corporation | Apparatus and method for performing intra-add operation |
US6288723B1 (en) | 1998-04-01 | 2001-09-11 | Intel Corporation | Method and apparatus for converting data format to a graphics card |
US6098087A (en) * | 1998-04-23 | 2000-08-01 | Infineon Technologies North America Corp. | Method and apparatus for performing shift operations on packed data |
US6263426B1 (en) | 1998-04-30 | 2001-07-17 | Intel Corporation | Conversion from packed floating point data to packed 8-bit integer data in different architectural registers |
US20020002666A1 (en) | 1998-10-12 | 2002-01-03 | Carole Dulong | Conditional operand selection using mask operations |
US6484255B1 (en) | 1999-09-20 | 2002-11-19 | Intel Corporation | Selective writing of data elements from packed data based upon a mask using predication |
US6546480B1 (en) | 1999-10-01 | 2003-04-08 | Hitachi, Ltd. | Instructions for arithmetic operations on vectored data |
US6430684B1 (en) | 1999-10-29 | 2002-08-06 | Texas Instruments Incorporated | Processor circuits, systems, and methods with efficient granularity shift and/or merge instruction(s) |
US20050188182A1 (en) | 1999-12-30 | 2005-08-25 | Texas Instruments Incorporated | Microprocessor having a set of byte intermingling instructions |
US6745319B1 (en) | 2000-02-18 | 2004-06-01 | Texas Instruments Incorporated | Microprocessor with instructions for shuffling and dealing data |
WO2001067235A2 (en) | 2000-03-08 | 2001-09-13 | Sun Microsystems, Inc. | Processing architecture having sub-word shuffling and opcode modification |
WO2001069938A1 (en) * | 2000-03-15 | 2001-09-20 | Digital Accelerator Corporation | Coding of digital video with high motion content |
US7155601B2 (en) | 2001-02-14 | 2006-12-26 | Intel Corporation | Multi-element operand sub-portion shuffle instruction execution |
KR100446235B1 (en) * | 2001-05-07 | 2004-08-30 | 엘지전자 주식회사 | Merging search method of motion vector using multi-candidates |
US7818356B2 (en) | 2001-10-29 | 2010-10-19 | Intel Corporation | Bitstream buffer manipulation with a SIMD merge instruction |
US7685212B2 (en) | 2001-10-29 | 2010-03-23 | Intel Corporation | Fast full search motion estimation with SIMD merge instruction |
US7272622B2 (en) | 2001-10-29 | 2007-09-18 | Intel Corporation | Method and apparatus for parallel shift right merge of data |
US7340495B2 (en) | 2001-10-29 | 2008-03-04 | Intel Corporation | Superior misaligned memory load and copy using merge hardware |
US6914938B2 (en) * | 2002-06-18 | 2005-07-05 | Motorola, Inc. | Interlaced video motion estimation |
2002
- 2002-10-25 US US10/280,612 patent/US7685212B2/en not_active Expired - Lifetime
- 2002-10-28 JP JP2003540797A patent/JP4623963B2/en not_active Expired - Lifetime
- 2002-10-28 DE DE10297000T patent/DE10297000B4/en not_active Expired - Lifetime
- 2002-10-28 CN CNB028132483A patent/CN1269027C/en not_active Expired - Lifetime
- 2002-10-28 RU RU2003137531/09A patent/RU2273044C2/en not_active IP Right Cessation
- 2002-10-28 WO PCT/US2002/034404 patent/WO2003038601A1/en active Application Filing
- 2002-10-28 KR KR1020037017220A patent/KR100602532B1/en not_active IP Right Cessation
2005
- 2005-02-16 HK HK05101290A patent/HK1068985A1/en not_active IP Right Cessation
2008
- 2008-07-28 JP JP2008193844A patent/JP4750157B2/en not_active Expired - Lifetime
Non-Patent Citations (3)
Title |
---|
"BIT-MANIPULATION FACILITY FOR A PARALLEL ARCHITECTURE", IBM TECHNICAL DISCLOSURE BULLETIN, IBM CORP. NEW YORK, US, vol. 34, no. 7A, 1 December 1991 (1991-12-01), pages 387 - 390, XP000255649, ISSN: 0018-8689 * |
LOH W L: "BEE:A SPECIAL-PURPOSE MACHINE FOR HARDWARE DESCRIPTION LANGUAGES", MICROPROCESSORS AND MICROSYSTEMS, IPC BUSINESS PRESS LTD. LONDON, GB, vol. 19, no. 5, 1 June 1995 (1995-06-01), pages 269 - 276, XP000589478, ISSN: 0141-9331 * |
PELEG A ET AL: "MMX TECHNOLOGY EXTENSION TO THE INTEL ARCHITECTURE", IEEE MICRO, IEEE INC. NEW YORK, US, vol. 16, no. 4, 1 August 1996 (1996-08-01), pages 42 - 50, XP000596512, ISSN: 0272-1732 * |
Cited By (36)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US9182987B2 (en) | 2001-10-29 | 2015-11-10 | Intel Corporation | Bitstream buffer manipulation with a SIMD merge instruction |
US7272622B2 (en) | 2001-10-29 | 2007-09-18 | Intel Corporation | Method and apparatus for parallel shift right merge of data |
US9170814B2 (en) | 2001-10-29 | 2015-10-27 | Intel Corporation | Bitstream buffer manipulation with a SIMD merge instruction |
US9170815B2 (en) | 2001-10-29 | 2015-10-27 | Intel Corporation | Bitstream buffer manipulation with a SIMD merge instruction |
US10732973B2 (en) | 2001-10-29 | 2020-08-04 | Intel Corporation | Processor to execute shift right merge instructions |
US7340495B2 (en) | 2001-10-29 | 2008-03-04 | Intel Corporation | Superior misaligned memory load and copy using merge hardware |
US7685212B2 (en) | 2001-10-29 | 2010-03-23 | Intel Corporation | Fast full search motion estimation with SIMD merge instruction |
US7818356B2 (en) | 2001-10-29 | 2010-10-19 | Intel Corporation | Bitstream buffer manipulation with a SIMD merge instruction |
US8214626B2 (en) | 2001-10-29 | 2012-07-03 | Intel Corporation | Method and apparatus for shuffling data |
US8225075B2 (en) | 2001-10-29 | 2012-07-17 | Intel Corporation | Method and apparatus for shuffling data |
US8510355B2 (en) | 2001-10-29 | 2013-08-13 | Intel Corporation | Bitstream buffer manipulation with a SIMD merge instruction |
US8688959B2 (en) | 2001-10-29 | 2014-04-01 | Intel Corporation | Method and apparatus for shuffling data |
US8745358B2 (en) | 2001-10-29 | 2014-06-03 | Intel Corporation | Processor to execute shift right merge instructions |
US8782377B2 (en) | 2001-10-29 | 2014-07-15 | Intel Corporation | Processor to execute shift right merge instructions |
US10152323B2 (en) | 2001-10-29 | 2018-12-11 | Intel Corporation | Method and apparatus for shuffling data |
US9152420B2 (en) | 2001-10-29 | 2015-10-06 | Intel Corporation | Bitstream buffer manipulation with a SIMD merge instruction |
US10146541B2 (en) | 2001-10-29 | 2018-12-04 | Intel Corporation | Processor to execute shift right merge instructions |
US9477472B2 (en) | 2001-10-29 | 2016-10-25 | Intel Corporation | Method and apparatus for shuffling data |
US9229718B2 (en) | 2001-10-29 | 2016-01-05 | Intel Corporation | Method and apparatus for shuffling data |
US9182988B2 (en) | 2001-10-29 | 2015-11-10 | Intel Corporation | Bitstream buffer manipulation with a SIMD merge instruction |
US9182985B2 (en) | 2001-10-29 | 2015-11-10 | Intel Corporation | Bitstream buffer manipulation with a SIMD merge instruction |
US9189238B2 (en) | 2001-10-29 | 2015-11-17 | Intel Corporation | Bitstream buffer manipulation with a SIMD merge instruction |
US9189237B2 (en) | 2001-10-29 | 2015-11-17 | Intel Corporation | Bitstream buffer manipulation with a SIMD merge instruction |
US9218184B2 (en) | 2001-10-29 | 2015-12-22 | Intel Corporation | Processor to execute shift right merge instructions |
US9229719B2 (en) | 2001-10-29 | 2016-01-05 | Intel Corporation | Method and apparatus for shuffling data |
WO2004040439A3 (en) * | 2002-10-25 | 2004-12-16 | Intel Corp | Method and apparatus for parallel shift right merge of data |
WO2004040439A2 (en) * | 2002-10-25 | 2004-05-13 | Intel Corporation | Method and apparatus for parallel shift right merge of data |
CN1297887C (en) * | 2003-11-28 | 2007-01-31 | 凌阳科技股份有限公司 | Processor and method for trans-boundary aligned multiple transient memory data |
JP2007528545A (en) * | 2004-03-10 | 2007-10-11 | アーム・リミテッド | Apparatus and method for inserting bits into a data word |
JP2015133132A (en) * | 2004-11-03 | 2015-07-23 | インテル コーポレイション | programmable data processing circuit that supports SIMD instruction |
US11720800B2 (en) | 2016-10-04 | 2023-08-08 | Magic Leap, Inc. | Efficient data layouts for convolutional neural networks |
US11360744B2 (en) * | 2017-06-29 | 2022-06-14 | Beijing Qingying Machine Visual Technology Co., Ltd. | Two-dimensional data matching method, device and logic circuit |
US11803377B2 (en) | 2017-09-08 | 2023-10-31 | Oracle International Corporation | Efficient direct convolution using SIMD instructions |
EP4202651A1 (en) * | 2021-12-23 | 2023-06-28 | INTEL Corporation | Apparatus and method for vector packed concatenate and shift of specific portions of quadwords |
WO2024003526A1 (en) * | 2022-06-30 | 2024-01-04 | Arm Limited | Vector extract and merge instruction |
GB2620381A (en) * | 2022-06-30 | 2024-01-10 | Advanced Risc Mach Ltd | Vector extract and merge instruction |
Also Published As
Publication number | Publication date |
---|---|
RU2003137531A (en) | 2005-05-27 |
KR20040038922A (en) | 2004-05-08 |
JP2005508043A (en) | 2005-03-24 |
JP2009009587A (en) | 2009-01-15 |
HK1068985A1 (en) | 2005-05-06 |
JP4750157B2 (en) | 2011-08-17 |
JP4623963B2 (en) | 2011-02-02 |
DE10297000T5 (en) | 2004-07-01 |
DE10297000B4 (en) | 2008-06-05 |
KR100602532B1 (en) | 2006-07-19 |
CN1269027C (en) | 2006-08-09 |
CN1522401A (en) | 2004-08-18 |
US20030123748A1 (en) | 2003-07-03 |
RU2273044C2 (en) | 2006-03-27 |
US7685212B2 (en) | 2010-03-23 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US7272622B2 (en) | | Method and apparatus for parallel shift right merge of data |
US7685212B2 (en) | | Fast full search motion estimation with SIMD merge instruction |
US10474466B2 (en) | | SIMD sign operation |
US7085795B2 (en) | | Apparatus and method for efficient filtering and convolution of content data |
US7430578B2 (en) | | Method and apparatus for performing multiply-add operations on packed byte data |
US7395298B2 (en) | | Method and apparatus for performing multiply-add operations on packed data |
RU2263947C2 (en) | | Integer-valued high order multiplication with truncation and shift in architecture with one commands flow and multiple data flows |
JP4064989B2 (en) | | Device for performing multiplication and addition of packed data |
US7539714B2 (en) | | Method, apparatus, and instruction for performing a sign operation that multiplies |
US7624138B2 (en) | | Method and apparatus for efficient integer transform |
US6574651B1 (en) | | Method and apparatus for arithmetic operation on vectored data |
US8463837B2 (en) | | Method and apparatus for efficient bi-linear interpolation and motion compensation |
Shahbahrami et al. | | Matrix register file and extended subwords: two techniques for embedded media processors |
Legal Events
Code | Title | Description |
---|---|---|
AK | Designated states | Kind code of ref document: A1; Designated state(s): AE AG AL AM AT AU AZ BA BB BG BR BY BZ CA CH CN CO CR CU CZ DE DK DM DZ EC EE ES FI GB GD GE GH GM HR HU ID IL IN IS JP KE KG KP KR KZ LC LK LR LS LT LU LV MA MD MG MK MN MW MX MZ NO NZ OM PH PL PT RO RU SD SE SG SI SK SL TJ TM TN TR TT TZ UA UG UZ VC VN YU ZA ZM ZW |
AL | Designated countries for regional patents | Kind code of ref document: A1; Designated state(s): GH GM KE LS MW MZ SD SL SZ TZ UG ZM ZW AM AZ BY KG KZ MD RU TJ TM AT BE BG CH CY CZ DE DK EE ES FI FR GB GR IE IT LU MC NL PT SE SK TR BF BJ CF CG CI CM GA GN GQ GW ML MR NE SN TD TG |
121 | EP: the EPO has been informed by WIPO that EP was designated in this application | |
DFPE | Request for preliminary examination filed prior to expiration of 19th month from priority date (PCT application filed before 20040101) | |
WWE | WIPO information: entry into national phase | Ref document number: 2159/DELNP/2003; Country of ref document: IN |
WWE | WIPO information: entry into national phase | Ref document number: 2003540797; Country of ref document: JP |
WWE | WIPO information: entry into national phase | Ref document number: 028132483; Country of ref document: CN; Ref document number: 1020037017220; Country of ref document: KR |
RET | DE translation (DE OG part 6b) | Ref document number: 10297000; Country of ref document: DE; Date of ref document: 20040701; Kind code of ref document: P |
WWE | WIPO information: entry into national phase | Ref document number: 10297000; Country of ref document: DE |
122 | EP: PCT application non-entry in European phase | |
REG | Reference to national code | Ref country code: DE; Ref legal event code: 8607 |
REG | Reference to national code | Ref country code: DE; Ref legal event code: 8607 |