USRE45458E1 - Dual function system and method for shuffling packed data elements - Google Patents

Dual function system and method for shuffling packed data elements

Info

Publication number
USRE45458E1
Authority
US
United States
Prior art keywords
data
operand
packed
data elements
register
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Lifetime
Application number
US10/104,205
Inventor
Patrice Roussel
Srinivas Chennupaty
Micheal D. Cranford
Mohammed A. Abdallah
James Coke
Katherine Kong
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Intel Corp
Original Assignee
Intel Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F15/00Digital computers in general; Data processing equipment in general
    • G06F15/76Architectures of general purpose stored program computers
    • G06F15/78Architectures of general purpose stored program computers comprising a single central processing unit
    • G06F15/7867Architectures of general purpose stored program computers comprising a single central processing unit with reconfigurable architecture
    • G06F15/7885Runtime interface, e.g. data exchange, runtime control
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/30003Arrangements for executing specific machine instructions
    • G06F9/30007Arrangements for executing specific machine instructions to perform operations on data operands
    • G06F9/30025Format conversion instructions, e.g. Floating-Point to Integer, decimal conversion
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/30003Arrangements for executing specific machine instructions
    • G06F9/30007Arrangements for executing specific machine instructions to perform operations on data operands
    • G06F9/30032Movement instructions, e.g. MOVE, SHIFT, ROTATE, SHUFFLE
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/30003Arrangements for executing specific machine instructions
    • G06F9/30007Arrangements for executing specific machine instructions to perform operations on data operands
    • G06F9/30036Instructions to perform operations on packed data, e.g. vector, tile or matrix operations

Definitions

  • the present invention relates in general to the field of computer systems, and in particular, to an apparatus and method for performing multidimensional computations based on a shuffle operation.
  • SIMD: Single Instruction, Multiple Data
  • One problem with rearranging the order of data elements within a register or memory word is the mechanism used to indicate how the data should be rearranged.
  • a mask or control word is used.
  • the control word must include enough bits to indicate which of the source data fields must be moved into each destination data field. For example, if a source operand has eight data fields, requiring three bits to designate any given data field, and the destination register has four data fields, (3×4) or 12 bits are required for the control word.
  • a full shuffle cannot be supported.
  • the present invention provides an apparatus and method for performing a shuffle operation on packed data using computer-implemented steps.
  • a first packed data operand having at least two data elements is accessed.
  • a second packed data operand having at least two data elements is accessed.
  • One of the data elements in the first packed data operand is shuffled into a lower destination field of a destination register, and one of the data elements in the second packed data operand is shuffled into an upper destination field of the destination register.
  • FIG. 1 illustrates an exemplary computer system in accordance with one embodiment of the invention
  • FIG. 2 illustrates the operation of the move instruction in accordance with one embodiment of the invention
  • FIG. 3 illustrates the operation of the shuffle instruction in accordance with one embodiment of the invention
  • FIG. 4 illustrates the operation of the inter-add instruction in accordance with one embodiment of the invention
  • FIG. 5 illustrates the operation of the shuffle instruction in accordance with one embodiment of the invention
  • FIGS. 6a and 6b illustrate the operation of the shuffle instruction in accordance with one embodiment of the invention
  • FIG. 7 is a general block diagram illustrating the usage of a digital filter which utilizes shuffle operations for filtering a TV broadcast signal in accordance with one embodiment of the invention
  • FIG. 8 is a general block diagram illustrating the use of shuffle operations in rendering graphical objects in animation.
  • the present invention provides a way to reorganize the order of data elements where less than the full number of bits is available for a control register.
  • a method and apparatus are described for moving data elements in a packed data operand (a shuffle operation).
  • the shuffle operation allows shuffling of certain-sized data into any combination from two source registers or memory into a destination register.
  • the destination register may be the same as a source register.
  • the shuffle instruction is useful in data reorganization and in moving data into different locations of the register to allow, for example, extra storage for scalar operations, or for facilitating the conversion between data formats such as from packed integer to packed floating point and vice versa.
  • registers is used herein to refer to the on-board processor storage locations that are used as part of macroinstructions to identify operands.
  • the registers referred to herein are those that are visible from the outside of the processor (from a programmer's perspective).
  • the registers described herein can be implemented by circuitry within a processor using any number of different techniques, such as dedicated physical registers, dynamically allocated physical registers using register renaming, combinations of dedicated and dynamically allocated physical registers, etc.
  • FIG. 1 illustrates one embodiment of a computer system 100 which implements the principles of the present invention.
  • Computer system 100 comprises a processor 105, a storage device 110, and a bus 115.
  • the processor 105 is coupled to the storage device 110 by the bus 115.
  • a number of user input/output devices, such as a keyboard 120 and a display 125, are also coupled to the bus 115.
  • the processor 105 represents a central processing unit of any type of architecture, such as CISC, RISC, VLIW, or hybrid architecture.
  • the processor 105 could be implemented on one or more chips.
  • the storage device 110 represents one or more mechanisms for storing data.
  • the storage device 110 may include read only memory (ROM), random access memory (RAM), magnetic disk storage mediums, optical storage mediums, flash memory devices, and/or other machine-readable mediums.
  • the bus 115 represents one or more buses (e.g., AGP, PCI, ISA, X-Bus, VESA, etc.) and bridges (also termed as bus controllers). While this embodiment is described in relation to a single processor computer system, the invention could be implemented in a multi-processor computer system. In addition, while this embodiment is described in relation to a 64-bit computer system, the invention is not limited to a 64-bit computer system.
  • In addition to other devices, one or more of a network 130, a TV broadcast signal receiver 131, a fax/modem 132, a digitizing unit 133, a sound unit 134, and a graphics unit 135 may optionally be coupled to bus 115.
  • the network 130 and fax/modem 132 represent one or more network connections for transmitting data over a machine-readable medium (e.g., carrier waves).
  • the digitizing unit 133 represents one or more devices for digitizing images (e.g., a scanner, camera, etc.).
  • the sound unit 134 represents one or more devices for inputting and/or outputting sound (e.g., microphones, speakers, magnetic storage devices, optical storage devices, etc.).
  • the graphics unit 135 represents one or more devices for generating 3-D images (e.g., graphics card).
  • FIG. 1 also illustrates that the storage device 110 has stored therein data 135 and software 136 .
  • Data 135 represents data stored in one or more of the formats described herein.
  • Software 136 represents the necessary code for performing any and/or all of the techniques described with reference to FIGS. 3-6. It will be recognized by one of ordinary skill in the art that the storage device 110 preferably contains additional software (not shown), which is not necessary to understanding the invention.
  • FIG. 1 additionally illustrates that the processor 105 includes a decode unit 140, a set of registers 141, an execution unit 142, and an internal bus 143 for executing instructions. It will be recognized by one of ordinary skill in the art that the processor 105 contains additional circuitry, which is not necessary to understanding the invention.
  • the decode unit 140, registers 141 and execution unit 142 are coupled together by internal bus 143.
  • the decode unit 140 is used for decoding instructions received by processor 105 into control signals and/or microcode entry points. In response to these control signals and/or microcode entry points, the execution unit 142 performs the appropriate operations.
  • the decode unit 140 may be implemented using any number of different mechanisms (e.g., a look-up table, a hardware implementation, a PLA, etc.). While the decoding of the various instructions is represented herein by a series of if/then statements, it is understood that the execution of an instruction does not require a serial processing of these if/then statements. Rather, any mechanism for logically performing this if/then processing is considered to be within the scope of the implementation of the invention.
  • the decode unit 140 is shown including packed data instruction set 145 for performing operations on packed data.
  • the packed data instruction set 145 includes the following instructions: a move instruction(s) 150, a shuffle instruction(s) 155, an add instruction(s) (such as ADDPS) 160, and a multiply instruction(s) 165.
  • the MOVAPS, SHUFPS and ADDPS instructions are applicable to packed floating point data, in which the results of an operation between two sets of numbers having a predetermined number of bits, are stored in a register having the same predetermined number of bits, i.e., the size or configuration of the operand is the same as that of the result register.
  • the operation of each of these instructions is further described herein. While one embodiment is described in which the packed data instructions operate on floating point data, alternative embodiments could alternatively or additionally have similar instructions that operate on integer data.
  • processor 105 can include new instructions and/or instructions similar to or the same as those found in existing general purpose processors.
  • the processor 105 supports an instruction set which is compatible with the Intel® Architecture instruction set used by existing processors, such as the Pentium® II processor.
  • Alternative embodiments of the invention may contain more or fewer, as well as different, packed data instructions and still utilize the teachings of the invention.
  • the registers 141 represent a storage area on processor 105 for storing information, including control/status information, integer data, floating point data, and packed data. It will be understood by one of ordinary skill in the art that one aspect of the invention is the described instruction set for operating on packed data. According to this aspect of the invention, the storage area used for storing the packed data is not critical.
  • the term data processing system is used herein to refer to any machine for processing data, including the computer system(s) described with reference to FIG. 1.
  • packed data can be operated on in one of three formats: a “packed byte” format (e.g., PADDb), a “packed word” format (e.g., PADDw), or a “packed double word” (dword) format (e.g., PADDd).
  • the packed byte format includes eight separate 8-bit data elements; the packed word format includes four separate 16-bit data elements; the packed dword format includes two separate 32-bit data elements. While certain instructions are discussed below with reference to one or two packed data formats, the instructions may be similarly applied to the other packed data formats of the invention.
  • FIG. 2 illustrates the operation of the move instruction 150 according to one embodiment of the invention.
  • the move instruction 150 moves bits of data from one register to another register or from one memory location to another.
  • 64 bits representing four packed words are moved from one memory location to another or from one register to another.
  • FIG. 3 illustrates the operation of the shuffle instruction 155 according to one embodiment of the invention.
  • the shuffle instruction 155 (SHUFPS) is able to shuffle any one of a plurality (e.g., four) single floating point (FP) numbers from a first operand 310 to the lower two destination fields of a destination register 330; the upper two destination fields are generated from a shuffle of any one of a plurality (e.g., four) single FP numbers from a second operand 320.
  • FIG. 4 illustrates the operation of the packed vertical add instruction 160 according to one embodiment of the invention.
  • the packed vertical operation is the add instruction (ADDPS) 160, which operates on the data elements of a first operand 410 and a second operand 420.
  • the data elements of a first operand 410 are added to the respective packed data elements of a second operand 420, and are used to generate a result 430.
  • data element 0 of the first operand 410 is added to data element 0 of the second operand 420 and the result is stored as data element 0 of the result 430.
  • the packed multiply instruction acts in a similar manner to the packed add instruction, except multiply operations are performed.
  • FIG. 5 illustrates a technique for performing a shuffle operation on two numbers according to one embodiment of the invention.
  • data is represented by ovals, while instructions are represented by rectangles.
  • the process S500 proceeds to process step S510, where numbers X0, X1, X2 and X3 are stored as data elements in a packed data item 515.
  • each data element is 16-bits wide and is contained in register X0, in the following order:
  • each data element is 16-bits wide and is contained in register X1, in the following order:
  • the process S500 then advances to process step S530, where a shuffle instruction is performed on the contents of register X0 (data item 515) and register X1 (data item 525) to shuffle any one of the four data elements from the first data item 515 to the lower two fields of a destination register 535, and to shuffle any one of the four data elements from the second data item 525 to the upper two fields of the destination register 535.
  • the resulting data item 535 is as follows:
  • FIG. 5 illustrates an example of the shuffle operation with data operands having four data elements
  • the principles of the invention may also be implemented in data operands having at least two elements.
  • An 8-bit immediate value is used as a control word to indicate how data elements should be shuffled.
  • Bits 0, 1 of the control word indicate which of the four data elements in the first operand is shuffled into the first or lowest data element of the destination register.
  • Bits 2, 3 of the control word indicate which of the four data elements in the first operand is shuffled into the second data element of the destination register.
  • Bits 4, 5 of the control word indicate which of the four data elements in the second operand is shuffled into the third data element of the destination register.
  • Bits 6, 7 of the control word indicate which of the four data elements in the second operand is shuffled into the fourth data element of the destination register.
  • the size of the shuffle control word may vary without loss of compatibility with the present invention, depending on the number of data elements in the source data operand and the number of fields in the destination register.
  • FIG. 6a illustrates a schematic for performing a shuffle operation on two numbers according to one embodiment of the invention.
  • the device 600 reads the contents of a first source packed data operand 605 and a second source packed data operand 610.
  • a four to one data multiplexer 615 shuffles any one of data elements {D,C,B,A} from either of data operands 605, 610 into the lower field of destination data item 625.
  • a four to one data multiplexer 620 shuffles any one of data elements {D,C,B,A} from either of data operands 605, 610 into the upper field of destination data item 625.
  • FIG. 6b illustrates a diagram for performing a shuffle operation on two numbers according to one embodiment of the invention.
  • the device 630 reads the contents of a first source packed data operand 635. Any one of data elements {D,C,B,A} from the data operand 635 are shuffled into the lower two fields of destination data item 645.
  • the device 630 then reads the contents of a second source packed data operand 640. Any one of data elements {H,G,F,E} from the data operand 640 are shuffled into the upper two fields of destination data item 645.
  • the first source data operand 635 may be the same as the second source data operand 640. This method of shuffling may be performed with only an 8-bit control word.
  • FIGS. 6a and 6b illustrate an example of the shuffle operation with data operands having two data elements
  • the principles of the invention may also be implemented in data operands having more than two elements.
  • FIG. 7 is a general block diagram illustrating the use of a digital filter which utilizes a shuffle operation for filtering a TV broadcast signal according to one embodiment of the invention.
  • FIG. 7 shows TV broadcast signals 703 representing a television broadcast being received by a receiving unit 706 of a computer system 700.
  • the receiving unit 706 receives the TV broadcast signals 703 and transforms them into digital data 709.
  • a digital filter unit 715 performs a digital filter (e.g., FIR, IIR, etc.) on the digital data 709 using a set of coefficients 712.
  • the digital filter unit 715 generates filtered data 718 (also termed as “filtered data items”) representing the filtered analog TV broadcast signals.
  • shuffle operations are implemented.
  • the filtered data 718 are received by a video decoder 721 for conversion into audio and video data 724.
  • the techniques performed by video decoder 721 are well known (see Jack, Smith, Keith, “NTSC/PAL Digital Decoder”, Video Demystified, High Text Publications, Inc., 1993).
  • the audio and video data can be used for any purpose (e.g., display on a screen).
  • the computer system 100 shown in FIG. 1 is used to implement the computer system 700 in FIG. 7.
  • the TV broadcast signal receiver 131 acts as the receiving unit 706 and may include a TV tuner, an analog to digital converter, and a DMA channel.
  • the TV broadcast signals 703 are received by the TV tuner, converted into digital data by the analog to digital converter, and then stored in the storage device 110 by the DMA channel. It will be recognized by one of ordinary skill in the art that the digital data stored by the TV broadcast signal receiver 131 may be stored in any number of formats.
  • the TV broadcast signal receiver 131 may store the data in the main memory in one or more of the formats described herein, storing two representations of each of the components of the data such that it may be read in as packed data items in the described formats. This data may then be accessed as packed data and copied into registers on the processor 105. Since the data is stored in the disclosed formats, the processor 105 can easily and efficiently perform the shuffle operation as described with reference to FIG. 5 and FIG. 6. It will be recognized by one of ordinary skill in the art that the receiving unit 706 may encompass additional hardware, software, and/or firmware in the TV broadcast signal receiver 131 or software executing on the processor 105. For example, additional software may be stored in the storage device 110 for further processing the data prior to the digital filter being performed.
  • the digital filter unit 715 is implemented using the processor 105 and the software 136 to perform the digital filter.
  • the processor 105, executing the software 136, performs the digital filter using shuffle operations, and stores the filtered data 718 in storage device 110.
  • the digital filter is performed by the host processor of the computer system, rather than the TV broadcast signal receiver 131.
  • the video decoder 721 may be implemented in any number of different combinations of hardware, software, and/or firmware.
  • the audio and video data 724 can then be stored and/or displayed on the display 125 and the sound unit 134, respectively.
  • FIG. 8 is a general block diagram illustrating the use of a shuffle operation for rendering graphical objects in animation according to one embodiment of the invention.
  • FIG. 8 shows a computer system 800 containing digital data 810 representing 3-dimensional (3D) graphics.
  • the digital data 810 may be stored on a CD ROM or other type of storage device for later use.
  • the conversion unit 820 performs alteration of data using 3D geometry which includes the use of a shuffle operation to manipulate (e.g., scale, rotate, etc.) a 3D object in providing animation.
  • the resulting graphical object 830 is then displayed on a screen display 840.
  • the resulting graphical object may also be transmitted to a recording device (e.g., magnetic storage, such as tape).
  • the computer system 100 shown in FIG. 1 is used to perform the graphics operation 800 from FIG. 8.
  • the digital data 810 from FIG. 8 is any data stored in the storage device 110 representing 3D graphics.
  • the conversion unit 820 from FIG. 8 is implemented using the processor 105 and the software 136 to alter data using 3D geometry.
  • An example of such alteration of data includes the performance of a 3D transformation.
  • the processor 105, executing the software 136, performs the transformation and stores the transformed data 830 in the storage device 110 and/or provides the transformed data to the graphics unit 135. In this manner, the 3D manipulation performed by the host processor of the computer system is provided at an increased speed.
  • the present invention thus facilitates the performance of a shuffle operation through the use of available instruction sequences.
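
The 8-bit control word described above for the four-element case can be modeled in a few lines. The Python sketch below is an illustration under stated assumptions, not the patented implementation: the function names `decode_control` and `shuffle4` are hypothetical, operands are modeled as lists with index 0 as the lowest data element, and the labels Y0 to Y3 for the second operand are placeholders.

```python
def decode_control(imm8: int):
    """Split an 8-bit control word into four 2-bit selectors.

    Bits 0-1 select for destination field 0, bits 2-3 for field 1,
    bits 4-5 for field 2, and bits 6-7 for field 3.
    """
    if not 0 <= imm8 <= 0xFF:
        raise ValueError("control word must fit in 8 bits")
    return [(imm8 >> (2 * field)) & 0b11 for field in range(4)]


def shuffle4(first, second, imm8):
    """Hypothetical model: fields 0-1 read from `first`, fields 2-3 from `second`."""
    s0, s1, s2, s3 = decode_control(imm8)
    return [first[s0], first[s1], second[s2], second[s3]]


x = ["X0", "X1", "X2", "X3"]
y = ["Y0", "Y1", "Y2", "Y3"]  # placeholder labels, assumed for illustration

# 0x44 = 0b01_00_01_00: selectors 0, 1, 0, 1.
print(shuffle4(x, y, 0x44))  # ['X0', 'X1', 'Y0', 'Y1']
# Same operand as both sources: 0x1B = 0b00_01_10_11 reverses the elements.
print(shuffle4(x, x, 0x1B))  # ['X3', 'X2', 'X1', 'X0']
```

Passing the same operand as both sources, which the text allows, yields an arbitrary rearrangement of one operand with the same 8 bits.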

Abstract

An apparatus and method for performing a shuffle operation on packed data using computer-implemented steps is described. In one embodiment, a first packed data operand having at least two data elements is accessed. A second packed data operand having at least two data elements is accessed. One of the data elements in the first packed data operand is shuffled into a lower destination field of a destination register, and one of the data elements in the second packed data operand is shuffled into an upper destination field of the destination register.

Description

More than one reissue application has been filed for the reissue of U.S. Pat. No. 6,041,404, which is hereby incorporated by reference in its entirety. The reissue applications are application Ser. No. 10/104,205 (the present and parent reissue application) and Ser. No. 14/283,020 which is a reissue continuation of application Ser. No. 10/104,205.
FIELD OF THE INVENTION
The present invention relates in general to the field of computer systems, and in particular, to an apparatus and method for performing multidimensional computations based on a shuffle operation.
BACKGROUND OF THE INVENTION
To improve the efficiency of multimedia applications, as well as other applications with similar characteristics, a Single Instruction, Multiple Data (SIMD) architecture has been implemented in computer systems to enable one instruction to operate on several data elements simultaneously, rather than on a single data element. In particular, SIMD architectures take advantage of packing many data elements within one register or memory location. With parallel hardware execution, multiple operations can be performed with one instruction, resulting in significant performance improvement.
Although many applications currently in use can take advantage of such operations, known as vertical operations, a number of important applications require the data elements to be rearranged before vertical operations can be applied. Examples include the dot product and matrix multiplication operations, which are commonly used in 3-D graphics and signal processing applications.
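
To see why rearrangement is needed, consider a four-element dot product. A vertical multiply produces the four products in one step, but summing them requires adding elements that sit in different positions of the same packed operand. The Python sketch below models packed operands as lists (index 0 is the lowest element). The helper names `vertical_mul`, `vertical_add`, `rearrange`, and `dot4` are assumptions for illustration, and the two-step reduction is one common approach rather than a sequence taken from this document.

```python
def vertical_mul(a, b):
    """Element-wise multiply: field i of a times field i of b."""
    return [x * y for x, y in zip(a, b)]


def vertical_add(a, b):
    """Element-wise add: field i of a plus field i of b."""
    return [x + y for x, y in zip(a, b)]


def rearrange(v, order):
    """Stand-in for a shuffle: destination field i takes v[order[i]]."""
    return [v[i] for i in order]


def dot4(a, b):
    p = vertical_mul(a, b)                           # four products
    s = vertical_add(p, rearrange(p, [1, 0, 3, 2]))  # pairwise sums
    t = vertical_add(s, rearrange(s, [2, 3, 0, 1]))  # full sum in every field
    return t[0]


print(dot4([1, 2, 3, 4], [5, 6, 7, 8]))  # 70
```

Without the two rearrangement steps, the vertical add could only combine elements that already share a position.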
One problem with rearranging the order of data elements within a register or memory word is the mechanism used to indicate how the data should be rearranged. Typically, a mask or control word is used. The control word must include enough bits to indicate which of the source data fields must be moved into each destination data field. For example, if a source operand has eight data fields, requiring three bits to designate any given data field, and the destination register has four data fields, (3×4) or 12 bits are required for the control word. However, on a processor implementation where there are fewer than 12 bits available for the control register, a full shuffle cannot be supported.
Therefore, there is a need for a way to reorganize the order of data elements where less than the full number of bits is available for a control register.
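
The bit-count argument above can be checked with a short calculation. The following Python sketch is illustrative only: the function name `control_word_bits` and its parameters are assumptions introduced here, and it assumes every destination field gets its own selector wide enough to address every candidate source field.

```python
def control_word_bits(source_fields: int, dest_fields: int) -> int:
    """Bits needed for a full shuffle: one selector per destination field.

    Hypothetical helper (not from the patent text). Each selector must be
    wide enough to name any one of the source fields.
    """
    if source_fields < 1 or dest_fields < 1:
        raise ValueError("field counts must be positive")
    # Number of bits needed to encode indices 0 .. source_fields - 1.
    bits_per_selector = (source_fields - 1).bit_length()
    return dest_fields * bits_per_selector


# The example from the text: eight source fields, four destination fields.
print(control_word_bits(8, 4))  # 12
# Four candidate source fields per destination field fit in 8 bits.
print(control_word_bits(4, 4))  # 8
```

Under this model, limiting each destination field to four candidate source fields instead of eight is what reduces the control word from 12 bits to 8.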
SUMMARY OF THE INVENTION
The present invention provides an apparatus and method for performing a shuffle operation on packed data using computer-implemented steps. In one embodiment, a first packed data operand having at least two data elements is accessed. A second packed data operand having at least two data elements is accessed. One of the data elements in the first packed data operand is shuffled into a lower destination field of a destination register, and one of the data elements in the second packed data operand is shuffled into an upper destination field of the destination register.
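
The data path summarized above can be sketched for operands of any even width. This is a minimal Python model under stated assumptions: the name `dual_source_shuffle` and the list-of-selectors interface are hypothetical, and packed operands are represented as lists with index 0 as the lowest field.

```python
def dual_source_shuffle(first, second, selectors):
    """Model of a two-operand shuffle (names are hypothetical).

    first, second: packed operands as lists, index 0 = lowest element.
    selectors: one source index per destination field; the lower half of
    the destination reads from `first`, the upper half from `second`.
    """
    n = len(selectors)
    if n % 2 != 0:
        raise ValueError("destination must have an even number of fields")
    half = n // 2
    dest = []
    for field, sel in enumerate(selectors):
        source = first if field < half else second
        dest.append(source[sel])
    return dest


# Two-element operands, the minimal case in the summary.
print(dual_source_shuffle(["A", "B"], ["E", "F"], [1, 0]))  # ['B', 'E']
```

Because each destination field chooses only among the elements of one operand, each selector needs fewer bits than it would if it could address both operands.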
BRIEF DESCRIPTION OF THE DRAWINGS
The present invention is illustrated by way of example and may be better understood by referring to the following description in conjunction with the accompanying drawings, in which like references indicate similar elements and in which:
FIG. 1 illustrates an exemplary computer system in accordance with one embodiment of the invention;
FIG. 2 illustrates the operation of the move instruction in accordance with one embodiment of the invention;
FIG. 3 illustrates the operation of the shuffle instruction in accordance with one embodiment of the invention;
FIG. 4 illustrates the operation of the inter-add instruction in accordance with one embodiment of the invention;
FIG. 5 illustrates the operation of the shuffle instruction in accordance with one embodiment of the invention;
FIGS. 6a and 6b illustrate the operation of the shuffle instruction in accordance with one embodiment of the invention;
FIG. 7 is a general block diagram illustrating the usage of a digital filter which utilizes shuffle operations for filtering a TV broadcast signal in accordance with one embodiment of the invention; and
FIG. 8 is a general block diagram illustrating the use of shuffle operations in rendering graphical objects in animation.
DETAILED DESCRIPTION OF A PREFERRED EMBODIMENT OF THE PRESENT INVENTION
In the following description, numerous specific details are set forth to provide a thorough understanding of the invention. However, it will be understood by one of ordinary skill in the art that the invention may be practiced without these specific details. In other instances, well-known circuits, structures and techniques have not been shown in detail in order not to obscure the invention.
The present invention provides a way to reorganize the order of data elements where less than the full number of bits is available for a control register. According to one aspect of the invention, a method and apparatus are described for moving data elements in a packed data operand (a shuffle operation). The shuffle operation allows shuffling of certain-sized data into any combination from two source registers or memory into a destination register. The destination register may be the same as a source register. The shuffle instruction is useful in data reorganization and in moving data into different locations of the register to allow, for example, extra storage for scalar operations, or for facilitating the conversion between data formats such as from packed integer to packed floating point and vice versa.
The term “registers” is used herein to refer to the on-board processor storage locations that are used as part of macroinstructions to identify operands. In other words, the registers referred to herein are those that are visible from the outside of the processor (from a programmer's perspective). However, the registers described herein can be implemented by circuitry within a processor using any number of different techniques, such as dedicated physical registers, dynamically allocated physical registers using register renaming, combinations of dedicated and dynamically allocated physical registers, etc.
COMPUTER SYSTEM
FIG. 1 illustrates one embodiment of a computer system 100 which implements the principles of the present invention. Computer system 100 comprises a processor 105, a storage device 110, and a bus 115. The processor 105 is coupled to the storage device 110 by the bus 115. In addition, a number of user input/output devices, such as a keyboard 120 and a display 125, are also coupled to the bus 115. The processor 105 represents a central processing unit of any type of architecture, such as CISC, RISC, VLIW, or hybrid architecture. In addition, the processor 105 could be implemented on one or more chips. The storage device 110 represents one or more mechanisms for storing data. For example, the storage device 110 may include read only memory (ROM), random access memory (RAM), magnetic disk storage mediums, optical storage mediums, flash memory devices, and/or other machine-readable mediums. The bus 115 represents one or more buses (e.g., AGP, PCI, ISA, X-Bus, VESA, etc.) and bridges (also termed as bus controllers). While this embodiment is described in relation to a single processor computer system, the invention could be implemented in a multi-processor computer system. In addition, while this embodiment is described in relation to a 64-bit computer system, the invention is not limited to a 64-bit computer system.
In addition to other devices, one or more of a network 130, a TV broadcast signal receiver 131, a fax/modem 132, a digitizing unit 133, a sound unit 134, and a graphics unit 135 may optionally be coupled to bus 115. The network 130 and fax/modem 132 represent one or more network connections for transmitting data over a machine readable medium (e.g., carrier waves). The digitizing unit 133 represents one or more devices for digitizing images (e.g., a scanner, camera, etc.). The sound unit 134 represents one or more devices for inputting and/or outputting sound (e.g., microphones, speakers, magnetic storage devices, optical storage devices, etc.). The graphics unit 135 represents one or more devices for generating 3-D images (e.g., graphics card).
FIG. 1 also illustrates that the storage device 110 has stored therein data 135 and software 136. Data 135 represents data stored in one or more of the formats described herein. Software 136 represents the necessary code for performing any and/or all of the techniques described with reference to FIGS. 3-6. It will be recognized by one of ordinary skill in the art that the storage device 110 preferably contains additional software (not shown), which is not necessary to understanding the invention.
FIG. 1 additionally illustrates that the processor 105 includes decode unit 140, a set of registers 141, and execution unit 142, and an internal bus 143 for executing instructions. It will be recognized by one of ordinary skill in the art that the processor 105 contains additional circuitry, which is not necessary to understanding the invention. The decode unit 140, registers 141 and execution unit 142 are coupled together by internal bus 143. The decode unit 140 is used for decoding instructions received by processor 105 into control signals and/or microcode entry points. In response to these control signals and/or microcode entry points, the execution unit 142 performs the appropriate operations. The decode unit 140 may be implemented using any number of different mechanisms (e.g., a look-up table, a hardware implementation, a PLA, etc.). While the decoding of the various instructions is represented herein by a series of if/then statements, it is understood that the execution of an instruction does not require a serial processing of these if/then statements. Rather, any mechanism for logically performing this if/then processing is considered to be within the scope of the implementation of the invention.
The decode unit 140 is shown including packed data instruction set 145 for performing operations on packed data. In one embodiment, the packed data instruction set 145 includes the following instructions: a move instruction(s) 150, a shuffle instruction(s) 155, an add instruction(s) (such as ADDPS) 160, and a multiply instruction(s) 165. The MOVAPS, SHUFPS and ADDPS instructions are applicable to packed floating point data, in which the results of an operation between two sets of numbers having a predetermined number of bits, are stored in a register having the same predetermined number of bits, i.e., the size or configuration of the operand is the same as that of the result register. The operation of each of these instructions is further described herein. While one embodiment is described in which the packed data instructions operate on floating point data, alternative embodiments could alternatively or additionally have similar instructions that operate on integer data.
In addition to the packed data instructions, processor 105 can include new instructions and/or instructions similar to or the same as those found in existing general purpose processors. For example, in one embodiment the processor 105 supports an instruction set which is compatible with the Intel® Architecture instruction set used by existing processors, such as the Pentium® II processor. Alternative embodiments of the invention may contain more or less, as well as different, packed data instructions and still utilize the teachings of the invention.
The registers 141 represent a storage area on processor 105 for storing information, including control/status information, integer data, floating point data, and packed data. It will be understood by one of ordinary skill in the art that one aspect of the invention is the described instruction set for operating on packed data. According to this aspect of the invention, the storage area used for storing the packed data is not critical. The term data processing system is used herein to refer to any machine for processing data, including the computer system(s) described with reference to FIG. 1.
While one embodiment of the invention is described in which the processor 105, executing the packed data instructions, operates on 128-bit packed data operands containing four 32-bit single precision floating point values, the processor 105 can operate on packed data in several different packed data formats. For example, in one embodiment, packed data can be operated on in one of three formats: a “packed byte” format (e.g., PADDb), a “packed word” format (e.g., PADDw), or a “packed double word” (dword) format (e.g., PADDd). The packed byte format includes eight separate 8-bit data elements; the packed word format includes four separate 16-bit data elements; the packed dword format includes two separate 32-bit data elements. While certain instructions are discussed below with reference to one or two packed data formats, the instructions may be similarly applied to the other packed data formats of the invention.
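By way of illustration only, the three 64-bit packed formats named above may be modeled in software as follows. This is a minimal sketch and not part of the described embodiment: the helper name `unpack_packed`, the little-endian byte order, and the sample value are assumptions introduced for the example.

```python
import struct

def unpack_packed(value64, fmt):
    """Split a 64-bit packed value into its data elements, element 0 (lowest) first.

    fmt is a struct code chosen for this sketch:
    'B' = packed byte, 'H' = packed word, 'I' = packed dword.
    """
    raw = struct.pack("<Q", value64)             # 8 bytes, little-endian (assumed)
    count = 8 // struct.calcsize("<" + fmt)      # number of elements in 64 bits
    return list(struct.unpack("<%d%s" % (count, fmt), raw))

value = 0x0807060504030201                       # arbitrary sample operand
packed_bytes = unpack_packed(value, "B")         # eight separate 8-bit elements
packed_words = unpack_packed(value, "H")         # four separate 16-bit elements
packed_dwords = unpack_packed(value, "I")        # two separate 32-bit elements
```

The same 64 bits are thus interpreted as eight, four, or two data elements depending on the format selected.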
The shuffle instruction of the present invention is part of a family of many different instructions which operate with SIMD architecture. For example, FIG. 2 illustrates the operation of the move instruction 150 according to one embodiment of the invention. In this example, the move instruction 150 (MOVAPS) moves bits of data from one register to another register or from one memory location to another. In one embodiment, 64-bits representing four packed words are moved from one memory location to another or from one register to another.
FIG. 3 illustrates the operation of the shuffle instruction 155 according to one embodiment of the invention. In one embodiment, the shuffle instruction 155 (SHUFPS) is able to shuffle any one of a plurality (e.g., four) single floating point (FP) numbers from a first operand 310 to the lower two destination fields of a destination register 330; the upper two destination fields are generated from a shuffle of any one of a plurality (e.g., four) single FP numbers from a second operand 320.
FIG. 4 illustrates the operation of the packed vertical add instruction 160 according to one embodiment of the invention. In one embodiment, the packed vertical operation is the add instruction (ADDPS) 160, which operates on the data elements of a first operand 410 and a second operand 420. In particular, the data elements of a first operand 410 are added to the respective packed data elements of a second operand 420, and are used to generate a result 430. For example, data element 0 of the first operand 410 is added to data element 0 of the second operand 420 and the result is stored as data element 0 of the result 430. The packed multiply instruction acts in a similar manner to the packed add instruction, except multiply operations are performed.
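The packed vertical operations described above may be sketched in software as element-wise operations on equal-length lists. The function names `packed_add` and `packed_mul` and the sample operands are hypothetical and are used only for this illustration; each list holds data element 0 first.

```python
def packed_add(first, second):
    """Add data element i of the first operand to data element i of the second."""
    if len(first) != len(second):
        raise ValueError("operands must hold the same number of data elements")
    return [a + b for a, b in zip(first, second)]

def packed_mul(first, second):
    """Same element pairing as packed_add, but a multiply is performed."""
    if len(first) != len(second):
        raise ValueError("operands must hold the same number of data elements")
    return [a * b for a, b in zip(first, second)]

operand_410 = [1.0, 2.0, 3.0, 4.0]       # sample values, element 0 first
operand_420 = [10.0, 20.0, 30.0, 40.0]
result_430 = packed_add(operand_410, operand_420)
product = packed_mul(operand_410, operand_420)
```

No data element of the result depends on a different element position of either operand, which is why a separate shuffle operation is needed to move data between positions.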
SHUFFLE OPERATION
FIG. 5 illustrates a technique for performing a shuffle operation on two numbers according to one embodiment of the invention. In this application, data is represented by ovals, while instructions are represented by rectangles. Beginning from a start state, the process S500 proceeds to process step S510, where numbers X0, X1, X2 and X3 are stored as data elements in a packed data item 515. For present discussion purposes, each data element is 16-bits wide and is contained in register X0, in the following order:
|X3|X2|X1|X0|
The process S500 then proceeds to process step S520, where numbers Y0, Y1, Y2 and Y3 are stored as data elements in a packed data item 525. For present discussion purposes, each data element is 16-bits wide and is contained in register X1, in the following order:
|Y3|Y2|Y1|Y0|
The process S500 then advances to process step S530, where a shuffle instruction is performed on the contents of register X0 (data item 515) and register X1 (data item 525) to shuffle any one of the four data elements from the first data item 515 to the lower two fields of a destination register 535, and to shuffle any one of the four data elements from the second data item 525 to the upper two fields of the destination register 535. The resulting data item 535 is as follows:
|{Y3, Y2, Y1, Y0}|{Y3, Y2, Y1, Y0}|{X3, X2, X1, X0}|{X3, X2, X1, X0}|
Accordingly, a shuffle operation is performed. Although FIG. 5 illustrates an example of the shuffle operation with data operands having four data elements, the principles of the invention may also be implemented in data operands having at least two elements.
An 8-bit immediate value is used as a control word to indicate how data elements should be shuffled. Bits 0,1 of the control word indicate which of the four data elements in the first operand are shuffled into the first or lowest data element of the destination register. Bits 2,3 of the control word indicate which of the four data elements in the first operand are shuffled into the second data element of the destination register. Bits 4,5 of the control word indicate which of the four data elements in the second operand are shuffled into the third data element of the destination register. Bits 6,7 of the control word indicate which of the four data elements in the second operand are shuffled into the fourth data element of the destination register. For example, given a first data operand with four data elements contained in the following order:
|D|C|B|A|
and also given a second data operand with four data elements contained in the following order:
|H|G|F|E|
and also given a shuffle control word of 10001111, the result of the shuffle is as follows:
|G|E|D|D|
It will be recognized by one of ordinary skill in the art that the size of the shuffle control word may vary, without loss of compatibility with the present invention, depending on the number of data elements in the source data operand and the number of fields in the destination register.
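The control word decoding described above, for the case of four data elements per operand and an 8-bit immediate, may be sketched as follows. This is a software model under stated assumptions rather than the hardware of the embodiment: the function name `shuffle` is hypothetical, and each operand is represented as a list with data element 0 (the lowest) first, so that |D|C|B|A| becomes ["A", "B", "C", "D"].

```python
def shuffle(first, second, control):
    """Model of the four-element shuffle driven by an 8-bit control word."""
    if len(first) != 4 or len(second) != 4:
        raise ValueError("this sketch assumes four data elements per operand")
    if not 0 <= control <= 0xFF:
        raise ValueError("control word must fit in 8 bits")
    # Bits 0,1 / 2,3 / 4,5 / 6,7 form one 2-bit selector per destination field.
    sel = [(control >> (2 * i)) & 0b11 for i in range(4)]
    # Lower two fields come from the first operand, upper two from the second.
    return [first[sel[0]], first[sel[1]], second[sel[2]], second[sel[3]]]

first_operand = ["A", "B", "C", "D"]     # |D|C|B|A|, element 0 first
second_operand = ["E", "F", "G", "H"]    # |H|G|F|E|, element 0 first
result = shuffle(first_operand, second_operand, 0b10001111)
# Render highest element first to match the notation used in the text.
rendered = "|" + "|".join(reversed(result)) + "|"
```

With the control word 10001111, the selectors are 3, 3, 0 and 2 from the lowest field upward, which reproduces the result |G|E|D|D| given in the example.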
FIG. 6a illustrates a schematic for performing a shuffle operation on two numbers according to one embodiment of the invention. The device 600 reads the contents of a first source packed data operand 605 and a second source packed data operand 610. A four to one data multiplexer 615 shuffles any one of data elements {D,C,B,A} from either of data operands 605, 610 into the lower field of destination data item 625. A four to one data multiplexer 620 shuffles any one of data elements {D,C,B,A} from either of data operands 605, 610 into the upper field of destination data item 625.
FIG. 6b illustrates a diagram for performing a shuffle operation on two numbers according to one embodiment of the invention. The device 630 reads the contents of a first source packed data operand 635. Any one of data elements {D,C,B,A} from the data operand 635 are shuffled into the lower two fields of destination data item 645. The device 630 then reads the contents of a second source packed data operand 640. Any one of data elements {H,G,F,E} from the data operand 640 are shuffled into the upper two fields of destination data item 645. The first source data operand 635 may be the same as the second source data operand 640. This method of shuffling may be performed with only an 8-bit control word.
Accordingly, a shuffle operation is performed. Although FIGS. 6a and 6b illustrate an example of the shuffle operation with data operands having two data elements, the principles of the invention may also be implemented in data operands having more than two elements.
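The case in which the first source data operand is the same as the second source data operand may be illustrated with a short software model. In this sketch, which is an assumption-labeled illustration only, the hypothetical helper `make_control` assembles the 8-bit control word from four 2-bit selectors, and `shuffle` models the four-element operation with data element 0 first in each list.

```python
def make_control(sel0, sel1, sel2, sel3):
    """Pack four 2-bit selectors into an 8-bit control word (sel0 in bits 0,1)."""
    for s in (sel0, sel1, sel2, sel3):
        if not 0 <= s <= 3:
            raise ValueError("each selector must fit in 2 bits")
    return sel0 | (sel1 << 2) | (sel2 << 4) | (sel3 << 6)

def shuffle(first, second, control):
    """Lower two fields select from `first`, upper two fields from `second`."""
    sel = [(control >> (2 * i)) & 0b11 for i in range(4)]
    return [first[sel[0]], first[sel[1]], second[sel[2]], second[sel[3]]]

operand = ["A", "B", "C", "D"]           # |D|C|B|A|, element 0 first

# Same operand supplied twice: any arrangement of its four elements is reachable.
reverse_control = make_control(3, 2, 1, 0)
reversed_elems = shuffle(operand, operand, reverse_control)
broadcast_b = shuffle(operand, operand, make_control(1, 1, 1, 1))

# Distinct operands: the lower half can only draw from the first operand.
other = ["E", "F", "G", "H"]
mixed = shuffle(operand, other, make_control(0, 1, 0, 1))
```

Restricting each half of the destination to one source is what allows two 4-element sources to be addressed with only 2 bits per destination field, i.e., with an 8-bit control word.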
The shuffle instruction of the present invention may be used as part of many different applications. For example, FIG. 7 is a general block diagram illustrating the use of a digital filter which utilizes a shuffle operation for filtering a TV broadcast signal according to one embodiment of the invention. FIG. 7 shows TV broadcast signals 703 representing a television broadcast being received by a receiving unit 706 of a computer system 700. The receiving unit 706 receives the TV broadcast signals 703 and transforms them into digital data 709. A digital filter unit 715 performs a digital filter (e.g., FIR, IIR, etc.) on the digital data 709 using a set of coefficients 712. As a result, the digital filter unit 715 generates filtered data 718 (also termed as “filtered data items”) representing the filtered analog TV broadcast signals. In performing the filtering operation, shuffle operations are implemented. The filtered data 718 are received by a video decoder 721 for conversion into audio and video data 724. The techniques performed by video decoder 721 are well known (see Jack, Smith, Keith, “NTSC/PAL Digital Decoder”, Video Demystified, High Text Publications, Inc., 1993). The audio and video data can be used for any purpose (e.g., display on a screen).
In one embodiment, the computer system 100 shown in FIG. 1 is used to implement the computer system 700 in FIG. 7. In this embodiment, the TV broadcast signal receiver 131 acts as the receiving unit 706 and may include a TV tuner, an analog to digital converter, and a DMA channel. The TV broadcast signals 703 are received by the TV tuner, converted into digital data by the analog to digital converter, and then stored in the storage device 110 by the DMA channel. It will be recognized by one of ordinary skill in the art that the digital data stored by the TV broadcast signal receiver 131 may be stored in any number of formats. For example, the TV broadcast signal receiver 131 may store the data in the main memory in one or more of the formats described herein, storing two representations of each of the components of the data such that it may be read in as a packed data item in the described formats. This data may then be accessed as packed data and copied into registers on the processor 105. Since the data is stored in the disclosed formats, the processor 105 can easily and efficiently perform the shuffle operation as described with reference to FIG. 5 and FIG. 6. It will be recognized by one of ordinary skill in the art that the receiving unit 706 may encompass additional hardware, software, and/or firmware in the TV broadcast signal receiver 131 or software executing on the processor 105. For example, additional software may be stored in the storage device 110 for further processing the data prior to the digital filter being performed.
In this embodiment, the digital filter unit 715 is implemented using the processor 105 and the software 136 to perform a digital filter. In this embodiment, the processor 105, executing the software 136, performs the digital filter using shuffle operations, and stores the filtered data 718 in storage device 110. In this manner, the digital filter is performed by the host processor of the computer system, rather than the TV broadcast signal receiver 131. As a result, the complexity of the TV broadcast signal receiver 131 is reduced. In this embodiment, the video decoder 721 may be implemented in any number of different combinations of hardware, software, and/or firmware. The audio and video data 724 can then be stored, and/or displayed on the display 125 and the sound unit 134, respectively.
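The description does not specify how the shuffle operations enter the filtering computation. One plausible arrangement, offered here purely as an assumption, is to use a shuffle with both sources set to the coefficient operand in order to replicate one coefficient into all four fields, and then to apply packed multiply and packed add operations. The sketch below models a 4-tap FIR filter in that manner; the names `shuffle`, `broadcast` and `fir4_block`, the tap count, and the sample data are all hypothetical.

```python
def shuffle(first, second, control):
    """Four-element shuffle model; lists hold data element 0 first."""
    sel = [(control >> (2 * i)) & 0b11 for i in range(4)]
    return [first[sel[0]], first[sel[1]], second[sel[2]], second[sel[3]]]

def broadcast(reg, k):
    """Replicate element k of reg into all four fields via a same-source shuffle."""
    control = k * 0b01010101             # the same 2-bit selector in every field
    return shuffle(reg, reg, control)

def fir4_block(samples, start, coeffs):
    """Compute four consecutive outputs y[start..start+3] of a 4-tap FIR filter.

    y[n] = sum over k of coeffs[k] * samples[n - k]; requires start >= 3
    and start + 4 <= len(samples).
    """
    acc = [0.0, 0.0, 0.0, 0.0]
    for k in range(4):
        tap = broadcast(coeffs, k)                       # |ck|ck|ck|ck|
        window = samples[start - k:start - k + 4]        # x[n-k] for the four n
        acc = [a + t * w for a, t, w in zip(acc, tap, window)]  # packed mul, add
    return acc

samples = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]   # stand-in for digital data 709
coeffs = [0.5, 0.25, 0.125, 0.125]                   # stand-in for coefficients 712
filtered = fir4_block(samples, 3, coeffs)
```

Each pass of the loop produces a partial sum for four output samples at once, so four taps require four shuffle, multiply and add steps for every four outputs in this model.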
FIG. 8 is a general block diagram illustrating the use of a shuffle operation for rendering graphical objects in animation according to one embodiment of the invention. FIG. 8 shows a computer system 800 containing digital data 810 representing 3-dimensional (3D) graphics. The digital data 810 may be stored on a CD ROM or other type of storage device for later use. At some time, the conversion unit 820 performs alteration of data using 3D geometry which includes the use of a shuffle operation to manipulate (e.g., scale, rotate, etc.) a 3D object in providing animation. The resulting graphical object 830 is then displayed on a screen display 840. The resulting graphical object may also be transmitted to a recording device (e.g., magnetic storage, such as tape).
In one embodiment, the computer system 100 shown in FIG. 1 is used to perform the graphics operation 800 from FIG. 8. In this embodiment, the digital data 810 from FIG. 8 is any data stored in the storage device 110 representing 3D graphics. In one embodiment, the conversion unit 820 from FIG. 8 is implemented using the processor 105 and the software 136 to alter data using 3D geometry. An example of such alteration of data includes the performance of a 3D transformation. In this embodiment, the processor 105, executing the software 136, performs the transformation and stores the transformed data 830 in the storage device 110 and/or provides the transformed data to the graphics unit 135. In this manner, the 3D manipulation performed by the host processor of the computer system is provided at an increased speed. The present invention thus facilitates the performance of a shuffle operation through the use of available instruction sequences.
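As one hedged illustration of a 3D transformation that relies on shuffle operations, a 4x4 matrix may be applied to a homogeneous vertex by replicating each vertex component across a packed operand and accumulating scaled matrix columns. The column-major layout, the function names `shuffle`, `broadcast` and `transform`, and the sample translation matrix are assumptions made for this sketch; they are not taken from the described embodiment.

```python
def shuffle(first, second, control):
    """Four-element shuffle model; lists hold data element 0 first."""
    sel = [(control >> (2 * i)) & 0b11 for i in range(4)]
    return [first[sel[0]], first[sel[1]], second[sel[2]], second[sel[3]]]

def broadcast(reg, k):
    """Replicate element k of reg into all four fields via a same-source shuffle."""
    return shuffle(reg, reg, k * 0b01010101)

def transform(columns, vertex):
    """Return M * v, where columns[j] is column j of M held as packed data."""
    acc = [0.0, 0.0, 0.0, 0.0]
    for j in range(4):
        vj = broadcast(vertex, j)                        # |vj|vj|vj|vj|
        acc = [a + c * s for a, c, s in zip(acc, columns[j], vj)]  # packed mul, add
    return acc

# Sample matrix: translation by (1, 2, 3), stored column by column (assumed layout).
translate = [
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
    [1.0, 2.0, 3.0, 1.0],
]
vertex = [10.0, 20.0, 30.0, 1.0]                         # (x, y, z, w)
moved = transform(translate, vertex)
```

In this model the shuffle supplies the data movement (component replication) while the packed multiply and add supply the arithmetic, so each vertex requires four shuffles, four packed multiplies and four packed adds.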
While several example uses of shuffle operations have been described, it will be understood by one of ordinary skill in the art that the invention is not limited to these uses. In addition, while the invention has been described in terms of several embodiments, those skilled in the art will recognize that the invention is not limited to the embodiments described. The method and apparatus of the invention can be practiced with modification and alteration within the spirit and scope of the appended claims. The description is thus to be regarded as illustrative instead of limiting on the invention.

Claims (56)

What is claimed is:
1. A computer system comprising:
a hardware unit to transmit data representing graphics to another computer or a display;
a processor coupled to the hardware unit; and
a storage device coupled to the processor and having stored therein an instruction, which when executed by the processor, causes the processor to at least,
access a first packed data operand having at least two data elements;
access a second packed data operand having at least two data elements;
select a first set of data elements from the first packed data operand;
copy each of the data elements in the first set to specified data fields located in the lower half of a destination operand;
select a second set of data elements from the second packed data operand; and
copy each of the data elements in the second set to specified data fields located in the upper half of the destination operand.
2. The computer system of claim 1 wherein the storage device further comprises a packing device for packing floating point data into the data elements.
3. The computer system of claim 1 wherein the storage device further comprises a packing device for packing integer data into the data elements.
4. A system as claimed in claim 1 wherein the first and second packed data operands are the same operand.
5. A method comprising the computer-implemented steps of:
decoding a single instruction;
in response to the step of decoding the single instruction,
accessing a first packed data operand having at least two data elements;
accessing a second packed data operand having at least two data elements;
selecting a first set of data elements from the first packed data operand;
copying each of the data elements in the first set to specified data fields located in the lower half of a destination operand;
selecting a second set of data elements from the second packed data operand; and
copying each of the data elements in the second set to specified data fields located in the upper half of the destination operand.
6. The method of claim 5 further comprising the step of packing floating point data into the data elements.
7. The method of claim 5 further comprising the step of packing integer data into the data elements.
8. A method as claimed in claim 5 wherein the first and second packed data operands are the same operand.
9. A method comprising the computer implemented steps of:
accessing data representative of a first three-dimensional image;
altering the data using three-dimensional geometry to generate a second three-dimensional image, the step of altering at least including,
accessing a first packed data operand having at least two data elements;
accessing a second packed data operand having at least two data elements;
selecting a first set of data elements from the first packed data operand;
copying each of the data elements in the first set to specified data fields located in the lower half of a destination operand;
selecting a second set of data elements from the second packed data operand;
copying each of the data elements in the second set to specified data fields located in the upper half of the destination operand; and
displaying the second three-dimensional image.
10. The method of claim 9 wherein the step of altering includes the performance of a three-dimensional transformation.
11. The method of claim 9 wherein the step of altering includes the step of packing floating point data into the data elements.
12. The method of claim 9 wherein the step of altering includes the step of packing integer data into the data elements.
13. A method as claimed in claim 9 wherein the first and second packed data operands are the same operand.
14. A method comprising the computer implemented steps of:
accessing data representative of a first three-dimensional image;
altering the data using three-dimensional geometry to generate a second three-dimensional image, the step of altering at least including,
accessing a first packed data operand having at least two data elements;
accessing a second packed data operand having at least two data elements;
selecting a first set of data elements from the first packed data operand;
copying each of the data elements in the first set to specified data fields located in the lower half of a destination operand;
selecting a second set of data elements from the second packed data operand;
copying each of the data elements in the second set to specified data fields located in the upper half of the destination operand; and
displaying the second three-dimensional image.
15. The method of claim 14 wherein the step of altering includes the performance of a three-dimensional transformation.
16. The method of claim 14 wherein the step of altering includes the step of packing floating point data into the data elements.
17. The method of claim 14 wherein the step of altering includes the step of packing integer data into the data elements.
18. A method as claimed in claim 14 wherein the first and second packed data operands are the same operand.
19. A processor-implemented method for reducing the number of control bits required to shuffle packed data elements from first and second source operands, comprising the steps of:
decoding a single instruction specifying first and second source operands and a field of control bits; and
responsive to the field of control bits, generating a resultant packed data operand comprised of packed data elements from the first and second source operands,
wherein the control bits are limited to specifying for the upper and lower halves of the resultant packed data operand, data elements from the first and second source operands, respectively.
20. The method as claimed in claim 19 wherein the first and second packed data source operands and the resultant packed data operand are comprised of four packed data elements, and the field of control bits is an 8-bit field.
21. The method as claimed in claim 19 wherein the first and second packed data source operands are the same operand.
22. The method as claimed in claim 19 wherein the first and second packed data source operands are packed with floating point data.
23. A processor for performing a shuffle operation in response to a shuffle instruction comprising:
a decoder which decodes a single instruction specifying first and second source operands and a field of control bits; and
an execution unit which, responsive to the field of control bits, generates a resultant packed data operand comprised of packed data elements from the first and second source operands,
wherein the control bits are limited to specifying for the upper and lower halves of the resultant packed data operand, data elements from the first and second source operands, respectively.
24. The processor as claimed in claim 23 wherein the first and second source operands are the same operand.
25. The method as claimed in claim 19 wherein the first and second packed data source operands and the resultant packed data operand are each comprised of at least two packed data elements.
26. The method as claimed in claim 19 wherein the field of control bits is an 8-bit field.
27. The method as claimed in claim 26 wherein an 8-bit immediate to fill the field of control bits is decoded with the single instruction.
28. The processor of claim 23 wherein said field of control bits comprises of an 8-bit immediate value.
29. The processor of claim 23 wherein said field of control bits comprises of an 8-bits.
30. The processor of claim 29 wherein said first and second source operands comprise of double-precision floating-point values.
31. The processor of claim 29 wherein said first and second source operands comprise single-precision floating-point values.
32. The processor of claim 29 wherein said packed data elements comprise of packed double words.
33. The processor of claim 29 wherein said packed data elements comprise of packed words.
34. The processor of claim 29 wherein said packed data elements comprise of packed bytes.
35. The processor of claim 29 wherein said first and said second operands comprise of 128-bits of packed data.
36. An apparatus comprising:
a decode unit to decode a shuffle instruction into control signals, said shuffle instruction to include a first operand, a second operand, and a third operand wherein said third operand comprises of an 8-bit immediate value;
said first operand to identify a first register to hold at least two packed data elements;
said second operand to identify a memory location to hold at least two packed data elements;
said third operand is to provide selection bits to indicate which of said packed data elements in said first operand and said second operand to select and copy to a resultant register; and
an execution unit coupled to said decode unit, said execution unit responsive to said control signals and said selection bits to select a first set of data elements from said first register and to copy said first set of data elements to one or more lower destination fields of said resultant register, said execution unit further responsive to said control signals and said selection bits to select a second set of data elements from said memory location and to copy said second set of data elements to one or more upper destination fields of said resultant register.
37. The apparatus of claim 36 wherein said data elements of said first register and said second register comprise double-precision floating-point values.
38. The apparatus of claim 36 wherein said data elements of said first register and said second register comprise of single-precision floating-point values.
39. The apparatus of claim 36 wherein said packed data elements comprise of packed double words.
40. The apparatus of claim 36 wherein said packed data elements comprise of packed words.
41. The apparatus of claim 36 wherein said packed data elements comprise of packed bytes.
42. The apparatus of claim 36 wherein said first register is also said resultant register.
43. An apparatus comprising:
an instruction decoder to receive and decode a shuffle instruction, said shuffle instruction to include an immediate operand comprising two or more sets of control bits;
a first source register to hold a first packed data, said first packed data comprising of a first data element and a second data element;
a second source register to hold a second packed data, said second packed data comprising of a third data element and a fourth data element;
a destination register to hold a third packed data;
an execution unit coupled to said first source register to receive said first packed data, and to said second source register to receive said second packed data; and
wherein said execution unit is further coupled to said instruction decoder to receive said two or more sets of control bits, said execution unit to select from said first source register at least one of said first and second data elements in response to a first one of said two or more sets of control bits and to copy said selected data element from said first source register to a first data field in a lower half of said destination register, and said execution unit to select from said second source register at least one of said third and fourth data elements in response to a second one of said two or more sets of control bits and to copy said selected data element from said second source register to a second data field in an upper half of said destination register.
44. The apparatus of claim 43 wherein said immediate operand is an 8-bit immediate operand.
45. The apparatus of claim 43 wherein said data elements of said first source register and said second source register comprise of double-precision floating-point values.
46. The apparatus of claim 43 wherein said data elements of said first source register and said second source register comprise of single-precision floating-point values.
47. The apparatus of claim 43 wherein said packed data comprise of packed double words.
48. The apparatus of claim 43 wherein said packed data comprise of packed words.
49. The apparatus of claim 43 wherein said packed data comprise of packed bytes.
50. The apparatus of claim 43 wherein said apparatus is defined by machine readable data on a machine readable medium.
51. The apparatus of claim 43 wherein said first source register is also said destination register.
52. The apparatus of claim 43 wherein said first source register is the same as said second source register.
53. The apparatus of claim 43 wherein said two or more sets of control bits comprise bits 0 and 1 of the immediate operand.
54. The apparatus of claim 44 wherein said 8-bit immediate operand comprises bits 0 and 1 to select from said first source register which data element is copied into the lowest data field in the lower half of the destination register, and bits 4 and 5 to select from said second source register which data element is copied into the lowest data field in the upper half of the destination register.
55. The apparatus of claim 44 wherein said 8-bit immediate operand comprises bits 0 through 3 to select from said first source register which data elements are copied into the lower half of the destination register, and bits 4 through 7 to select from said second source register which data elements are copied into the upper half of the destination register.
56. The apparatus of claim 55 wherein said 8-bit immediate operand comprises bits 2 and 3 to select from said first source register which data element is copied into the highest data field in the lower half of the destination register, and bits 6 and 7 to select from said second source register which data element is copied into the highest data field in the upper half of the destination register.
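The bit-field selection recited in claims 54 through 56 can be illustrated with a small software model. The sketch below is only an illustration under stated assumptions, not the claimed hardware: the type `packed4` and the function `shuffle4` are hypothetical names introduced here, each register is assumed to hold four 32-bit single-precision elements, and element 0 is assumed to be the lowest data field.

```c
#include <stdint.h>

/* Hypothetical model of a register holding four packed data elements;
   e[0] is assumed to be the lowest data field. */
typedef struct { float e[4]; } packed4;

/* Illustrative model of the selection in claims 54-56 (names are assumptions).
   Each 2-bit field of the 8-bit immediate picks one of four source elements:
     bits 0-1: element of src1 copied to the lowest field of the lower half
     bits 2-3: element of src1 copied to the highest field of the lower half
     bits 4-5: element of src2 copied to the lowest field of the upper half
     bits 6-7: element of src2 copied to the highest field of the upper half
   Arguments are passed by value, so a caller may assign the result back to
   src1 (claim 51) or pass the same register as both sources (claim 52). */
static packed4 shuffle4(packed4 src1, packed4 src2, uint8_t imm8)
{
    packed4 dst;
    dst.e[0] = src1.e[imm8 & 3];
    dst.e[1] = src1.e[(imm8 >> 2) & 3];
    dst.e[2] = src2.e[(imm8 >> 4) & 3];
    dst.e[3] = src2.e[(imm8 >> 6) & 3];
    return dst;
}
```

For example, with `a = {0, 1, 2, 3}` and `b = {4, 5, 6, 7}`, the call `shuffle4(a, b, 0x4E)` produces `{2, 3, 4, 5}`. Passing the same register twice, `shuffle4(a, a, 0x1B)`, reverses its elements to `{3, 2, 1, 0}`.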
US10/104,205 1998-03-31 2002-03-21 Dual function system and method for shuffling packed data elements Expired - Lifetime USRE45458E1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US10/104,205 USRE45458E1 (en) 1998-03-31 2002-03-21 Dual function system and method for shuffling packed data elements

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US09/052,996 US6041404A (en) 1998-03-31 1998-03-31 Dual function system and method for shuffling packed data elements
US10/104,205 USRE45458E1 (en) 1998-03-31 2002-03-21 Dual function system and method for shuffling packed data elements

Related Parent Applications (1)

Application Number Title Priority Date Filing Date
US09/052,996 Reissue US6041404A (en) 1998-03-31 1998-03-31 Dual function system and method for shuffling packed data elements

Publications (1)

Publication Number Publication Date
USRE45458E1 true USRE45458E1 (en) 2015-04-07

Family

ID=21981242

Family Applications (2)

Application Number Title Priority Date Filing Date
US09/052,996 Expired - Lifetime US6041404A (en) 1998-03-31 1998-03-31 Dual function system and method for shuffling packed data elements
US10/104,205 Expired - Lifetime USRE45458E1 (en) 1998-03-31 2002-03-21 Dual function system and method for shuffling packed data elements

Family Applications Before (1)

Application Number Title Priority Date Filing Date
US09/052,996 Expired - Lifetime US6041404A (en) 1998-03-31 1998-03-31 Dual function system and method for shuffling packed data elements

Country Status (6)

Country Link
US (2) US6041404A (en)
EP (1) EP0947918A3 (en)
CN (1) CN1158613C (en)
BR (1) BR9901145A (en)
SG (1) SG77228A1 (en)
TW (1) TW452735B (en)

Families Citing this family (54)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5953241A (en) 1995-08-16 1999-09-14 Microunity Engeering Systems, Inc. Multiplier array processing system with enhanced utilization at lower precision for group multiply and sum instruction
US6643765B1 (en) * 1995-08-16 2003-11-04 Microunity Systems Engineering, Inc. Programmable processor with group floating point operations
US6295599B1 (en) * 1995-08-16 2001-09-25 Microunity Systems Engineering System and method for providing a wide operand architecture
US7301541B2 (en) * 1995-08-16 2007-11-27 Microunity Systems Engineering, Inc. Programmable processor and method with wide operations
US6041404A (en) 1998-03-31 2000-03-21 Intel Corporation Dual function system and method for shuffling packed data elements
US7392275B2 (en) * 1998-03-31 2008-06-24 Intel Corporation Method and apparatus for performing efficient transformations with horizontal addition and subtraction
US7395302B2 (en) 1998-03-31 2008-07-01 Intel Corporation Method and apparatus for performing horizontal addition and subtraction
US7133040B1 (en) * 1998-03-31 2006-11-07 Intel Corporation System and method for performing an insert-extract instruction
US6418529B1 (en) * 1998-03-31 2002-07-09 Intel Corporation Apparatus and method for performing intra-add operation
ATE557342T1 (en) * 1998-08-24 2012-05-15 Microunity Systems Eng PROCESSOR AND METHOD FOR MATRIX MULTIPLICATION WITH A WIDE OPERAND
US7932911B2 (en) * 1998-08-24 2011-04-26 Microunity Systems Engineering, Inc. Processor for executing switch and translate instructions requiring wide operands
US7155601B2 (en) * 2001-02-14 2006-12-26 Intel Corporation Multi-element operand sub-portion shuffle instruction execution
US7818356B2 (en) 2001-10-29 2010-10-19 Intel Corporation Bitstream buffer manipulation with a SIMD merge instruction
US7725521B2 (en) * 2001-10-29 2010-05-25 Intel Corporation Method and apparatus for computing matrix transformations
US7631025B2 (en) * 2001-10-29 2009-12-08 Intel Corporation Method and apparatus for rearranging data between multiple registers
US20040054877A1 (en) 2001-10-29 2004-03-18 Macy William W. Method and apparatus for shuffling data
US7685212B2 (en) * 2001-10-29 2010-03-23 Intel Corporation Fast full search motion estimation with SIMD merge instruction
US7624138B2 (en) 2001-10-29 2009-11-24 Intel Corporation Method and apparatus for efficient integer transform
US7739319B2 (en) * 2001-10-29 2010-06-15 Intel Corporation Method and apparatus for parallel table lookup using SIMD instructions
US7853778B2 (en) 2001-12-20 2010-12-14 Intel Corporation Load/move and duplicate instructions for a processor
US7047383B2 (en) * 2002-07-11 2006-05-16 Intel Corporation Byte swap operation for a 64 bit operand
US8122227B2 (en) * 2004-11-03 2012-02-21 Silicon Hive B.V. SIMD processor for performing data filtering and/or interpolation
CN101059865B (en) * 2006-04-18 2010-05-12 太极控股有限公司 Instruction puckering mechanism and its method, and pixel processing system using same
US7761694B2 (en) * 2006-06-30 2010-07-20 Intel Corporation Execution unit for performing shuffle and other operations
US8417922B2 (en) * 2006-08-02 2013-04-09 Qualcomm Incorporated Method and system to combine multiple register units within a microprocessor
US8502819B1 (en) 2007-12-17 2013-08-06 Nvidia Corporation System and method for performing ray tracing node traversal in image rendering
US8289324B1 (en) 2007-12-17 2012-10-16 Nvidia Corporation System, method, and computer program product for spatial hierarchy traversal
US8909901B2 (en) * 2007-12-28 2014-12-09 Intel Corporation Permute operations with flexible zero control
US8078836B2 (en) 2007-12-30 2011-12-13 Intel Corporation Vector shuffle instructions operating on multiple lanes each having a plurality of data elements using a common set of per-lane control bits
US9652231B2 (en) * 2008-10-14 2017-05-16 International Business Machines Corporation All-to-all permutation of vector elements based on a permutation pattern encoded in mantissa and exponent bits in a floating-point SIMD architecture
US8327344B2 (en) * 2008-10-14 2012-12-04 International Business Machines Corporation Array reference safety analysis in the presence of loops with conditional control flow
US8423983B2 (en) * 2008-10-14 2013-04-16 International Business Machines Corporation Generating and executing programs for a floating point single instruction multiple data instruction set architecture
US9086872B2 (en) * 2009-06-30 2015-07-21 Intel Corporation Unpacking packed data in multiple lanes
US8555036B1 (en) 2010-05-17 2013-10-08 Nvidia Corporation System and method for performing predicated selection of an output register
US8564589B1 (en) 2010-05-17 2013-10-22 Nvidia Corporation System and method for accelerated ray-box intersection testing
CN102109978A (en) * 2011-02-28 2011-06-29 孙瑞琛 Data rearrangement method and rearrangement device
WO2013089749A1 (en) * 2011-12-15 2013-06-20 Intel Corporation Methods to optimize a program loop via vector instructions using a shuffle table and a mask store table
CN104011646B (en) * 2011-12-22 2018-03-27 英特尔公司 For processor, method, system and the instruction of the sequence for producing the continuous integral number according to numerical order
US10223112B2 (en) 2011-12-22 2019-03-05 Intel Corporation Processors, methods, systems, and instructions to generate sequences of integers in which integers in consecutive positions differ by a constant integer stride and where a smallest integer is offset from zero by an integer offset
CN104011644B (en) 2011-12-22 2017-12-08 英特尔公司 Processor, method, system and instruction for generation according to the sequence of the integer of the phase difference constant span of numerical order
CN107391086B (en) 2011-12-23 2020-12-08 英特尔公司 Apparatus and method for improving permute instruction
WO2013095620A1 (en) 2011-12-23 2013-06-27 Intel Corporation Apparatus and method of improved insert instructions
CN108241504A (en) 2011-12-23 2018-07-03 英特尔公司 The device and method of improved extraction instruction
CN107220029B (en) 2011-12-23 2020-10-27 英特尔公司 Apparatus and method for mask permute instruction
US9946540B2 (en) 2011-12-23 2018-04-17 Intel Corporation Apparatus and method of improved permute instructions with multiple granularities
US9395988B2 (en) 2013-03-08 2016-07-19 Samsung Electronics Co., Ltd. Micro-ops including packed source and destination fields
US9766888B2 (en) 2014-03-28 2017-09-19 Intel Corporation Processor instruction to store indexes of source data elements in positions representing a sorted order of the source data elements
US9785565B2 (en) 2014-06-30 2017-10-10 Microunity Systems Engineering, Inc. System and methods for expandably wide processor instructions
EP3001307B1 (en) 2014-09-25 2019-11-13 Intel Corporation Bit shuffle processors, methods, systems, and instructions
CN105511838B (en) * 2014-09-29 2018-06-29 上海兆芯集成电路有限公司 Processor and its execution method
US9898286B2 (en) * 2015-05-05 2018-02-20 Intel Corporation Packed finite impulse response (FIR) filter processors, methods, systems, and instructions
US10503502B2 (en) 2015-09-25 2019-12-10 Intel Corporation Data element rearrangement, processors, methods, systems, and instructions
US10678545B2 (en) 2016-07-07 2020-06-09 Texas Instruments Incorporated Data processing apparatus having streaming engine with read and read/advance operand coding
CN111010883B (en) 2018-08-06 2022-07-12 华为技术有限公司 Matrix processing method and device and logic circuit

Patent Citations (47)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US3711692A (en) 1971-03-15 1973-01-16 Goodyear Aerospace Corp Determination of number of ones in a data field by addition
US3723715A (en) 1971-08-25 1973-03-27 Ibm Fast modulo threshold operator binary adder for multi-number additions
US4139899A (en) 1976-10-18 1979-02-13 Burroughs Corporation Shift network having a mask generator and a rotator
US4161784A (en) 1978-01-05 1979-07-17 Honeywell Information Systems, Inc. Microprogrammable floating point arithmetic unit capable of performing arithmetic operations on long and short operands
US4418383A (en) 1980-06-30 1983-11-29 International Business Machines Corporation Data flow component for processor and microprocessor systems
US4393468A (en) 1981-03-26 1983-07-12 Advanced Micro Devices, Inc. Bit slice microprogrammable processor for signal processing applications
US4498177A (en) 1982-08-30 1985-02-05 Sperry Corporation M Out of N code checker circuit
US4707800A (en) 1985-03-04 1987-11-17 Raytheon Company Adder/substractor for variable length numbers
US4771379A (en) 1985-10-23 1988-09-13 Mitsubishi Denki Kabushiki Kaisha Digital signal processor with parallel multipliers
US5327543A (en) * 1987-09-10 1994-07-05 Hitachi Ltd System for selectively masking operand portions for processing thereof
US4989168A (en) 1987-11-30 1991-01-29 Fujitsu Limited Multiplying unit in a computer system, capable of population counting
US5019968A (en) 1988-03-29 1991-05-28 Yulan Wang Three-dimensional vector processor
US4903228A (en) 1988-11-09 1990-02-20 International Business Machines Corporation Single cycle merge/logic unit
US5095457A (en) 1989-02-02 1992-03-10 Samsung Electronics Co., Ltd. Digital multiplier employing CMOS transistors
US5081698A (en) 1989-02-14 1992-01-14 Intel Corporation Method and apparatus for graphics display data manipulation
US5497497A (en) 1989-11-03 1996-03-05 Compaq Computer Corp. Method and apparatus for resetting multiple processors using a common ROM
US5168571A (en) 1990-01-24 1992-12-01 International Business Machines Corporation System for aligning bytes of variable multi-bytes length operand based on alu byte length and a number of unprocessed byte data
US5268995A (en) 1990-11-21 1993-12-07 Motorola, Inc. Method for executing graphics Z-compare and pixel merge instructions in a data processor
US5680161A (en) 1991-04-03 1997-10-21 Radius Inc. Method and apparatus for high speed graphics data compression
US5187679A (en) 1991-06-05 1993-02-16 International Business Machines Corporation Generalized 7/3 counters
US5321810A (en) 1991-08-21 1994-06-14 Digital Equipment Corporation Address method for computer graphics system
US5423010A (en) 1992-01-24 1995-06-06 C-Cube Microsystems Structure and method for packing and unpacking a stream of N-bit data to and from a stream of N-bit data words
US5426783A (en) 1992-11-02 1995-06-20 Amdahl Corporation System for processing eight bytes or less by the move, pack and unpack instruction of the ESA/390 instruction set
US5408670A (en) 1992-12-18 1995-04-18 Xerox Corporation Performing arithmetic in parallel on composite operands with packed multi-bit components
US5465374A (en) 1993-01-12 1995-11-07 International Business Machines Corporation Processor for processing data string by byte-by-byte
US5625374A (en) 1993-09-07 1997-04-29 Apple Computer, Inc. Method for parallel interpolation of images
US5390135A (en) 1993-11-29 1995-02-14 Hewlett-Packard Parallel shift and add circuit and method
US5487159A (en) * 1993-12-23 1996-01-23 Unisys Corporation System for processing shift, mask, and merge operations in one instruction
US5781457A (en) 1994-03-08 1998-07-14 Exponential Technology, Inc. Merge/mask, rotate/shift, and boolean operations from two instruction sets executed in a vectored mux on a dual-ALU
US5931945A (en) * 1994-04-29 1999-08-03 Sun Microsystems, Inc. Graphic system for masking multiple non-contiguous bytes having decode logic to selectively activate each of the control lines based on the mask register bits
US5594437A (en) 1994-08-01 1997-01-14 Motorola, Inc. Circuit and method of unpacking a serial bitstream
US5579253A (en) 1994-09-02 1996-11-26 Lee; Ruby B. Computer multiply instruction with a subresult selection option
US5802336A (en) 1994-12-02 1998-09-01 Intel Corporation Microprocessor capable of unpacking packed data
US6381690B1 (en) * 1995-08-01 2002-04-30 Hewlett-Packard Company Processor for performing subword permutations and combinations
WO1997007450A1 (en) 1995-08-16 1997-02-27 Microunity Systems Engineering, Inc. General purpose, programmable media processor
WO1997009671A1 (en) 1995-09-01 1997-03-13 Philips Electronics North America Corporation Method and apparatus for custom operations of a processor
US5819117A (en) * 1995-10-10 1998-10-06 Microunity Systems Engineering, Inc. Method and system for facilitating byte ordering interfacing of a computer system
WO1997032278A1 (en) 1996-02-15 1997-09-04 Intel Corporation Computer implemented method for compressing 48-bit pixels to 16-bit pixels
US6058465A (en) * 1996-08-19 2000-05-02 Nguyen; Le Trong Single-instruction-multiple-data processing in a multimedia signal processor
US5881259A (en) * 1996-09-23 1999-03-09 Arm Limited Input operand size and hi/low word selection control in data processing systems
US5909572A (en) 1996-12-02 1999-06-01 Compaq Computer Corp. System and method for conditionally moving an operand from a source register to a destination register
US6002881A (en) * 1997-06-10 1999-12-14 Arm Limited Coprocessor data access control
US5933650A (en) * 1997-10-09 1999-08-03 Mips Technologies, Inc. Alignment and ordering of vector elements for single instruction multiple data processing
US6223277B1 (en) * 1997-11-21 2001-04-24 Texas Instruments Incorporated Data processing circuit with packed data structure capability
US6041404A (en) 1998-03-31 2000-03-21 Intel Corporation Dual function system and method for shuffling packed data elements
US6115812A (en) 1998-04-01 2000-09-05 Intel Corporation Method and apparatus for efficient vertical SIMD computations
US6243808B1 (en) 1999-03-08 2001-06-05 Chameleon Systems, Inc. Digital data bit order conversion using universal switch matrix comprising rows of bit swapping selector groups

Non-Patent Citations (52)

* Cited by examiner, † Cited by third party
Title
"MIPS Digital Media Extension," Set Architecture Specification, http://www.mips.com/MDMXspec.ps (Oct. 21, 1997), 8 pages.
"Visual Instruction Set (VIS) User's Guide", Version 1.1, Mar. 1997, pp. i-xii & 1-136. *
"Visual Instruction Set (VIS™) User's Guide", Version 1.1, Mar. 1997, pp. i-vii & 1-136. *
Abbott, et al., "Broadband Algorithms with the MicroUnity Mediaprocessor," MicroUnity Systems Engineering, Inc., Proceedings of Compcon, IEEE, 1996, pp. 349-354.
Advanced Micro Devices, Inc., "AMD-3D Technology Manual," Feb. 1998, pp. 1-58.
Austrian Search Report, Appln. No. 9901342-7, Oct. 31, 2000, 7 pages.
Case, B., "Philips Hopes to Displace DSPs with VLIW, TriMedia Processors Aimed at Future Multimedia Embedded Apps," Microprocessor Report, Dec. 1994, pp. 12-18.
Craig Hansen, "Architecture of a Broadband Mediaprocessor" 1996 IEEE Proceedings of COMPCON '96, pp. 334-340.
Diefendorff, K., et al., "AltiVec Extension to PowerPC Accelerates Media Processing," IEEE, #0272-1732/00, 2000, pp. 85-95.
Errata to MC88110 Second Generation RISC Microprocessor User's Manual, Motorola, Inc., 1992, pp. 1-11.
European Search Report, EP 99 30 2378, Mar. 14, 2000, 3 pages.
Gwennap, L., "New PA-RISC Processor Decodes MPEG Video, HP's PA-7100LC Uses New Instructions to Eliminate Decoder Chip," Microprocessor Report, Jan. 1994, pp. 16-17.
Hansen, C., "Architecture of a Broadband Mediaprocessor," Proceedings of Compcon, IEEE, 1996, pp. 334-340.
Hansen, C., MicroUnity's MediaProcessor Architecture, 1996, IEEE, pp. 34-41. *
Hayes, et al., "MicroUnity Software Development Environment," MicroUnity Systems Engineering, Inc., Proceedings of Compcon, IEEE, 1996, pp. 341-348.
Hewlett Packard, "64-bit and Multimedia Extensions in the PA-RISC 2.0 Architecture," Microprocessors Precision Architecture, 1997, 18 pages.
i860™ Microprocessor Family Programmer's Reference Manual, Intel Corporation, 1992, Chapters 1, 3, 8, and 12.
Intel Corporation, "IA-32 Intel® Architecture Software Developer's Manual, vol. I: Basic Architecture," 2002, 21 pgs. total.
Intel Corporation, "IA-32 Intel® Architecture Software Developer's Manual, vol. II: Instruction Set Reference," 2002, 19 pgs. total.
Intel Corporation, "Intel Architecture Software Developer's Manual, vol. 2; Instruction Set Reference," 1999, 26 pages.
Intel Corporation, "Pentium® Processor Family Developer's Manual, vol. 3: Architecture and Programming Manual," 1995, 54 pgs. total.
Intel Corporation, "Pentium® Processor User's Manual, vol. 3: Architecture and Programming Manual," 1993, 50 pgs. total.
Intel Corporation, "Willamette Processor Developer's Guide," Manual, Feb. 2000, 16 pages.
Intel Corporation, "Intel486™ Microprocessor Family Programmer's Reference Manual," 1992, 44 pgs. total.
Intel Corporation, "Intel® Itanium™ Architecture Software Developer's Manual, vol. 3: Instruction Set Reference," Rev. 2.0, Dec. 2001, 30 pgs. total.
Intel i750, i860™, i960 Processors and Related Products, 1993, pp. 1-3.
Kawakami, Y., et al., "A Single-Chip Digital Signal Processor for Voiceband Applications," IEEE, 1980 International Solid-State Circuits Conference, pp. 40-41.
Lee, R.B., "Accelerating Multimedia with Enhanced Microprocessors," IEEE Micro, Apr. 1995, pp. 22-32.
Levinthal, et al., "Chap-A SIMD Graphics Processor," Computer Graphics Project, ACM, vol. 18, No. 3, Jul. 1984, pp. 77-81.
Levinthal, et al., "Parallel Computers for Graphics Applications," Proceedings: Second Int'l. Conf. on Architectural Support for Programming Languages and Operating Systems, (ASPLOS II), IEEE, 1987, pp. 193-198.
Mano, Morris M., Computer System Architecture, 1982, Prentice Hall, 2d Ed., pp. 140-144. *
Margulis, N., "i860 Microprocessor Architecture," McGraw Hill, Inc., 1990, Chapters 6, 7, 8, 10, and 11.
MC88110 Programmer's Reference Guide, Motorola, Inc., 1992, pp. 1-4.
MC88110 Second Generation-RISC Microprocessor User's Manual, Motorola, Inc., Sep. 1992, pp. 2-1 through 2-22, 3-1 through 3-32, 5-1 through 5-25, 10-62 through 10-71, Index 1 through 17.
Motorola MC88110 Second Generation RISC Microprocessor User's Manual, Motorola, Inc., 1991.
Motorola, MC68020 32-bit Microprocessor User's Manual 1985, Prentice Hall, 2d Ed. pp. B-101 -B-103, B-135, B-136, B-169,B-170. *
Peleg, A., et al., Intel MMX for Multimedia PCs, Jan. 1997, Communications of the ACM, vol. 40, No. 1, pp. 25-38. *
Pentium Processor's User's Manual, vol. 3: Architecture and Programming Manual, Intel Corporation, 1993, Chapters 1, 3, 4, 6, 8, and 18.
Shipnes, J., "Graphics Processing with the 88110 RISC Microprocessor," Motorola, Inc., IEEE, No. 0-8186-26455-0/92, 1992, pp. 169-174.
Silicon Graphics, "Silicon Graphics Introduces Compact MIPS® RISC Microprocessor Code for High Performance at a Low Cost," Oct. 21, 1996, 13 pages.
Sun Microsystems, Ultrasparc™ The Visual Instruction Set (VIS™): On Chip Support for New-Media Processing, Whitepaper 95-022, 1996, 7 pages.
TMS320c2X, User's Guide, Digital Signal Processing Products, Texas Instruments, 1993, pp. 3-2-3-11; 3-28-3-34; 4-1-4-22; 4-41; 4-103; 4-119; 4-120; 4-122; 4-150; 4-151.
Tri-Media, "TM1000 Preliminary Data Book," Phillips Electronics No. Amer., 1997, 30 pages.
U.S. Appl. No. 13/732,243, filed Dec. 31, 2012.
U.S. Appl. No. 14/283,020, filed May 20, 2014.
UltraSPARC Multimedia Capabilities On-Chip Support for Real-Time Video and Advanced Graphics; SPARC Technology Business, Sep. 1994, Sun Microsystems, Inc.
Wang, et al., "A Processor Architecture for 3D Graphics Calculations," Computer Motion, Inc., Goleta, CA, 23 pgs.

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9959247B1 (en) 2017-02-17 2018-05-01 Google Llc Permuting in a matrix-vector processor
US10216705B2 (en) 2017-02-17 2019-02-26 Google Llc Permuting in a matrix-vector processor
US10592583B2 (en) 2017-02-17 2020-03-17 Google Llc Permuting in a matrix-vector processor
US10614151B2 (en) 2017-02-17 2020-04-07 Google Llc Permuting in a matrix-vector processor
US10956537B2 (en) 2017-02-17 2021-03-23 Google Llc Permuting in a matrix-vector processor
US11748443B2 (en) 2017-02-17 2023-09-05 Google Llc Permuting in a matrix-vector processor

Also Published As

Publication number Publication date
CN1158613C (en) 2004-07-21
EP0947918A2 (en) 1999-10-06
TW452735B (en) 2001-09-01
BR9901145A (en) 1999-12-21
CN1234560A (en) 1999-11-10
US6041404A (en) 2000-03-21
EP0947918A3 (en) 2000-04-26
SG77228A1 (en) 2000-12-19

Similar Documents

Publication Publication Date Title
USRE45458E1 (en) Dual function system and method for shuffling packed data elements
US7155601B2 (en) Multi-element operand sub-portion shuffle instruction execution
US10831477B2 (en) In-lane vector shuffle instructions
US6211892B1 (en) System and method for performing an intra-add operation
US9015453B2 (en) Packing odd bytes from two source registers of packed data
US5666298A (en) Method for performing shift operations on packed data
US7133040B1 (en) System and method for performing an insert-extract instruction
US5815421A (en) Method for transposing a two-dimensional array
US5935240A (en) Computer implemented method for transferring packed data between register files and memory
US5951624A (en) Computer system to compress pixel bits
US6212627B1 (en) System for converting packed integer data into packed floating point data in reduced time
US20140189311A1 (en) System and method for performing a shuffle instruction
US6307553B1 (en) System and method for performing a MOVHPS-MOVLPS instruction

Legal Events

Date Code Title Description
FEPP Fee payment procedure

Free format text: PAYOR NUMBER ASSIGNED (ORIGINAL EVENT CODE: ASPN); ENTITY STATUS OF PATENT OWNER: LARGE ENTITY