US20030105945A1 - Methods and apparatus for a bit rake instruction - Google Patents
- Publication number
- US20030105945A1 (application US 10/282,919)
- Authority
- US
- United States
- Prior art keywords
- bit
- mask
- bits
- programmable apparatus
- path
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Abandoned
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/30—Arrangements for executing machine instructions, e.g. instruction decode
- G06F9/30003—Arrangements for executing specific machine instructions
- G06F9/30007—Arrangements for executing specific machine instructions to perform operations on data operands
- G06F9/30018—Bit or string instructions
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/30—Arrangements for executing machine instructions, e.g. instruction decode
- G06F9/30003—Arrangements for executing specific machine instructions
- G06F9/30007—Arrangements for executing specific machine instructions to perform operations on data operands
- G06F9/30032—Movement instructions, e.g. MOVE, SHIFT, ROTATE, SHUFFLE
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/30—Arrangements for executing machine instructions, e.g. instruction decode
- G06F9/30145—Instruction analysis, e.g. decoding, instruction word fields
Definitions
- the present invention relates generally to improvements in computational processing. More specifically, the present invention relates to a system and method for providing a bit rake instruction to extract a pattern of bits from a source register.
- the present invention provides a programmable system and method for performing a bit rake instruction which extracts an arbitrary pattern of bits from a source register, based on a mask provided in another register, and packs and right justifies the bits into a target register.
- the bit rake instruction allows any set of bits from the source register to be packed together.
- FIG. 1 illustrates an exemplary ManArray DSP and DMA subsystem appropriate for use with this invention
- FIG. 2A shows an exemplary encoding of a bit rake instruction in accordance with the present invention
- FIG. 2B shows an exemplary operation of a bit rake instruction in accordance with the present invention
- FIG. 2C shows syntax and operation of a bit rake instruction in accordance with the present invention
- FIGS. 3A and 3B show diagrams of a bit rake apparatus in accordance with the present invention
- FIG. 4 shows the sorting of groups of asserted mask bits in accordance with the present invention
- FIG. 5 shows a right-shift to left-shift example in accordance with the present invention
- FIG. 6 shows a 3-level shifter in accordance with the present invention
- FIG. 7 shows a data path diagram in accordance with the present invention
- FIG. 8 shows an adder tree in accordance with the present invention
- FIG. 9A shows a data path structure in accordance with the present invention
- FIG. 9B shows a shifter and multiplexer stage in accordance with the present invention.
- FIG. 10 shows a diagram of a bit rake instruction apparatus in accordance with the present invention
- Provisional Application Serial No. 60/288,965 filed May 4, 2001
- Provisional Application Serial No. 60/298,624 filed Jun. 15, 2001
- Provisional Application Serial No. 60/298,695 filed Jun. 15, 2001
- Provisional Application Serial No. 60/298,696 filed Jun. 15, 2001
- Provisional Application Serial No. 60/318,745 filed Sep. 11, 2001
- Provisional Application Serial No. ______ entitled “Methods and Apparatus for Video Coding” filed Oct. 30, 2001 all of which are assigned to the assignee of the present invention and incorporated by reference herein in their entirety.
- a ManArray 2×2 iVLIW single instruction multiple data stream (SIMD) processor 100 as shown in FIG. 1 may be adapted as described further below for use in conjunction with the present invention.
- Processor 100 comprises a sequence processor (SP) controller combined with a processing element-0 (PE0) to form an SP/PE0 combined unit 101, as described in further detail in U.S. patent application Ser. No. 09/169,072 entitled “Methods and Apparatus for Dynamically Merging an Array Controller with an Array Processing Element”.
- the SP/PE0 101 contains an instruction fetch (I-fetch) controller 103 to allow the fetching of “short” instruction words (SIW) or abbreviated-instruction words from a B-bit instruction memory 105, where B is determined by the application instruction-abbreviation process to be a reduced number of bits representing ManArray native instructions and/or to contain two or more abbreviated instructions as described in the present invention.
- I-fetch instruction fetch
- the fetch controller 103 provides the typical functions needed in a programmable processor, such as a program counter (PC), a branch capability, eventpoint loop operations (see U.S. Provisional Application Serial No. 60/140,245 entitled “Methods and Apparatus for Generalized Event Detection and Action Specification in a Processor” filed Jun. 21, 1999 for further details), and support for interrupts. It also provides the instruction memory control which could include an instruction cache if needed by an application.
- the I-fetch controller 103 controls the dispatch of instruction words and instruction control information to the other PEs in the system by means of a D-bit instruction bus 102 .
- the instruction bus 102 may include additional control signals as needed in an abbreviated-instruction translation apparatus.
- the execution units 131 in the combined SP/PE 0 101 can be separated into a set of execution units optimized for the control function; for example, fixed point execution units in the SP, and the PE 0 as well as the other PEs can be optimized for a floating point application.
- the execution units 131 are of the same type in the SP/PE 0 and the PEs.
- SP/PE 0 and the other PEs use a five instruction slot iVLIW architecture which contains a VLIW instruction memory (VIM) 109 and an instruction decode and VIM controller functional unit 107 which receives instructions as dispatched from the SP/PE 0 's I-fetch unit 103 and generates VIM addresses and control signals 108 required to access the iVLIWs stored in the VIM.
- VIM VLIW instruction memory
- Referenced instruction types are identified by the letters SLAMD in VIM 109 , where the letters are matched up with instruction types as follows: Store (S), Load (L), ALU (A), MAU (M), and DSU (D).
- the data memory interface controller 125 must handle the data processing needs of both the SP controller, with SP data in memory 121 , and PE 0 , with PE 0 data in memory 123 .
- the SP/PE 0 controller 125 also is the controlling point of the data that is sent over the 32-bit or 64-bit broadcast data bus 126 .
- the other PEs, 151 , 153 , and 155 contain common physical data memory units 123 ′, 123 ′′, and 123 ′′′ though the data stored in them is generally different as required by the local processing done on each PE.
- the interface to these PE data memories is also a common design in PEs 1 , 2 , and 3 and indicated by PE local memory and data bus interface logic 157 , 157 ′ and 157 ′′.
- Interconnecting the PEs for data transfer communications is the cluster switch 171, various aspects of which are described in greater detail in U.S. patent application Ser. No. 08/885,310 entitled “Manifold Array Processor”, now U.S. Pat. No. 6,023,753, and U.S. patent application Ser. No. 09/169,256 entitled “Methods and Apparatus for Manifold Array Processing”, and U.S. patent application Ser. No. 09/169,256 entitled “Methods and Apparatus for ManArray PE-to-PE Switch Control”.
- a primary interface mechanism is contained in a direct memory access (DMA) control unit 181 that provides a scalable ManArray data bus 183 that connects to devices and interface units external to the ManArray core.
- the DMA control unit 181 provides the data flow and bus arbitration mechanisms needed for these external devices to interface to the ManArray core memories via the multiplexed bus interface represented by line 185 .
- a high level view of a ManArray control bus (MCB) 191 is also shown in FIG. 1.
- a bit rake instruction operating as shown in diagram 220 of FIG. 2B copies all bits, determined by a mask register, such as Rye, from a source register, such as Rxe, and packs the bits into the least significant bit (LSB) positions of a target register, such as Rte.
- FIG. 2C shows a block diagram 250 of exemplary syntax and operation of a bit rake instruction in accordance with the present invention.
- the high order bits of Rte may be set to zero (.Z), to the most significant bit (MSB) of the extracted field (.X), or to the un-extracted (unmasked) Rxe bits (.U).
- Rye contains ‘1’s in the bit positions that are copied from Rxe to the LSB positions of Rte. Rye contains ‘0’s at the bit positions that are either copied from Rxe to the MSB positions of Rte, or are ignored. Thus, in a preferred embodiment, Rxe, Rye and Rte are the same size.
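- By way of illustration only, and not as a description of the claimed hardware, the packing behavior of the three instruction forms may be modeled in software as in the following Python sketch. The function name bit_rake, the width parameter, the single-letter fill argument standing in for the .Z, .X and .U forms, and the value returned by the .X form when no mask bit is set are all assumptions made for this example.

```python
def bit_rake(src, mask, width=32, fill="Z"):
    """Software model of the bit rake operation (illustrative only)."""
    full = (1 << width) - 1
    src &= full
    mask &= full
    packed = 0      # source bits selected by '1' mask bits
    unmasked = 0    # source bits found at '0' mask positions
    n = 0           # number of packed bits so far
    m = 0           # number of unmasked bits so far
    for i in range(width):              # scan from LSB to MSB
        bit = (src >> i) & 1
        if (mask >> i) & 1:
            packed |= bit << n
            n += 1
        else:
            unmasked |= bit << m
            m += 1
    if fill == "Z":                     # high-order bits set to zero
        return packed
    if fill == "X":                     # extend the MSB of the extracted field
        if n == 0:
            return 0                    # assumed result when nothing is extracted
        high = full & ~((1 << n) - 1)
        return packed | high if (packed >> (n - 1)) & 1 else packed
    if fill == "U":                     # unmasked bits occupy the high-order bits
        return packed | (unmasked << n)
    raise ValueError("fill must be 'Z', 'X' or 'U'")


print(hex(bit_rake(0xB6, 0xCA, width=8, fill="Z")))  # 0x9
print(hex(bit_rake(0xB6, 0xCA, width=8, fill="X")))  # 0xf9
print(hex(bit_rake(0xB6, 0xCA, width=8, fill="U")))  # 0xe9
```

With a mask of 0x55555555 the same function packs every other bit of a 32-bit source into the low 16 bits of the result.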
- the syntax and operation of the word .1W version 260 of a bit rake instruction is also shown in FIG. 2C.
- the lower case letters (a-f) represent unmasked source bit regions and the upper case letters (S, A-J) represent the masked source bits. S & A-J are merged toward the right, and either the unmasked source bit-regions (a-f) are merged toward the left, or zero or the most significant extracted bit (S), is extended toward the left.
- such instruction could be written as:
- the present invention includes techniques which segment the implementation of a bit rake instruction into multiple simpler problems which are more easily solved.
- the segmentation technique includes both temporal and spatial aspects. Multiple successive stages are employed with each stage building on the previous stage's result. Information flows through the stages temporally. Information at each stage is partitioned into multiple independent information groups, thereby improving operation concurrency spatially. As information advances through the stages, the number of independent information groups decreases while the size of each group increases. As the group size increases, so does the regularity of the information within, allowing increasingly efficient data movement at each successive stage.
- FIG. 3A shows a block diagram of a bit rake apparatus 300 in accordance with the present invention.
- the present invention may suitably include three primary functional blocks: an adder tree block 310 , a mask path block 320 and a data path block 330 , each comprising a plurality of stages.
- the adder tree 310 computes the sum of the number of mask bits in each of the groups for all power-of-two group sizes.
- the adder tree block 310 comprises a plurality of adder stages, with each adder's sum and carry output providing control to the corresponding mask path block 320 and data path block 330 .
- the mask path block 320 provides individual group masks at each stage for use in controlling the selection of data in the data path block 330 .
- data and mask movement in the mask path block 320 and data path block 330 utilizes a binary shifter followed by a multiplexer.
- the depth of the binary shifter increases by one multiplexer level with each stage advance. Shifting amounts and group sizes are restricted to powers-of-two to maintain minimal propagation delays through shifters, and yield the most efficient adder sizing.
- Propagation delays through the three primary functional blocks 310 , 320 and 330 and their inter-block controls 340 and 350 are preferably balanced. Results at each stage in all three blocks proceed through their paths in unison. Depending upon the implementation and technology process, the adder stage may include a slightly longer or shorter delay. Balancing the propagation delay aids in minimizing the overall critical timing path propagation delay.
- FIG. 3B shows a detailed view of the bit rake apparatus 300 .
- the data path block 330 is controlled by the adder tree 310 and the mask path block 320 .
- the numbers in the adder boxes in the bit adder tree 310 refer to the maximum value of the sum of the inputs. Consequently, the output of each adder block has a maximum value which is a power of two.
- the mask path block 320 is controlled by the adder tree block 310 . It is noted that depending upon the implementation and circuit technology chosen, the first several levels of the adder tree block 310 , mask path block 320 and data path 330 may undergo logic reduction to result in a more efficient gate usage and minimal delay, yet maintain the same functionality.
- FIG. 4 includes an exemplary diagram 400 showing how a 64-bit result may be obtained by successively sorting groups of asserted mask bits, such as mask bits contained in register Rye, in increasing powers-of-two sizes, starting with smaller groups, and progressively increasing the group size through an input 402 and a series of stages 404 , 406 , 408 , 410 , 412 and 414 .
- This technique may be suitably applied to the data values contained in register Rxe.
- sorting involves multiple independent bit groups of similar size.
- the extraction technique combines each pair of adjacent bit groups by realigning the left group into the right group using a binary shifter.
- By sorting in powers-of-two as shown in FIG. 4, a binary shifter of increasing size can be used at each level to provide an efficient realignment of bits, with little control logic cost or delay.
- a binary shifter may include a shifter with only power-of-two shift amounts, and shifts in only one direction.
- Input 402 shows a field of 64 bits. The “1”s represent asserted mask bits.
- Data movement from input 402 to stage 404 involves combining the 64 bits into 32 groups containing 2 bits each. Each adjacent pair of bits is combined into a 2-bit group by moving the “1” bits to the right. For example “00” becomes “00”, “01” becomes “01”, “10” becomes “01”, and “11” becomes “11”. Two mask bit movements occur in the transition from input 402 to stage 404 .
- Stage 404 shows 32 groups of 2-bit fields.
- Data movement from stage 404 to stage 406 involves utilizing sixteen adjacent pairs of 2-bit groups. In each of these sixteen group pairs, using the number of unasserted mask bits in the right group of each pair, the left group is shifted that amount to the right.
- bits 404 a have one “0” in the right group causing the left group of 2 bits to shift right 1 position.
- the “1” bit in the right group is retained, and becomes the rightmost bit in the resulting group of 4 bits (0011).
- the middle 2 bits (01) are from the shifted left group, and the remaining, leftmost bit is “0” filled by the mechanism.
- Stage 406 shows 16 groups of 4-bit fields. Data movement from stage 406 to stage 408 involves utilizing 8 adjacent pairs of 4-bit groups. In each of these 8 pairs in stage 406, the left group is shifted to the right by the number of unasserted mask bits in the right group. Any “1” bits in the right group are retained, and zeros are filled on the left according to the shift amount. As an example in stage 406, bits 406a are a right group of bits in which all 4 bits are asserted (1111). Since all of the bits are asserted, in moving from stage 406 to stage 408, the left group of bits (0001) is not shifted (the shift amount equals zero) and is combined with the right group to form 00011111.
- Bits 406 b are a right group of bits in which all 4 bits are unasserted (0000). Since all of the bits are unasserted, in moving from stage 406 to stage 408 , the left group of bits (0001) is shifted 4 positions and combined with the right group to form 00000001.
- Stage 408 shows 8 groups of 8-bit fields. Data movement from stage 408 to stage 410 involves 4 adjacent pairs of 8-bit groups. In each of these 4 pairs in stage 408 , the left group is shifted to the right by the number of unasserted mask bits in the right group. Any “1” bits in the right group are retained, and zeros are filled on the left according to the shift amount.
- Stage 410 shows 4 groups of 16-bit fields. Data movement from stage 410 to stage 412 involves 2 adjacent pairs of 16-bit groups. In each of these 2 pairs in stage 410 , the left group is shifted to the right by the number of unasserted mask bits in the right group. Any “1” bits in the right group are retained, and zeros are filled on the left according to the shift amount.
- Stage 412 shows 2 groups of 32-bit fields. Data movement from stage 412 to stage 414 involves both 32-bit groups. The left group is shifted to the right by the number of unasserted mask bits in the right group. Any “1” bits in the right group are retained, and zeros are filled on the left according to the shift amount.
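- The stage-by-stage sorting described above may be sketched in Python as follows. This is a behavioral model under stated assumptions rather than the disclosed circuit: the function name staged_rake and the history list are hypothetical, the width is assumed to be a power of two, and the data bits are assumed to be ANDed with the mask before the first stage so that only masked bits travel through the stages.

```python
def staged_rake(src, mask, width=64):
    """Pack masked bits by merging adjacent groups of size 1, 2, 4, ..."""
    m = mask & ((1 << width) - 1)
    d = src & m                       # data bits outside the mask are dropped
    history = [m]                     # mask field at the input and after each stage
    g = 1                             # current group size
    while g < width:
        low = (1 << g) - 1
        new_m = new_d = 0
        for base in range(0, width, 2 * g):
            right_m = (m >> base) & low
            left_m = (m >> (base + g)) & low
            right_d = (d >> base) & low
            left_d = (d >> (base + g)) & low
            # Number of unasserted mask bits in the right group.
            zeros = g - bin(right_m).count("1")
            # The left group sits g positions up and is shifted right by
            # `zeros`; asserted bits of the right group are retained.
            new_m |= (((left_m << g) >> zeros) | right_m) << base
            new_d |= (((left_d << g) >> zeros) | right_d) << base
        m, d = new_m, new_d
        history.append(m)
        g *= 2
    return d, history


result, stages = staged_rake(0xB6, 0xCA, width=8)
print(hex(result), [format(s, "08b") for s in stages])
```

For the 8-bit example in the last two lines, the mask field passes through 11001010, 11000101, 00110011 and 00001111, and the packed data value is 0x9.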
- FIG. 5 shows a diagram 500 of an exemplary right-shift to left-shift in accordance with the present invention.
- a pair of 4-bit groups 502 and 504 is shown generically as ABCD and WXYZ, respectively.
- the shift right (SHR) column 516 and shift left (SHL) column 518 border the result column 520 containing 8-bit data patterns for each case.
- the SHR column 516 shows how the left group is shifted to the right and “0”-filled to the left by an amount equal to the number of “0” bits in the right group.
- the shifted left group 502 is then merged with the “1” bits in the right group 504 .
- the SHL column 518 describes how the left group 502 is repositioned 4 bits to the right, aligning it exactly with the right group 504 , and then shifted to the left by an amount equal to the number of “1” bits in the right group 504 .
- the shifted left group 502 is merged with the “1” bits in the right group, and zero-filled to the left as required.
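- The equivalence of the SHR and SHL formulations may be checked with a short Python sketch such as the one below. The function names merge_shr and merge_shl are hypothetical, and the right group is assumed to be already sorted, that is, its “1” bits are right-justified.

```python
def merge_shr(left, right, g=4):
    """Right-shift form: the left group moves right by the zeros in `right`."""
    zeros = g - bin(right).count("1")
    return ((left << g) >> zeros) | right


def merge_shl(left, right, g=4):
    """Left-shift form: the left group is aligned with `right`,
    then moved left by the number of ones in `right`."""
    ones = bin(right).count("1")
    return (left << ones) | right


# Sorted right groups for g = 4: 0000, 0001, 0011, 0111, 1111.
sorted_rights = [(1 << k) - 1 for k in range(5)]
assert all(merge_shr(l, r) == merge_shl(l, r)
           for l in range(16) for r in sorted_rights)
```

In the left-shift form the shift amount is simply the count of asserted bits in the right group, which is the quantity a sum over the mask bits produces directly.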
- FIG. 6 shows a left binary shifter 600 in accordance with the present invention, where the blocks, for example block 610 , are two-to-one multiplexers. Unlike the previous discussion where only a 2-bit left shifter was required, this shifter is used in the next successive stage where each 8-bit field is left shifted from 0 to 7 positions.
- the example shown in FIG. 5 corresponds to the transition from stage 406 to stage 408 in FIG. 4
- the binary shifter 600 of FIG. 6 corresponds to the transition from stage 408 to stage 410 in FIG. 4.
- the S 2 , S 1 and S 0 inputs, which control the shift amount, are provided from an appropriate adder tree sum output.
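- A 3-level binary shifter built from two-to-one multiplexers may be modeled as in the following Python sketch. The names mux2 and binary_shift_left are hypothetical, and the assignment of S0, S1 and S2 to shift amounts of 1, 2 and 4 and the order of the levels are assumptions; FIG. 6 governs the actual arrangement.

```python
def mux2(sel, a, b):
    """Two-to-one multiplexer: returns a when sel is 0, b when sel is 1."""
    return b if sel else a


def binary_shift_left(value, s2, s1, s0, width=8):
    """Left shift by 0..7 using three levels of 2:1 multiplexers."""
    bits = [(value >> i) & 1 for i in range(width)]   # bits[0] is the LSB
    for sel, amount in ((s0, 1), (s1, 2), (s2, 4)):
        # Each output bit selects either its own input (no shift) or the
        # input `amount` positions to its right (shift), with zero fill.
        bits = [mux2(sel, bits[i], bits[i - amount] if i >= amount else 0)
                for i in range(width)]
    return sum(b << i for i, b in enumerate(bits))


print(format(binary_shift_left(0b00000101, 0, 1, 1), "08b"))  # 00101000
```

Each level contributes a single multiplexer delay, so an n-level shifter of this kind covers shift amounts from 0 to 2^n - 1.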
- FIG. 7 shows a data path tree diagram 700 , superimposed over the data fields, in accordance with the present invention.
- Each numbered box 702 of FIG. 7 represents the logic to shift and align data. For clarity of illustration, only a single box is associated with an element number.
- the shift amount and mask bits control the data path at each stage. The mask path directly determines which data bits are to be used.
- the rightmost data bits retain their previous stage's data value when their corresponding mask bits are asserted, and merge the left group's shifted data based upon the corresponding shift amount, as described in greater detail below with respect to FIGS. 9A and 9B.
- the binary shift amounts controlling the mask path and data path are generated from the Rye source.
- An adder tree 800 shown in FIG. 8 superimposed over the data fields, computes successive sums of bits on a power-of-two basis from 2-bit groups up to the larger 32-bit group for the adder tree functional block.
- Each box labeled as 2 designates an addition of two 1-bit numbers, and has an output range from 0 to 2.
- Each box labeled as 4 designates an addition of two 2-bit numbers, and has an output range from 0 to 4.
- Each box labeled as 8 designates an addition of two 3-bit numbers, and has an output range from 0 to 8.
- Each box labeled as 16 designates an addition of two 4-bit numbers, and has an output range from 0 to 16.
- Each box labeled as 32 designates an addition of two 5-bit numbers, and has an output range from 0 to 32. Most of the intermediate sums as well as the final sum are utilized to provide controlling data at each stage, as seen in FIG. 3B and indicated by lines 312.
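- The successive power-of-two sums may be illustrated with the following Python sketch, which is a software analogy and not the disclosed adder circuit. The function name adder_tree, its parameters, and the returned dictionary keyed by group size are assumptions for this example; index 0 of each list is taken to be the rightmost group.

```python
def adder_tree(mask, width=64, top=32):
    """Count asserted mask bits per group for group sizes 2, 4, ..., top."""
    sums = [(mask >> i) & 1 for i in range(width)]   # 1-bit inputs, LSB first
    levels = {}
    size = 1
    while size < top:
        size *= 2
        # Each adder sums the counts of two adjacent groups from the level below.
        sums = [sums[i] + sums[i + 1] for i in range(0, len(sums), 2)]
        levels[size] = sums
    return levels


print(adder_tree(0xCA, width=8, top=4))  # {2: [1, 1, 0, 2], 4: [2, 2]}
```

Every entry at the level for group size k lies in the range 0 to k, matching the output ranges given for the boxes labeled 2 through 32.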
- FIG. 9A shows a dual path structure 900 representing typical control and data flow through the tree of the mask path and the data path.
- a rightmost data path or mask path branch 902 is shown with a corresponding adder tree branch 903 .
- Binary shifters 904 are designated S 1 , S 2 , S 3 , S 4 and S 5 , with the numeral suffixes referring to both the stage and the number of levels of multiplexer employed.
- the binary shifters 904 receive data inputs from the left bit group, shown as “mask/data from other branch.”
- the binary shifters 904 receive control inputs from the adder result at the appropriate level of the tree, shown as the “s” (sum) output from adder blocks 906 .
- Each adder block 906 is designated as C 2 , C 4 , C 8 , C 16 and C 32 , with the numeral suffixes referring to the number of bit positions summed from the source mask for each bit group.
- a plurality of single-level multiplexers (M 1 ) 908 and 910 are fed by the binary shifters 904 and the previous stage data.
- the leftmost M 1 908 refers to the leftmost bit group while the rightmost M 1 910 refers to the rightmost bit group at each stage.
- the leftmost M 1 908 is collectively controlled by the adder carry bit, and selects either the unshifted data bits when carry is asserted, or the shifted data bits when carry is unasserted. Optimal timing for the carry path is obtained by using an adder design where the carry out is no slower than the next most significant bit.
- Each bit of the rightmost M 1 910 is individually controlled by each of the corresponding mask bits.
- the unshifted previous stage data bits are selected where mask bits are asserted and the left-shifted data bits are selected where mask bits are unasserted.
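- One shifter and multiplexer stage, with its carry-controlled leftmost multiplexers and mask-controlled rightmost multiplexers, may be sketched in Python as follows. The function name merge_stage and its argument layout are hypothetical, the group size g is assumed to be a power of two, and the right group's mask is assumed to be already sorted with its asserted bits right-justified.

```python
def merge_stage(left, right, right_mask, g):
    """Merge two g-bit groups into one 2g-bit group (g a power of two)."""
    total = bin(right_mask).count("1")        # adder output for the right group
    carry = total >> (g.bit_length() - 1)     # 1 only when all g mask bits are set
    s = total & (g - 1)                       # sum bits: left-shift amount 0..g-1
    shifted = left << s                       # left group aligned with right, then shifted
    low = (1 << g) - 1
    # Rightmost multiplexers: per bit, keep previous-stage data where the
    # mask bit is asserted, otherwise take the left-shifted data.
    right_out = (right & right_mask) | (shifted & ~right_mask & low)
    # Leftmost multiplexers: one common select driven by the carry bit.
    left_out = left if carry else (shifted >> g)
    return (left_out << g) | right_out


print(format(merge_stage(0b0001, 0b1111, 0b1111, 4), "08b"))  # 00011111
print(format(merge_stage(0b0001, 0b0000, 0b0000, 4), "08b"))  # 00000001
```

Because the carry case is handled by the leftmost multiplexers, the shifter in this model never needs to shift by the full group size, only by 0 to g - 1 positions.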
- FIG. 9B shows a detailed view of a shifter and multiplexer stage 950 suitable for use with data path structure 900 .
- For the zero-fill (.Z) form, each asserted extracted mask bit is used to generate the final result by selecting either its datapath value or logical zero.
- For the extension (.X) form, each asserted extracted mask bit is used to generate the final result by selecting either its datapath value or the most significant extracted bit (MSEB).
- the MSEB value is easily determined from the input values by finding the first asserted mask bit and selecting the data value, and can be done in parallel with the successive bit shifting mechanism.
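- As a minimal sketch of the MSEB determination, the following Python function reads the source bit under the highest asserted mask bit. The function name mseb is hypothetical; treating the “first” asserted mask bit as the one found when scanning from the most significant end, and returning zero for an all-zero mask, are assumptions made for this example.

```python
def mseb(src, mask):
    """Source bit located under the most significant asserted mask bit."""
    if mask == 0:
        return 0                        # assumed value when no bit is extracted
    top = mask.bit_length() - 1         # position of the highest asserted mask bit
    return (src >> top) & 1


print(mseb(0xB6, 0xCA))  # 1
```

The computation depends only on the instruction inputs, which is consistent with the statement that it can proceed in parallel with the shifting stages.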
- FIG. 10 shows a block diagram of circuitry 1000 suitable for performing a (.U) version of the bit rake instruction comprising adder tree blocks 1310, mask path blocks 1320 and data path blocks 1330.
- Inverse results are computed in parallel with this mechanism by bit reversing the source mask and data values, as well as logically inverting the source mask value, then using an identical mechanism that produces “raked” unmasked data values, which can be used in the final selection multiplexers 1002 .
- the inverse source and data values are provided through bit reversers 1004 and an inverter 1006 . Inclusion of logic to implement the .U instruction form doubles the physical size of the circuitry, but has negligible delay increase.
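- The reuse of an identical raking mechanism on bit-reversed inputs may be illustrated with the Python sketch below. The function names rake, reverse_bits and rake_u are hypothetical. The sketch assumes that the raked unmasked bits are bit-reversed again before the final selection, which in hardware would amount to wiring, and it uses a logical OR in place of the final selection multiplexers 1002.

```python
def rake(src, mask, width):
    """Pack the masked source bits into the LSB positions, zero filled."""
    out = n = 0
    for i in range(width):
        if (mask >> i) & 1:
            out |= ((src >> i) & 1) << n
            n += 1
    return out


def reverse_bits(x, width):
    """Reverse the order of the low `width` bits of x."""
    return int(format(x, "0{}b".format(width))[::-1], 2)


def rake_u(src, mask, width=32):
    """.U form: masked bits packed right, unmasked bits packed left."""
    full = (1 << width) - 1
    src &= full
    packed = rake(src, mask & full, width)
    inv_mask = ~mask & full                          # inverter on the mask
    raked = rake(reverse_bits(src, width),           # bit reversers on data
                 reverse_bits(inv_mask, width),      # and on the inverted mask
                 width)
    return packed | reverse_bits(raked, width)       # stands in for the final muxes


print(hex(rake_u(0xB6, 0xCA, width=8)))  # 0xe9
```

The two rake calls are independent of each other, which mirrors the parallel operation of the duplicated mechanism.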
Abstract
Description
- The present application claims the benefit of U.S. Provisional Application Serial No. 60/335,159 filed Nov. 1, 2001, which is incorporated by reference herein in its entirety.
- The present invention relates generally to improvements in computational processing. More specifically, the present invention relates to a system and method for providing a bit rake instruction to extract a pattern of bits from a source register.
- In many communications-related standards a need exists for an instruction that allows getting or putting several bits from or to a register without having to operate on one bit at a time through a series of bit load or bit store instructions. For example, in ADSL QAM encoding every other bit from a bit stream is packed together to create a two's complement integer. When performing puncturing in convolutional encoding, some of the encoder's output bits are omitted before transmission. In one puncturing technique, every fourth bit is removed. In another case,
- The present invention provides a programmable system and method for performing a bit rake instruction which extracts an arbitrary pattern of bits from a source register, based on a mask provided in another register, and packs and right justifies the bits into a target register. The bit rake instruction allows any set of bits from the source register to be packed together.
- A more complete understanding of the present invention, as well as further features and advantages of the invention, will be apparent from the following detailed description and the accompanying drawings.
- FIG. 1 illustrates an exemplary ManArray DSP and DMA subsystem appropriate for use with this invention;
- FIG. 2A shows an exemplary encoding of a bit rake instruction in accordance with the present invention;
- FIG. 2B shows an exemplary operation of a bit rake instruction in accordance with the present invention;
- FIG. 2C shows syntax and operation of a bit rake instruction in accordance with the present invention;
- FIGS. 3A and 3B show diagrams of a bit rake apparatus in accordance with the present invention;
- FIG. 4 shows the sorting of groups of asserted mask bits in accordance with the present invention;
- FIG. 5 shows a right-shift to left-shift example in accordance with the present invention;
- FIG. 6 shows a 3-level shifter in accordance with the present invention;
- FIG. 7 shows a data path diagram in accordance with the present invention;
- FIG. 8 shows an adder tree in accordance with the present invention;
- FIG. 9A shows a data path structure in accordance with the present invention;
- FIG. 9B shows a shifter and multiplexer stage in accordance with the present invention; and
- FIG. 10 shows a diagram of a bit rake instruction apparatus in accordance with the present invention.
- The present invention now will be described more fully with reference to the accompanying drawings, in which several presently preferred embodiments of the invention are shown. This invention may, however, be embodied in various forms and should not be construed as limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the invention to those skilled in the art.
- Further details of a presently preferred ManArray core, architecture, and instructions for use in conjunction with the present invention are found in U.S. patent application Ser. No. 08/885,310 filed Jun. 30, 1997, now U.S. Pat. No. 6,023,753, U.S. patent application Ser. No.08/949,122 filed Oct. 10, 1997, now U.S. Pat. No. 6,167,502, U.S. patent application Ser. No. 09/169,256 filed Oct. 9, 1998, now U.S. Pat. No. 6,167,501, U.S. patent application Ser. No. 09/169,072 filed Oct. 9, 1998, now U.S. Pat. No. 6,219,776, U.S. patent application Ser. No. 09/187,539 filed Nov. 6, 1998, now U.S. Pat. No. 6,151,668, U.S. patent application Ser. No. 09/205,558 filed Dec. 4, 1998, now U.S. Pat. No. 6,173,389, U.S. patent application Ser. No. 09/215,081 filed Dec. 18, 1998, now U.S. Pat. No. 6,101,592, U.S. patent application Ser. No. 09/228,374 filed Jan. 12, 1999, now U.S. Pat. No. 6,216,223, U.S. patent application Ser. No. 09/471,217 filed Dec. 23, 1999, now U.S. Pat. No. 6,260,082, U.S. patent application Ser. No. 09/472,372 filed Dec. 23, 1999, now U.S. Pat. No. 6,256,683, U.S. patent application Ser. No. 09/238,446 filed Jan. 28, 1999, U.S. patent application Ser. No. 09/267,570 filed Mar. 12, 1999, U.S. patent application Ser. No. 09/337,839 filed Jun. 22, 1999, U.S. patent application Ser. No. 09/350,191 filed Jul. 9, 1999, U.S. patent application Ser. No. 09/422,015 filed Oct. 21, 1999, U.S. patent application Ser. No. 09/432,705 filed Nov. 2, 1999, U.S. patent application Ser. No. 09/596,103 filed Jun. 16, 2000, U.S. patent application Ser. No. 09/598,567 filed Jun. 21, 2000, U.S. patent application Ser. No. 09/598,564 filed Jun. 21, 2000, U.S. patent application Ser. No. 09/598,566 filed Jun. 21, 2000, U.S. patent application Ser. No. 09/598,558 filed Jun. 21, 2000, U.S. patent application Ser. No. 09/598,084 filed Jun. 21, 2000, U.S. patent application Ser. No. 09/599,980 filed Jun. 22, 2000, U.S. patent application Ser. No. 09/711,218 filed Nov. 9, 2000, U.S. 
patent application Ser. No. 09/747,056 filed Dec. 12, 2000, U.S. patent application Ser. No. 09/853,989 filed May 11, 2001, U.S. patent application Ser. No. 09/886,855 filed Jun. 21, 2001, U.S. patent application Ser. No. 09/791,940 filed Feb. 23, 2001, U.S. patent application Ser. No. 09/792,819 filed Feb. 23, 2001, U.S. patent application Ser. No. 09/792,256 filed Feb. 23, 2001, U.S. patent application Ser. No. ______ entitled “Methods and Apparatus for Efficient Vocoder Implementations” filed Oct. 19, 2001, Provisional Application Serial No. 60/251,072 filed Dec. 4, 2000, Provisional Application Serial No. 60/281,523 filed Apr. 4, 2001, Provisional Application Serial No. 60/283,582 filed Apr. 13, 2001, Provisional Application Serial No. 60/287,270 filed Apr. 27, 2001, Provisional Application Serial No. 60/288,965 filed May 4, 2001, Provisional Application Serial No. 60/298,624 filed Jun. 15, 2001, Provisional Application Serial No. 60/298,695 filed Jun. 15, 2001, Provisional Application Serial No. 60/298,696 filed Jun. 15, 2001, Provisional Application Serial No. 60/318,745 filed Sep. 11, 2001, Provisional Application Serial No. ______ entitled “Methods and Apparatus for Video Coding” filed Oct. 30, 2001 all of which are assigned to the assignee of the present invention and incorporated by reference herein in their entirety.
- In a presently preferred embodiment of the present invention, a ManArray 2×2 iVLIW single instruction multiple data stream (SIMD) processor 100 as shown in FIG. 1 may be adapted as described further below for use in conjunction with the present invention. Processor 100 comprises a sequence processor (SP) controller combined with a processing element-0 (PE0) to form an SP/PE0 combined unit 101, as described in further detail in U.S. patent application Ser. No. 09/169,072 entitled “Methods and Apparatus for Dynamically Merging an Array Controller with an Array Processing Element”. Three additional PEs 151, 153, and 155 are also included. The SP/PE0 101 contains an instruction fetch (I-fetch) controller 103 to allow the fetching of “short” instruction words (SIW) or abbreviated-instruction words from a B-bit instruction memory 105, where B is determined by the application instruction-abbreviation process to be a reduced number of bits representing ManArray native instructions and/or to contain two or more abbreviated instructions as described in the present invention. If an instruction abbreviation apparatus is not used then B is determined by the SIW format. The fetch controller 103 provides the typical functions needed in a programmable processor, such as a program counter (PC), a branch capability, eventpoint loop operations (see U.S. Provisional Application Serial No. 60/140,245 entitled “Methods and Apparatus for Generalized Event Detection and Action Specification in a Processor” filed Jun. 21, 1999 for further details), and support for interrupts. It also provides the instruction memory control which could include an instruction cache if needed by an application. In addition, the I-fetch controller 103 controls the dispatch of instruction words and instruction control information to the other PEs in the system by means of a D-bit instruction bus 102. D is determined by the implementation; for the exemplary ManArray coprocessor, D=32 bits. The instruction bus 102 may include additional control signals as needed in an abbreviated-instruction translation apparatus.
- In this exemplary system 100, common elements are used throughout to simplify the explanation, though actual implementations are not limited to this restriction. For example, the execution units 131 in the combined SP/PE0 101 can be separated into a set of execution units optimized for the control function; for example, fixed point execution units in the SP, and the PE0 as well as the other PEs can be optimized for a floating point application. For the purposes of this description, it is assumed that the execution units 131 are of the same type in the SP/PE0 and the PEs. In a similar manner, SP/PE0 and the other PEs use a five instruction slot iVLIW architecture which contains a VLIW instruction memory (VIM) 109 and an instruction decode and VIM controller functional unit 107 which receives instructions as dispatched from the SP/PE0's I-fetch unit 103 and generates VIM addresses and control signals 108 required to access the iVLIWs stored in the VIM. Referenced instruction types are identified by the letters SLAMD in VIM 109, where the letters are matched up with instruction types as follows: Store (S), Load (L), ALU (A), MAU (M), and DSU (D).
- The basic concept of loading the iVLIWs is described in further detail in U.S. patent application Ser. No. 09/187,539 entitled “Methods and Apparatus for Efficient Synchronous MIMD Operations with iVLIW PE-to-PE Communication”. Also contained in the SP/PE0 and the other PEs is a common PE
configurable register file 127 which is described in further detail in U.S. patent application Ser. No. 09/169,255 entitled “Method and Apparatus for Dynamic Instruction Controlled Reconfiguration Register File with Extended Precision”. Due to the combined nature of the SP/PE0, the data memory interface controller 125 must handle the data processing needs of both the SP controller, with SP data in memory 121, and PE0, with PE0 data in memory 123. The SP/PE0 controller 125 also is the controlling point of the data that is sent over the 32-bit or 64-bit broadcast data bus 126. The other PEs, 151, 153, and 155 contain common physical data memory units 123′, 123″, and 123′″ though the data stored in them is generally different as required by the local processing done on each PE. The interface to these PE data memories is also a common design in PEs bus interface logic cluster switch 171 various aspects of which are described in greater detail in U.S. patent application Ser. No. 08/885,310 entitled “Manifold Array Processor”, now U.S. Pat. No. 6,023,753, and U.S. patent application Ser. No. 08/949,122 entitled “Methods and Apparatus for Manifold Array Processing”, and U.S. patent application Ser. No. 09/169,256 entitled “Methods and Apparatus for ManArray PE-to-PE Switch Control”. The interface to a host processor, other peripheral devices, and/or external memory can be done in many ways. For completeness, a primary interface mechanism is contained in a direct memory access (DMA) control unit 181 that provides a scalable ManArray data bus 183 that connects to devices and interface units external to the ManArray core. The DMA control unit 181 provides the data flow and bus arbitration mechanisms needed for these external devices to interface to the ManArray core memories via the multiplexed bus interface represented by line 185. A high level view of a ManArray control bus (MCB) 191 is also shown in FIG. 1.
- As seen in instruction format 200 of FIG. 2A, a bit rake instruction operating as shown in diagram 220 of FIG. 2B copies all bits selected by a mask register, such as Rye, from a source register, such as Rxe, and packs the bits into the least significant bit (LSB) positions of a target register, such as Rte. FIG. 2C shows a block diagram 250 of exemplary syntax and operation of a bit rake instruction in accordance with the present invention. For the doubleword .1D version 255, the high order bits of Rte may be set to zero (.Z), to the most significant bit (MSB) of the extracted field (.X), or to the un-extracted (unmasked) Rxe bits (.U). Rye contains ‘1’s in the bit positions that are copied from Rxe to the LSB positions of Rte. Rye contains ‘0’s at the bit positions that are either copied from Rxe to the MSB positions of Rte, or are ignored. Thus, in a preferred embodiment, Rxe, Rye and Rte are the same size. The syntax and operation of the word .1W version 260 of a bit rake instruction is also shown in FIG. 2C.
- As seen in the example shown in FIG. 2B, the lower case letters (a-f) represent unmasked source bit regions and the upper case letters (S, A-J) represent the masked source bits. S and A-J are merged toward the right, and either the unmasked source bit regions (a-f) are merged toward the left, or zero or the most significant extracted bit (S) is extended toward the left. Utilizing the syntax shown in FIG. 2C, such an instruction could be written as:
- BITRAKE.[SP]A.1D.[UXZ] Rte, Rxe, Rye
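- By way of illustration only, the register-level behavior described above may be modeled in software as follows. This is a minimal bit-serial sketch, not the disclosed hardware. The function name `bit_rake`, its parameters, and the zero fill used by the .X form when no mask bit is asserted are assumptions introduced for the example. The .U form is read here as packing the unmasked bits, in their original order, into the positions above the extracted field.

```python
def bit_rake(rx, ry, mode="Z", width=64):
    """Hypothetical reference model of the bit rake operation.

    rx   : source value (Rxe)
    ry   : mask value (Rye); '1' bits select source bits to extract
    mode : "Z" zero fill, "X" extend most significant extracted bit,
           "U" fill with the unmasked source bits
    """
    full = (1 << width) - 1
    rx &= full
    ry &= full
    packed = count = 0        # extracted bits, packed toward the LSB
    unmasked = ucount = 0     # remaining bits, kept in source order
    for i in range(width):    # scan from LSB to MSB
        bit = (rx >> i) & 1
        if (ry >> i) & 1:
            packed |= bit << count
            count += 1
        else:
            unmasked |= bit << ucount
            ucount += 1
    if mode == "Z":
        high = 0
    elif mode == "X":
        # Assumption: with an all-zero mask there is no extracted bit,
        # so this sketch falls back to zero fill.
        msb = (packed >> (count - 1)) & 1 if count else 0
        high = full if msb else 0
    elif mode == "U":
        high = unmasked
    else:
        raise ValueError("mode must be 'Z', 'X' or 'U'")
    return (packed | (high << count)) & full
```

- For an 8-bit source 10110110 and mask 11001010, this model extracts the field 1001 and returns 00001001 for .Z, 11111001 for .X, and 11101001 for .U.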
- Further variations could also be generalized to dual 32-bit as well as other data
- The present invention includes a technique which segments the implementation of a bit rake instruction into multiple simpler problems which are more easily solved. The segmentation technique includes both temporal and spatial aspects. Multiple successive stages are employed with each stage building on the previous stage's result. Information flows through the stages temporally. Information at each stage is partitioned into multiple independent information groups, thereby improving operation concurrency spatially. As information advances through the stages, the number of independent information groups decreases while the size of each group increases. As the group size increases, so does the regularity of the information within, allowing increasingly efficient data movement at each successive stage.
- FIG. 3A shows a block diagram of a bit rake apparatus 300 in accordance with the present invention. As seen in FIG. 3A, the present invention may suitably include three primary functional blocks: an adder tree block 310, a mask path block 320 and a data path block 330, each comprising a plurality of stages. The adder tree 310 computes the sum of the number of mask bits in each of the groups for all power-of-two group sizes. The adder tree block 310 comprises a plurality of adder stages, with each adder's sum and carry output providing control to the corresponding mask path block 320 and data path block 330. The mask path block 320 provides individual group masks at each stage for use in controlling the selection of data in the data path block 330.
- As described in greater detail below, data and mask movement in the mask path block 320 and data path block 330 utilizes a binary shifter followed by a multiplexer. The depth of the binary shifter increases by one multiplexer level with each stage advance. Shifting amounts and group sizes are restricted to powers of two to maintain minimal propagation delays through shifters, and yield the most efficient adder sizing.
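- The role of the adder tree, namely counting the asserted mask bits in every group for each power-of-two group size, may be sketched in software as shown below. The function name `adder_tree` and the dictionary layout of its result are assumptions made for this illustration, and the width is assumed to be a power of two. The sketch also returns the total for the full width, which is included only for completeness.

```python
def adder_tree(mask, width=64):
    """Hypothetical model: per-group counts of asserted mask bits.

    Returns a dict mapping group size (2, 4, 8, ...) to a list of counts,
    where index 0 is the rightmost (least significant) group.
    """
    # Level 0: each mask bit is its own 1-bit "group".
    level = [(mask >> i) & 1 for i in range(width)]
    sums = {}
    size = 1
    while size < width:
        size *= 2
        # Each adder sums the counts of two adjacent smaller groups.
        level = [level[i] + level[i + 1] for i in range(0, len(level), 2)]
        sums[size] = level
    return sums
```

- For the 8-bit mask 11001010, the sketch yields counts of [1, 1, 0, 2] for the 2-bit groups (rightmost first), [2, 2] for the 4-bit groups, and [4] for the full 8 bits.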
- Propagation delays through the three primary functional blocks 310, 320 and 330 and their inter-block controls 340 and 350 are preferably balanced. Results at each stage in all three blocks proceed through their paths in unison. Depending upon the implementation and technology process, the adder stage may include a slightly longer or shorter delay. Balancing the propagation delay aids in minimizing the overall critical timing path propagation delay.
- FIG. 3B shows a detailed view of the bit rake apparatus 300. As seen in FIG. 3B and described in greater detail below, the data path block 330 is controlled by the adder tree 310 and the mask path block 320. The numbers in the adder boxes in the bit adder tree 310 refer to the maximum value of the sum of the inputs. Consequently, the output of each adder block has a maximum value which is a power of two. The mask path block 320 is controlled by the adder tree block 310. It is noted that depending upon the implementation and circuit technology chosen, the first several levels of the adder tree block 310, mask path block 320 and data path 330 may undergo logic reduction to result in a more efficient gate usage and minimal delay, yet maintain the same functionality.
- The following provides an example describing the data movement through the stages in a right-shifting fashion, showing how data moves from a programmer's perspective. Next, it is shown that by reorienting portions of the information, left shifting, and using the normally occurring carry outputs from the adder tree, a more efficient data movement mechanism, with reduced size and delay, is produced. After the basic extraction mechanism is described for extracting all of the masked data, a description is given for how to also generate the extraction of the unmasked bits.
- FIG. 4 includes an exemplary diagram 400 showing how a 64-bit result may be obtained by successively sorting groups of asserted mask bits, such as mask bits contained in register Rye, in increasing powers-of-two sizes, starting with smaller groups, and progressively increasing the group size through an input 402 and a series of stages 404, 406, 408, 410, 412 and 414.
- By sorting in powers-of-two as shown in FIG. 4, a binary shifter of increasing size can be used at each level to provide an efficient realignment of bits, with little control logic cost or delay. In the present context, a binary shifter may include a shifter with only power-of-two shift amounts, and shifts in only one direction. Input 402 shows a field of 64 bits. The “1”s represent asserted mask bits. Data movement from input 402 to stage 404 involves combining the 64 bits into 32 groups containing 2 bits each. Each adjacent pair of bits is combined into a 2-bit group by moving the “1” bits to the right. For example “00” becomes “00”, “01” becomes “01”, “10” becomes “01”, and “11” becomes “11”. Two mask bit movements occur in the transition from input 402 to stage 404.
- Stage 404 shows 32 groups of 2-bit fields. Data movement from stage 404 to stage 406 involves utilizing sixteen adjacent pairs of 2-bit groups. In each of these sixteen group pairs, using the number of unasserted mask bits in the right group of each pair, the left group is shifted that amount to the right. As an example in stage 404, bits 404a have one “0” in the right group causing the left group of 2 bits to shift right 1 position. The “1” bit in the right group is retained, and becomes the rightmost bit in the resulting group of 4 bits (0011). The middle 2 bits (01) are from the shifted left group, and the remaining, leftmost bit is “0” filled by the mechanism.
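- The pairwise rule just described, in which the left group of each pair is shifted right by the number of unasserted mask bits in the right group, may be illustrated by the following software sketch of the mask movement. The function name `sort_mask_stages` is an assumption introduced for the example, the width is assumed to be a power of two, and the loop is a sequential stand-in for what the hardware performs on all groups of a stage concurrently.

```python
def sort_mask_stages(mask, width=64):
    """Hypothetical model of the right-shifting mask sort.

    Returns the list [input, stage 1, stage 2, ...]; in every stage each
    group holds its asserted bits packed toward the right.
    """
    stages = [mask]
    cur = mask
    size = 1                                  # size of each group before merging
    while size < width:
        group = (1 << size) - 1
        nxt = 0
        for base in range(0, width, 2 * size):
            right = (cur >> base) & group
            left = (cur >> (base + size)) & group
            zeros = size - bin(right).count("1")   # unasserted bits on the right
            # Shift the left group right by 'zeros' so it abuts the retained '1's.
            merged = right | ((left << size) >> zeros)
            nxt |= merged << base
        cur = nxt
        stages.append(cur)
        size *= 2
    return stages
```

- With a 4-bit input of 0101, the sketch produces 0101 after the first stage and 0011 after the second, matching the 404a example above. For a 64-bit input the returned list holds the input followed by six stages.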
- Stage 406 shows 16 groups of 4-bit fields. Data movement from stage 406 to stage 408 involves utilizing 8 adjacent pairs of 4-bit groups. In each of these 8 pairs in stage 406, the left group is shifted to the right by the number of unasserted mask bits in the right group. Any “1” bits in the right group are retained, and zeros are filled on the left according to the shift amount. As an example in stage 406, bits 406a are a right group of bits in which all 4 bits are asserted (1111). Since all of the bits are asserted, in moving from stage 406 to stage 408, the left group of bits (0001) is not shifted (shift amount equals zero) and is combined with the right group to form 00011111. Bits 406b are a right group of bits in which all 4 bits are unasserted (0000). Since all of the bits are unasserted, in moving from stage 406 to stage 408, the left group of bits (0001) is shifted 4 positions and combined with the right group to form 00000001.
- Stage 408 shows 8 groups of 8-bit fields. Data movement from stage 408 to stage 410 involves 4 adjacent pairs of 8-bit groups. In each of these 4 pairs in stage 408, the left group is shifted to the right by the number of unasserted mask bits in the right group. Any “1” bits in the right group are retained, and zeros are filled on the left according to the shift amount.
- Stage 410 shows 4 groups of 16-bit fields. Data movement from stage 410 to stage 412 involves 2 adjacent pairs of 16-bit groups. In each of these 2 pairs in stage 410, the left group is shifted to the right by the number of unasserted mask bits in the right group. Any “1” bits in the right group are retained, and zeros are filled on the left according to the shift amount.
- Stage 412 shows 2 groups of 32-bit fields. Data movement from stage 412 to stage 414 involves both 32-bit groups. The left group is shifted to the right by the number of unasserted mask bits in the right group. Any “1” bits in the right group are retained, and zeros are filled on the left according to the shift amount.
- In the example shown in FIG. 4, the number of unasserted mask bits was computed and used to determine the amount to shift right. However, in an alternate embodiment of the present invention, a functionally equivalent alternative technique is utilized to count the number of asserted mask bits and left-shift a repositioned left group. This technique is described in further detail below and shown in FIG. 5 which shows a diagram 500 of an exemplary right-shift to left-shift in accordance with the present invention. A pair of 4-bit groups 502 and 504 is shown generically as ABCD and WXYZ, respectively. Five cases are shown. A shift right (SHR) column 516 and a shift left (SHL) column 518 border the result column 520 containing 8-bit data patterns for each case. The SHR column 516 shows how the left group is shifted to the right and “0”-filled to the left by an amount equal to the number of “0” bits in the right group. The shifted left group 502 is then merged with the “1” bits in the right group 504. The SHL column 518 describes how the left group 502 is repositioned 4 bits to the right, aligning it exactly with the right group 504, and then shifted to the left by an amount equal to the number of “1” bits in the right group 504. As described above, the shifted left group 502 is merged with the “1” bits in the right group, and zero-filled to the left as required.
- To obtain the results shown in the results column 520, the right group requires a binary shifter followed by a 2:1 multiplexer to perform the merge with the “1” bits, while the left group requires only the binary shifter output. Therefore, the left group can tolerate an additional multiplexer delay without increasing overall stage delay. Further details are shown in FIGS. 9A and 9B and described in greater detail below. Using this additional left-group multiplexer under control of the adder carry bit to accomplish the SHL4 data movement, a left shifter with only 2 levels of multiplexer delay (SHL = 00, 01, 10, 11) instead of 3 may be utilized. Shifting left by 4 is not needed, reducing the number of logic levels for binary shifters in each stage.
- FIG. 6 shows a left binary shifter 600 in accordance with the present invention, where the blocks, for example block 610, are two-to-one multiplexers. Unlike the previous discussion where only a 2-bit left shifter was required, this shifter is used in the next successive stage where each 8-bit field is left shifted from 0 to 7 positions. In other words, the example shown in FIG. 5 corresponds to the transition from stage 406 to stage 408 in FIG. 4, and the binary shifter 600 of FIG. 6 corresponds to the transition from stage 408 to stage 410 in FIG. 4. The S2, S1 and S0 inputs, which control the shift amount, are provided from an appropriate adder tree sum output.
- The mask extraction mechanism described above for asserted mask bits from Rye may be applied similarly to the data bits from Rxe. FIG. 7 shows a data path tree diagram 700, superimposed over the data fields, in accordance with the present invention. Each numbered box 702 of FIG. 7 represents the logic to shift and align data. For clarity of illustration, only a single box is associated with an element number. The shift amount and mask bits control the data path at each stage. The mask path directly determines which data bits are to be used. In contrast to the mask path, where the mask bits were retained in the rightmost pair of groups, for the data path the rightmost data bits retain their previous stage's data value when their corresponding mask bits are asserted, and merge the left group's shifted data based upon the corresponding shift amount, as described in greater detail below with respect to FIGS. 9A and 9B.
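- The data path behavior described above, in which right-group data bits are kept where their sorted mask bits are asserted and the left group's data is otherwise merged in after a left shift by the count of asserted right-group mask bits, may be sketched in software as follows. The function name `rake_data` and its variables are assumptions made for the example. The sketch follows the left-shifting formulation, lets positions above the extracted field carry don't-care values, and applies a zero fill only at the end. It is one reading of the mechanism, not a gate-level description.

```python
def rake_data(rx, ry, width=64):
    """Hypothetical staged model of the data path (zero-fill result)."""
    full = (1 << width) - 1
    data, mask = rx & full, ry & full
    size = 1                                   # group size before merging
    while size < width:
        group = (1 << size) - 1
        pair = (1 << (2 * size)) - 1
        new_data = new_mask = 0
        for base in range(0, width, 2 * size):
            r_mask = (mask >> base) & group            # sorted right-group mask
            l_mask = (mask >> (base + size)) & group   # sorted left-group mask
            r_data = (data >> base) & group
            l_data = (data >> (base + size)) & group
            k = bin(r_mask).count("1")                 # adder tree sum for right group
            # Left group is aligned with the right group, then shifted left by k.
            # When k == size (carry asserted) it ends up back in its own position.
            shifted = (l_data << k) & pair
            merged_data = (r_data & r_mask) | (shifted & ~r_mask)
            merged_mask = r_mask | ((l_mask << k) & pair)
            new_data |= merged_data << base
            new_mask |= merged_mask << base
        data, mask = new_data, new_mask
        size *= 2
    # Final selection: keep datapath bits only where the extracted mask is set.
    return data & mask
```

- As a usage example, an 8-bit source 10110110 with mask 11001010 yields 00001001, and a 64-bit source 0x0123456789ABCDEF with mask 0x0F0F0F0F0F0F0F0F yields 0x13579BDF.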
- The binary shift amounts controlling the mask path and data path are generated from the Rye source. An adder tree 800, shown in FIG. 8 superimposed over the data fields, computes successive sums of bits on a power-of-two basis from 2-bit groups up to the larger 32-bit group for the adder tree functional block. In FIG. 8, each box labeled as 2 designates an addition of two 1-bit numbers, and has an output range from 0 to 2. Each box labeled as 4 designates an addition of two 2-bit numbers, and has an output range from 0 to 4. Each box labeled as 8 designates an addition of two 3-bit numbers, and has an output range from 0 to 8. Each box labeled as 16 designates an addition of two 4-bit numbers, and has an output range from 0 to 16. Each box labeled as 32 designates an addition of two 5-bit numbers, and has an output range from 0 to 32. Most of the intermediate sums as well as the final sum are utilized to provide controlling data at each stage, as seen in FIG. 3B and indicated by lines 312.
- FIG. 9A shows a dual path structure 900 representing typical control and data flow through the tree of the mask path and the data path. A rightmost data path or mask path branch 902 is shown with a corresponding adder tree branch 903. Binary shifters 904 are designated S1, S2, S3, S4 and S5, with the numeral suffixes referring to both the stage and the number of levels of multiplexer employed. The binary shifters 904 receive data inputs from the left bit group, shown as “mask/data from other branch.” The binary shifters 904 receive control inputs from the adder result at the appropriate level of the tree, shown as the “s” (sum) output from adder blocks 906. Each adder block 906 is designated as C2, C4, C8, C16 and C32, with the numeral suffixes referring to the number of bit positions summed from the source mask for each bit group.
- A plurality of single-level multiplexers (M1) 908 and 910 are fed by the binary shifters 904 and the previous stage data. The leftmost M1 908 refers to the leftmost bit group while the rightmost M1 910 refers to the rightmost bit group at each stage. The leftmost M1 908 is collectively controlled by the adder carry bit, and selects either the unshifted data bits when carry is asserted, or the shifted data bits when carry is unasserted. Optimal timing for the carry path is obtained by using an adder design where the carry out is no slower than the next most significant bit. Each bit of the rightmost M1 910 is individually controlled by each of the corresponding mask bits. The unshifted previous stage data bits are selected where mask bits are asserted and the left-shifted data bits are selected where mask bits are unasserted.
- FIG. 9B shows a detailed view of a shifter and multiplexer stage 950 suitable for use with data path structure 900. “L” refers to leftmost bit group and “R” refers to rightmost bit group, with “n”=2, 4, 8, 16, 32, as shown in FIG. 9A.
- For the zero-fill version (.Z) of the bit rake instruction, each extracted mask bit is used to generate the final result by selecting either its datapath value, when the mask bit is asserted, or logical zero. For the most significant extracted bit (MSEB) version (.X) of this instruction, each extracted mask bit is used to generate the final result by selecting either its datapath value, when the mask bit is asserted, or the MSEB. The MSEB value is easily determined from the input values by finding the first (most significant) asserted mask bit and selecting the corresponding data value, and can be done in parallel with the successive bit shifting mechanism.
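- The final selection for the .Z and .X forms may be illustrated as shown below. The names `most_significant_extracted_bit` and `final_select` are hypothetical, and returning zero when no mask bit is asserted is an assumption of this sketch rather than something stated above. The inputs `raked` and `sorted_mask` stand for the data path and mask path outputs of the last stage.

```python
def most_significant_extracted_bit(rx, ry):
    """Data bit at the most significant asserted mask position (MSEB)."""
    if ry == 0:
        return 0                      # assumption: no extracted bits, use 0
    top = ry.bit_length() - 1         # index of the highest asserted mask bit
    return (rx >> top) & 1


def final_select(raked, sorted_mask, fill_bit, width=64):
    """Per-bit 2:1 selection between the datapath value and a fill value.

    sorted_mask has its asserted bits packed toward the LSB; fill_bit is 0
    for the .Z form or the MSEB for the .X form.
    """
    full = (1 << width) - 1
    fill = full if fill_bit else 0
    return (raked & sorted_mask) | (fill & ~sorted_mask & full)
```

- For example, with an 8-bit source 10110110 and mask 11001010, the MSEB is the source bit at position 7, which is 1. Selecting against the sorted mask 00001111 then gives 11111001 for .X and 00001001 for .Z, regardless of the don't-care values carried above the extracted field.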
- For the version of this instruction (.U), which also sorts the unmasked bits, each extracted mask bit is used to generate the final result by selecting either its datapath value, when the mask bit is asserted, or the inverse result value. FIG. 10 shows a block diagram of circuitry 1000 suitable for performing a (.U) version of the bit rake instruction comprising adder tree blocks 1310, mask path blocks 1320 and data path blocks 1330. Inverse results are computed in parallel with this mechanism by bit reversing the source mask and data values, as well as logically inverting the source mask value, then using an identical mechanism that produces “raked” unmasked data values, which can be used in the final selection multiplexers 1002. The inverse source and data values are provided through bit reversers 1004 and an inverter 1006. Inclusion of logic to implement the .U instruction form doubles the physical size of the circuitry, but adds negligible delay.
- It will be apparent to those skilled in the art that various modifications and variations can be made in the present invention without departing from the spirit and scope of the present invention. Thus, it is intended that the present invention cover the modifications and variations of this invention provided they come within the scope of the appended claims and their equivalents.
Claims (31)
Priority Applications (3)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US10/282,919 US20030105945A1 (en) | 2001-11-01 | 2002-10-29 | Methods and apparatus for a bit rake instruction |
US12/021,538 US7836317B2 (en) | 2000-05-12 | 2008-01-29 | Methods and apparatus for power control in a scalable array of processor elements |
US12/239,920 US7685408B2 (en) | 2001-11-01 | 2008-09-29 | Methods and apparatus for extracting bits of a source register based on a mask and right justifying the bits into a target register |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US33515901P | 2001-11-01 | 2001-11-01 | |
US10/282,919 US20030105945A1 (en) | 2001-11-01 | 2002-10-29 | Methods and apparatus for a bit rake instruction |
Related Parent Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US09/853,989 Continuation-In-Part US6845445B2 (en) | 2000-05-12 | 2001-05-11 | Methods and apparatus for power control in a scalable array of processor elements |
Related Child Applications (3)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US12/021,538 Continuation-In-Part US7836317B2 (en) | 2000-05-12 | 2008-01-29 | Methods and apparatus for power control in a scalable array of processor elements |
US12/021,538 Continuation US7836317B2 (en) | 2000-05-12 | 2008-01-29 | Methods and apparatus for power control in a scalable array of processor elements |
US12/239,920 Continuation US7685408B2 (en) | 2001-11-01 | 2008-09-29 | Methods and apparatus for extracting bits of a source register based on a mask and right justifying the bits into a target register |
Publications (1)
Publication Number | Publication Date |
---|---|
US20030105945A1 (en) | 2003-06-05 |
Family
ID=26961751
Family Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US10/282,919 Abandoned US20030105945A1 (en) | 2000-05-12 | 2002-10-29 | Methods and apparatus for a bit rake instruction |
US12/239,920 Expired - Fee Related US7685408B2 (en) | 2001-11-01 | 2008-09-29 | Methods and apparatus for extracting bits of a source register based on a mask and right justifying the bits into a target register |
Family Applications After (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US12/239,920 Expired - Fee Related US7685408B2 (en) | 2001-11-01 | 2008-09-29 | Methods and apparatus for extracting bits of a source register based on a mask and right justifying the bits into a target register |
Country Status (1)
Country | Link |
---|---|
US (2) | US20030105945A1 (en) |
- 2002-10-29: US 10/282,919, US20030105945A1 (not active, Abandoned)
- 2008-09-29: US 12/239,920, US7685408B2 (not active, Expired - Fee Related)
Cited By (18)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20060023517A1 (en) * | 2004-07-27 | 2006-02-02 | Texas Instruments Incorporated | Method and system for dynamic address translation |
US9495724B2 (en) * | 2006-10-31 | 2016-11-15 | International Business Machines Corporation | Single precision vector permute immediate with “word” vector write mask |
US20080114824A1 (en) * | 2006-10-31 | 2008-05-15 | Eric Oliver Mejdrich | Single Precision Vector Permute Immediate with "Word" Vector Write Mask |
US8332452B2 (en) | 2006-10-31 | 2012-12-11 | International Business Machines Corporation | Single precision vector dot product with “word” vector write mask |
US20080100628A1 (en) * | 2006-10-31 | 2008-05-01 | International Business Machines Corporation | Single Precision Vector Permute Immediate with "Word" Vector Write Mask |
US20080114826A1 (en) * | 2006-10-31 | 2008-05-15 | Eric Oliver Mejdrich | Single Precision Vector Dot Product with "Word" Vector Write Mask |
US9639362B2 (en) | 2011-03-30 | 2017-05-02 | Nxp Usa, Inc. | Integrated circuit device and methods of performing bit manipulation therefor |
CN104025020A (en) * | 2011-12-23 | 2014-09-03 | Intel Corporation | Systems, apparatuses, and methods for performing mask bit compression |
US9983873B2 (en) | 2011-12-23 | 2018-05-29 | Intel Corporation | Systems, apparatuses, and methods for performing mask bit compression |
CN104137053A (en) * | 2011-12-23 | 2014-11-05 | Intel Corporation | Systems, apparatuses, and methods for performing a butterfly horizontal and cross add or substract in response to a single instruction |
US20160094241A1 (en) * | 2014-09-26 | 2016-03-31 | Intel Corporation | Apparatus and method for vector compression |
US9929745B2 (en) * | 2014-09-26 | 2018-03-27 | Intel Corporation | Apparatus and method for vector compression |
US10623015B2 (en) | 2014-09-26 | 2020-04-14 | Intel Corporation | Apparatus and method for vector compression |
EP3238035A4 (en) * | 2014-12-27 | 2018-08-29 | Intel Corporation | Method and apparatus for performing a vector bit shuffle |
US10296489B2 (en) | 2014-12-27 | 2019-05-21 | Intel Corporation | Method and apparatus for performing a vector bit shuffle |
EP3736689A1 (en) * | 2014-12-27 | 2020-11-11 | INTEL Corporation | Method and apparatus for performing a vector bit shuffle |
WO2019046710A1 (en) * | 2017-08-31 | 2019-03-07 | MIPS Tech, LLC | Unified logic for aliased processor instructions |
US10846089B2 (en) | 2017-08-31 | 2020-11-24 | MIPS Tech, LLC | Unified logic for aliased processor instructions |
Also Published As
Publication number | Publication date |
---|---|
US7685408B2 (en) | 2010-03-23 |
US20090019269A1 (en) | 2009-01-15 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US7685408B2 (en) | | Methods and apparatus for extracting bits of a source register based on a mask and right justifying the bits into a target register |
KR100348951B1 (en) | | Memory store from a register pair conditional |
US5680339A (en) | | Method for rounding using redundant coded multiply result |
US6438569B1 (en) | | Sums of production datapath |
US5596763A (en) | | Three input arithmetic logic unit forming mixed arithmetic and boolean combinations |
US5485411A (en) | | Three input arithmetic logic unit forming the sum of a first input anded with a first boolean combination of a second input and a third input plus a second boolean combination of the second and third inputs |
US5446651A (en) | | Split multiply operation |
US5995748A (en) | | Three input arithmetic logic unit with shifter and/or mask generator |
US5509129A (en) | | Long instruction word controlling plural independent processor operations |
US5606677A (en) | | Packed word pair multiply operation forming output including most significant bits of product and other bits of one input |
US5805913A (en) | | Arithmetic logic unit with conditional register source selection |
US5465224A (en) | | Three input arithmetic logic unit forming the sum of a first Boolean combination of first, second and third inputs plus a second Boolean combination of first, second and third inputs |
US6098163A (en) | | Three input arithmetic logic unit with shifter |
US5640578A (en) | | Arithmetic logic unit having plural independent sections and register storing resultant indicator bit from every section |
US5960193A (en) | | Apparatus and system for sum of plural absolute differences |
US5922066A (en) | | Multifunction data aligner in wide data width processor |
US6016538A (en) | | Method, apparatus and system forming the sum of data in plural equal sections of a single data word |
US6986025B2 (en) | | Conditional execution per lane |
US5493524A (en) | | Three input arithmetic logic unit employing carry propagate logic |
US6067613A (en) | | Rotation register for orthogonal data transformation |
KR100346515B1 (en) | | Temporary pipeline register file for a superpipe lined superscalar processor |
US5596519A (en) | | Iterative division apparatus, system and method employing left most one's detection and left most one's detection with exclusive OR |
WO2001035224A1 (en) | | Bit-serial memory access with wide processing elements for simd arrays |
US5442581A (en) | | Iterative division apparatus, system and method forming plural quotient bits per iteration |
US5512896A (en) | | Huffman encoding method, circuit and system employing most significant bit change for size detection |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| AS | Assignment | Owner name: BOPS, INC., CALIFORNIA. Free format text: ASSIGNMENT OF ASSIGNORS INTEREST; ASSIGNORS: WOLFF, EDWARD A.; MOLNAR, PETER R.; ELEZABI, AYMAN; AND OTHERS; REEL/FRAME: 013723/0367. Effective date: 20021211 |
| AS | Assignment | Owner name: ALTERA CORPORATION, CALIFORNIA. Free format text: ASSIGNMENT OF ASSIGNORS INTEREST; ASSIGNOR: BOPS, INC.; REEL/FRAME: 014683/0894. Effective date: 20030407 |
| AS | Assignment | Owner name: PTS CORPORATION, CALIFORNIA. Free format text: ASSIGNMENT OF ASSIGNORS INTEREST; ASSIGNOR: ALTERA CORPORATION; REEL/FRAME: 014683/0914. Effective date: 20030407 |
| AS | Assignment | Owner name: ALTERA CORPORATION, CALIFORNIA. Free format text: ASSIGNMENT OF ASSIGNORS INTEREST; ASSIGNOR: PTS CORPORATION; REEL/FRAME: 018184/0423. Effective date: 20060824 |
| STCB | Information on status: application discontinuation | Free format text: ABANDONED -- AFTER EXAMINER'S ANSWER OR BOARD OF APPEALS DECISION |