WO2001009717A1

WO2001009717A1 - Video digital signal processor chip

Info

Publication number: WO2001009717A1
Application number: PCT/US2000/021191
Authority: WO
Inventors: Steven G. Morton
Original assignee: Morton Steven G
Priority date: 1999-08-02
Filing date: 2000-08-02
Publication date: 2001-02-08
Also published as: AU6756500A

Abstract

A digital signal processor device (10) includes a data cache (11); a scalar arithmetic unit (26) coupled to the data cache; and a parallel processing unit (20) coupled to the data cache, where the parallel processing unit includes an n row by m column array of parallel arithmetic units (23), and n pattern matcher coprocessors (24) individual ones of which are coupled to one of said n rows of parallel arithmetic units. The device can be programmed using one-dimensional and two-dimensional parallel data types, and a program for the device can be made using a method that defines an amount of parallelism, and then using a compiler to produce code having the same amount, or a lesser amount, of parallelism than used to define the program.

Description

Video Digital Signal Processor Chip

FIELD OF THE INVENTION:

This invention relates generally to digital data processors and, in particular, to digital data processors that are implemented as integrated circuits to process input data in parallel, as well as to techniques for programming such data processors.

BACKGROUND OF THE INVENTION:

Digital signal processor (DSP) devices are well known in the art. Such devices are typically used to process data in real time^ and can be found in communications devices, image processors, video processors, and pattern recognition processors.

One drawback to many conventional DSPs is their lack of parallelization, that is, an ability to apply multiple processors in parallel to the execution of desired operations on a given data set. As can be appreciated, the parallel execution of a plurality processors can yield significant increases in processing speed, so long as the multiple processors are properly controlled and synchronized.

In U.S. Patent No.: 5,822,606 the inventor described an improved digital signal processor device that overcame many of the deficiencies of prior art digital signal processors, such as with regard to their lack of adequate parallelization. This DSP is referred to in the subsequent description of the invention as the A236 chip, or simply as the A236, and the instant invention is described herein in part by highlighting the improvements made to the original A236 chip architecture, instruction set, and programming model.

OBJECTS AND ADVANTAGES OF THE INVENTION:

It is a first object and advantage of this invention to provide an even further improved digital signal processor device. It is a further object and advantage of this invention to provide a technique for programming the improved DSP.

It is another object and advantage of this invention to provide a method for programming a parallel processing device using one- and two-dimensional parallel data types.

It is another object and advantage of this invention to provide a method for programming a parallel processing device using a general purpose method for specifying the parallelism, and using a compiler to produce code for a device having the same amount, or a lesser amount, of parallelism than used to define the program.

It is another object and advantage of this invention to provide a parallel processing device having the ability to direct different groups of operands to different members of its parallel processing units, depending upon the instruction.

It is a further object and advantage of this invention to provide a technique for accessing multiple operands from a cache memory, where all of the operands can be accessed in a single clock cycle, regardless of the address of the first of the multiple operands and regardless of the placement of the set of parallel operands within one or more cache pages.

It is another object and advantage of this invention to provide an instruction word that can contain either a set of bits that defines a mode of operation, the control of a scalar processing unit and a parallel processing unit, executed as an entity, or a set of bits that defines a mode of operation and two sets of operations for a scalar processing unit, or a set of bits that defines a mode of operation and two sets of operations of a parallel processing unit.

It is another object and advantage of this invention to provide a technique for accessing an instruction from a cache memory where all of the bits of the instruction can be accessed in a single clock cycle, regardless of the address of the first byte of the instruction and regardless of the placement of the entire set of bits within one or more cache pages.

It is a further object and advantage of this invention to provide a digital signal processor device having instructions that feed one set of operands to a first group of parallel processing units, a second set of operands that contain a portion, but not all of, the first set of operands that are fed to a second group of parallel processing units, and so on. Furthermore, one set of operands may be fed to the first two groups of parallel processing units, a second set of operands that contain a portion, but not all of, the first set of operands are fed to third and fourth groups of parallel processing units, and so on.

It is another object and advantage of this invention to provide a technique for writing only selected operands to memory from a set of parallel processing devices, where the connection of the parallel processing devices to memory can vary from one instruction to the next.

It is yet another object and advantage of this invention to provide a digital signal processor device having a scalar arithmetic unit and a parallel processing unit comprised of an n row by m column array of parallel arithmetic units that are coupled to a data cache, and further including n pattern matcher coprocessors individual ones of which are coupled to one of said n rows of parallel arithmetic units.

SUMMARY OF THE INVENTION:

A digital signal processor device includes a data cache; a scalar arithmetic unit coupled to the data cache; a parallel processing unit coupled to the data cache, where the parallel processing unit includes an n row by m column array of parallel arithmetic units, and n pattern matcher coprocessors individual ones of which are coupled to one of said n rows of parallel arithmetic units. The device can programmed using one-dimensional and two-dimensional parallel data types, and a program for the device can be made using a method that defines an amount of parallelism, and then using a compiler to produce code having the same amount, or a lesser amount, of parallelism than used to define the program.

The device includes circuitry for directing different groups of operands to different one of the parallel arithmetic units, depending upon the instruction being executed, and also includes circuitry for accessing multiple operands from the data cache memory, where all of the operands can be accessed in a single clock cycle, regardless of the address of the first of the multiple operands and regardless of the placement of the set of parallel operands within one or more cache pages. The device uses an instruction word that can contain either a set of bits that defines a mode of operation, the control of the scalar arithmetic unit and the parallel processing unit, and that is executed as an entity, or a set of bits that defines a mode of operation and two sets of operations for the scalar processing unit, or a set of bits that defines a mode of operation and two sets of operations of the parallel processing unit.

The device also includes circuitry for accessing an instruction from the data cache memory, where all of the bits of the instruction can be accessed in a single clock cycle, regardless of the address of the first byte of the instruction and regardless of the placement of the entire set of bits within one or more cache pages. The device executes instructions that feed one set of operands to a first group of parallel arithmetic units, and a second set of operands that contain a portion but not all of the first set of operands to a second group of parallel arithmetic units. The device may also execute instructions that feed one set of operands to a first two groups of parallel arithmetic units, and a second set of operands that contain a portion but not all of the first set of operands to a third and fourth group of parallel arithmetic units. The digital signal processor device also includes circuitry for writing selected operands to memory from the parallel arithmetic units, where the connection of the parallel arithmetic units to memory can vary from one instruction to the next.

Data that is loaded into the data cache may be output from an image sensor, and may represent all or a portion of an image of a fingerprint. In this case the device functions as a high speed fingerprint pattern matcher, thereby enabling a number of valuable applications to be implemented and realized, such as weapon safety systems, doorlocks, and user identification authentication systems.

BRIEF DESCRIPTION OF THE DRAWINGS:

The foregoing and other aspects of this invention are made more apparent in the ensuing description of the preferred embodiments, when read in conjunction with the Drawings, wherein:

Figures. 1, 2 and 3 depict common patterns in the way data is stored and used in the A436 DSP;

Figure 4 shows a placement of parallel data types in memory;

Figure 5 is a block diagram of the A436 DSP;

Figure 6 is a block diagram of a basic 32b Instruction Word of the A436 DSP; Figure 7 is a block diagram of a Scalar Arimetic Unit of the A435 DSP;

Figure 8 is a block diagram showing 32 Parallel Processing Units of the A436 DSP;

Figure 9 is a block diagram of one of 32 Parallel Arithmetic Units of the A436 DSP;

Figure 10 shows an example of common types of image data presenting alternating patterns in memory;

Figure 11 is a chart showing an example of a hardware interrupt;

Figure 12 is a chart showing an example of a jump;

Figure 13 is a chart showing an example of a call;

Figure 14 is a chart showing an example of a return;

Figure 15 is a depiction of a weapon constructed to contain a fingerprint recognition system that employs the A436 DSP; and

Figures 16A and 16B are each an embodiment of an optical system for imaging a fingerprint, in accordance with an aspect of this invention.

DETAILED DESCRIPTION OF THE INVENTION:

My Ax36 family of video digital signal processor chips is highly optimized for handling live images and being programmed directly in C, the most widely used programming language. No embedded assembly language, API's (application programming interfaces), "canned" subroutine libraries or microcode are required.

Extremely high performance is achieved with moderate clock rates and conservative, low-cost fabrication using an efficient, highly parallel design. The architecture exploits the patterns with which image data is stored in memory. These patterns are represented by a series of hardware/software templates or parallel data structures. Many levels of hardware optimization result in much higher performance, smaller programs, easier I/O, improved memory utilization, smaller system size, lower power dissipation and lower cost than alternative processors.

Highly efficient programs that are short, run quickly and have a small memory footprint can easily be written using my ANSI-standard, parallel-enhanced C compiler. Time-critical loops can easily be re-coded using my parallel enhancements to C, and existing C functions that are not time-critical can simply be recompiled. Instructions that perform parallel processing are only 32 bits long, and micro-controller type instructions that only perform scalar processing are just 16 bits long, further reducing program size for ROM-based applications.

A simple but comprehensive parallel programming model is provided so that programs can be written once in C using my parallel processing extensions, and then be compiled for specific members of my Ax36 family that have varying degrees of parallelism. For a given chip fabrication technology, applications that require less performance than provided by the A436 can take advantage of scaled-down and reduced pin-count Ax36 chips to provide lower cost points. As we move to more advanced chip fabrication technology, customers can take advantage of Ax36 chips with less parallelism but higher clock rates than the A436 to provide lower cost points.

The A436 has a large (20x) increase in processing power compared to the A236. This is provided by a combination of increased clock rate, increased parallelism, improved Data Cache and Instruction Cache, more conditional execution capability, and the implementation of new parallel data types. The A436 has eight instances of an improved version of the A236's Parallel Processing Unit. Among many other uses, a low cost, single-chip plus memory, universal, realtime, fully software programmed, broadcast quality, standards-based or proprietary, wavelet- or DCT-based, video compressor can be implemented with the A436.

Significant improvements made by the A436, compared to the A236, include:

1) The number of parallel arithmetic units is increased from 4 to 32, the number of multipliers from 4 to 32, the number of pattern matcher coprocessors, referred to for convenience also as "motion estimation coprocessors", is increased from 1 to 8, and the number of general-purpose 16b registers from 256 to 1,152

2) The width of the Data Cache is increased from 64 to 128 bits and the sizes of all parallel operands have been increased to take full advantage of it.

3) Parallel operands can have as many as 16 adjacent bytes, can be placed on any memory byte address, and can be accessed, sign- or zero-extended to 16b, and processed in a single CPU cycle.

4) Two-dimensional, imaging-oriented parallel data types are added.

5) Special instructions for parallel implementation of matrix-vector multiply, convolution, matrix transpose, and histograms/table-lookups/erosion/dilation are added. 6) Bit-realignment and storage of parallel operands can be done in a single instruction

7) Short (16b) instructions have been added to reduce program size for micro-controller code

8) Jump-less, data-dependent scalar and parallel operations are provided Every instruction is conditional

9) An Audio IO Port, Bit-Programmable IO Port, second interval timer and Debug Port are added

B Benchmarks (@100 MHz CPU clock)

8 x 8 2-D DCT only 34 CPU cycles (340 ns), including reading packed or alternating bytes from memory, sign- or zero-extending them, matrix transpose and bit-realignment (only 3 additional CPU cycles to quantize)

16 x 16 motion-estimation only 4 CPU cycles (40 ns) on average, since eight, 8-pixel motion estimation calculations are performed simultaneously

8 x 8 matrix transpose of 16b words only 8 CPU cycles (80 ns), starting with the matrix in registers, or 0 (zero) CPU cycles as an internal part of a 2-D DCT

3 x 3 convolution of 8b signed or unsigned values using 16b coefficients only 0 44 CPU cycles/block (4 4 ns) on average since 32 convolutions are performed simultaneously, figure includes reading pixels from memory, bit-realignment to adjust for multiplying bytes by words, and storing the results in memory

JPEG compress 640 x 480 color image to 4 1 1 resolution, approx 10 ms erosion/dilation of 32 overlapping 3 x 3 blocks of pixels in memory using an arbitrary mapping function requires only 17 CPU cycles or an average of only 0 55 CPU cycle/block (55 ns/block)

C Enhancements for Image and Video Compression and Streaming

D Terminology

For convenience, each one of the 32 parallel processing units in the A436 is referred to as "vector processor"

Historically, however, a vector processor is a large entity that only performs arithmetic operations on entire matrices It is passed the address of an entire matrix and performs a single function upon that entire matrix, generally using floating-point arithmetic As a result, vector processors are generally limited to scientific computing

In comparison, the A436 has a scalar arithmetic unit and a parallel processing unit, which contains 32 parallel arithmetic units, all under control of a single 32b instruction

1) is much more general purpose and is intended for processing live images using fixed- point arithmetic

2) has a programming model that is parallel processing of C structures, not matrix algebra

3) is a RISC machine, not a CISC machine, and performs parallel computations using a Register_A OP Register_B — > Register B model

4) is suitable for real time operations because timing is predictable, with instructions being executed at the rate of one per CPU cycle 5) has a 32b instruction that fetches a relatively small and fixed number, 4 to 32, of operands, depending upon the instruction

6) performs a wide variety of operations, including boolean, arithmetic, matrix-vector multiplication, convolution and motion estimation

7) provides flexible pointer arithmetic for addressing parallel operands, and sliding- window addressing is provided for convolution, motion estimation, etc.

8) enables each parallel arithmetic unit to be enabled or disabled

9) has many general purpose registers

As such, the term "vector processor", as used herein, implies significantly more than what is implied by the conventional definition of the term.

The following abbreviations are used in this patent application:

or" b..a = items b down to a, including items b and a b# = bit # B# = byte #

0x# or #h = the # is in hexadecimal format #d = the # is in decimal format

ACx = accumulator "x", x = 7..0, there are two 32b accumulators in each parallel arithmetic unit; for a total of 8 accumulators per row of the Parallel Processing Unit FRx = Fixed Register "x", x = 3..0, most of these registers are always accessible in each parallel arithmetic unit GP = general purpose PAU = parallel arithmetic unit (also called a Vector Processor (VP)), there are 32 of them in the

A436 PPU = parallel processing unit (formerly called the Parallel Processor), there is 1 of them in the

A436 SAU = scalar arithmetic unit (formerly called the Scalar Processor), there is 1 of them in the

A436 SRx = Scalar Register "x", x = 31d..0, the lower 24 are general purpose, the upper 8 are special purpose VRx = Vector Register "x", these 32 registers are in each parallel arithmetic unit; four of them are FR3..0, four of them represent the two 32b accumulators, 16 of them represent 16 of the 32 Windowed Registers, and the rest are special purpose such as for accessing memory and the processor status word WRx = Windowed Register "x", x = 31d..0, these registers are in each parallel arithmetic unit, all are directly addressable by the matrix multiplication instructions while groups of 16 are accessible by other instructions

All references to the Data Cache Bus refer to the Crossbar Bus. E. Detailed Comparison of A236 and A436

F. Multimedia Parallel Data Types Supported by A436

Audio and image data typically have certain patterns in the way the data is stored in memory, and in the representation of each data sample. The A436 has I/O interfaces that place data into memory and read data from memory in such a way as to make it easy to process and to optimize the performance of the Data Cache. The A436 instruction set supports conversion from one way of representing data to another. It also supports saturated arithmetic to handle overflows that may occur, potentially turning a white pixel into a black one and vice-versa.

Eight-bit data can be signed (2's complement) or unsigned. Sixteen-bit data is signed (2's complement). Data can occupy successive locations in memory, which I call a packed format, or alternating locations in memory, which I call an interleaved format. Thirty-two bit data is handled in the A436's accumulators, but the A436 does not have a parallel data type that directly accesses it in memory in a single cycle.

Common sources of multimedia data that the A436 handles efficiently are:

1) monochrome digital image sensors, 8 to 16 bpp (bits per pixel), unsigned 2) color digital image sensors, 8 bpp, unsigned, using the Bayer 2G color filter format, which has alternating red-green pixels on one line then alternating green-blue pixels on the next line, and so on

3) coefficients with 8 to 16 bits, signed

4) monaural audio, 16b, signed

5) stereo audio, 16b, signed, alternating channels

6) video encoder and decoder chips, Y:U:V or YCrCb 4:2:2 (a pair of pixels in a line shares the same color information) with 8b per datum, 24 bpp, usually signed because it is derived data, but can be signed or unsigned, depending upon how the encoder or decoder is initialized

Other formats, such as Red-Green-Blue for driving a computer monitor, as opposed to a TV monitor, can also be handled.

Common patterns in the way multimedia data are stored in the A436 are shown in Figure 1. For example, in a single CPU cycle, v_add_tv«rps vra, vrb simultaneously computes A plus

B — > B upon a pair (A[31..0] and B[31..0]) of one-dimensional, 32-element, 16b parallel operands that have already been placed in the registers (VRn) of the parallel arithmetic units (VPn,m's), producing a 32-element, 16b parallel operand. The additions are performed in a pair- wise fashion as shown in Figure 2.

G. Simple Parallel Programming Model

The A436 has a unique, powerful and efficient structure processing instruction set. It is designed for programming directly in C, not via embedded assembly language, API's or subroutine libraries, to give good control of the code generated. It is not necessary to embed assembly language code in a C program to be able to make good use of the parallel processing capability, as is required by microprocessors having multimedia capability.

The A436's instruction set is specially designed for real-time image processing. The A436 architecture exploits the low-level patterns with which image data is stored in memory. These patterns of storage, as opposed to patterns in the data itself, are represented by a set of hardware/software templates or parallel data structures. The access and manipulation of data using these templates are directly implemented in the A436 instruction set and by primitives in my parallel-enhanced ANSI-standard C compiler to make them easily accessible. Critical loops can easily be recoded using the parallel data types provided, and non-critical code written in C can simply be recompiled.

Programs written for the A436 in C using the parallel enhancements I provide can easily be recompiled to run on other members of the Ax36 family. For a given chip fabrication technology, applications that require less performance than provided by the A436 can take advantage of scaled-down versions of the A436 to provide even lower cost points. As we move to more advanced fabrication technologies, Ax36 chips having less parallelism but a higher clock rate than the A436 can also provide lower cost points.

Just as one defines scalar variables in terms of scalar data types that are supported by the hardware, in the A436 one defines parallel variables in terms of the parallel data types that are supported by the hardware. When you use a variable that has been declared in terms of a parallel data type, the C compiler selects the instruction code that instructs the A436 to access and process the particular parallel data type required. Address pointers can readily be auto- incremented, and the stride or offset of the increment can easily be specified, depending upon the locations in memory of successive parallel operands.

In a single 32b instruction, the A436 operates in parallel upon two, one- or two-dimensional parallel (multi-element) data structures. As many as 64 pairs of 8b operands or 32 pairs of 16b operands can be processed by a single instruction. The A436 is especially efficient at simultaneously processing multiple operands that are read from memory and used multiple times, all in a single instruction. This occurs frequently for many important imaging functions including matrix-vector multiplication, convolution and motion estimation.

Instructions that perform parallel processing have only 32 bits (except for 64b "extended" instructions that have 32b immediate operands). They fully specify the type of parallel operand to be used and the operation of both the parallel processing unit and the scalar arithmetic unit. Instructions that provide only scalar processing for micro-controller functions have only 16 bits, further reducing program size. In a single 32b instruction that performs both scalar and parallel processing, the instruction formats, where X is a scalar operand, Y is a parallel operand, OPS is a scalar operation and OPV is a vector (or parallel) operation, are:

format #1 : Scalar: (A OPS B ® B) and Vector: (C OPV D ® D), for logic and arithmetic operations format #2: Scalar: (A OPS B ® B) and Vector: (C OPV D) + E ® ^ for matrix- vector multiplication, convolution and motion estimation

Sixteen-function scalar and vector ALUs are provided. Special functions, such as memory access and accumulators, are mapped into the registers. A scalar memory read is specified by simply referring to memory as the "A" operand. A scalar memory write is specified by referring to memory as the "B" operand. Three registers are used in the scalar arithmetic unit to specify 8b, 16b or 32b operands. The parallel arithmetic unit operates similarly except that the choice of parallel data type is specified in the opcode rather than a register. Often, the scalar arithmetic unit does an address calculation to support the fetching or storing of a parallel operand by the parallel processing unit.

Each instruction that performs parallel processing specifies the parallel data type to be used. When data in memory is read or written, the selection of a particular parallel data type determines how much memory data is to be accessed (4, 8 or 16 bytes, or 4 or 8, 16b words), the precision (8b or 16b) and representation (signed bytes or words, or unsigned bytes) of the memory data, and the organization of the data in memory (packed or alternating).

At the C compiler level, one defines multi-element variables in terms of the parallel data types supported by my parallel-enhanced C compiler. One can use parallel data types supported by a given Ax36 chip, or let the C compiler map a larger parallel data type to multiple smaller parallel data types for you.

When one operates upon these variables using normal C operators, the compiler uses the specification of the data types being used to select the proper A436 instruction or instructions to do the processing. Thus if one processes a line of video data and the processing is done using variables that have been declared as a parallel data type (instead of a scalar data type) that specifies sixteen bytes packed signed, each C statement processes sixteen operands instead of one. The C compiler takes care of all of the details for you, such as register allocation, memory access, etc.

Many powerful C constructs map to a single 32b instruction in the A436. A relatively low CPU clock rate is used because a large amount of parallel processing is provided by each instruction. In addition, a very shallow instruction pipeline is used in the A436. This enables the C programmer to use the C language for one of the key things it was intended to do, provide a high degree of control over the code created. It is far, far easier to optimize a C program for the A436 than it is to optimize a program for processors that use deep and complex pipelines.

The C compiler produces an assembly language output that is automatically assembled to object code if desired. The assembly language output can be viewed to assess the quality of the code generated and can be edited if desired.

A single, linear memory address space is used. The Data Cache provides desired operands when they are addressed; it is not necessary to explicitly move them into a local memory to be able to access them. A separate Instruction Cache provides instructions to the Instruction Unit.

The parallel processing unit performs like-processing on multiple like-operands at the same time. It contains 32 tightly coupled 16b parallel arithmetic units. They are organized as 8 rows with 4 parallel arithmetic units in each. An in-line conditional execution capability enables each of the parallel arithmetic units to perform data-dependent processing. Each row of four parallel arithmetic units is tightly coupled to a pattern matcher coprocessor, also referred to herein for convenience as a motion estimation coprocessor, for a total of eight motion estimation coprocessors. More than 1,000 16b general-purpose registers are provided in the parallel processing unit, so many coefficients and motion estimation search targets can be stored and accessed instantly.

With one exception, which enables the simultaneous processing of 64, 8b operands instead of 32, 16b operands, 8b memory operands are automatically converted to 16b precision on a memory read, and either truncated or saturated back to 8b precision on a memory write. The conversion is done according to whether the parallel data type specifies that the bytes are signed or unsigned. Parallel 16b and 32b operands are also supported.

In addition to specifying the precision and format of parallel data, two patterns of data storage in memory are also supported, packed and interleaved. Packed means that successive operands are adjacent to one another in memory. Interleaved means that successive operands are in alternating locations in memory, i.e., if byte data is being referenced, then alternating bytes of data are accessed rather than every byte. Parallel data in memory can be read and written using these formats.

The specification of a parallel data type by an instruction also chooses one of several two- dimensional configurations for the 32 parallel arithmetic units Configurations include 8 rows x 4 columns, 8 rows x 8 columns (each parallel arithmetic unit handles two 8b values), 4 rows x 8 columns and 2 rows x 16 columns. These configurations determine how a series of bytes in memory is accessed by the parallel arithmetic units.

For example, in a single CPU cycle, v_add_twps vra, vrb simultaneously computes A plus B ® B upon a pair (A[31..0] and B[31..0]) of one-dimensional, 32-element, 16b parallel operands that have already been placed in the registers (VRn) of the parallel arithmetic units (VPn,m's), producing a 32-element, 16b parallel operand. The additions are performed in a pair- wise fashion as shown in Figure 2.

Similarly, in a single CPU clock cycle, v_mvm_oqwps vm, r0 , AC0 performs a matrix- vector multiplication ([M] X [V] ® [AC]) in parallel. Thirty-two multiplies and adds are done simultaneously. In this example, one 4-element, 1-D parallel operand (V3..V0) is read from memory (vm) in quad word packed signed format (four, 16b signed values are adjacent to one another in memory). It is multiplied by an 8-row (hence the term "octal") by 4-column (hence the term "quad") 2-D parallel operand (the matrix) that is stored with 16b precision in one of the 32 windowed registers (wrO ) in the parallel arithmetic units (VPn,m). The result is an 8- element parallel operand. Each element is stored in one (ACO) of the eight 32b accumulators in each row of the Parallel Processing Unit as shown in Figure 3.

Addresses and loop counts are computed by the scalar arithmetic unit, which is also responsible for program flow and interrupt handling. Operations can be performed in a single instruction between operands in memory and operands in registers with the results being stored in registers, between operands in registers with the results being stored in registers, and on an operand in a register with the result being stored in memory.

An internal I/O system provides numerous powerful internal I/O devices, including three video- aware and packet capable DMA controllers. The I/O system is controlled by "extended" registers that are an extension of the registers in the scalar arithmetic unit. It is supported by reduced length (16b) instructions to reduce program size.

Instructions are executed at the rate of one per CPU cycle. Most instructions have 32 bits. The exception is "extended" instructions, which have 64 bits because they contain a 32b immediate operand.

In the execution of a single 32b instruction when a parallel operand is read from memory:

(1) A general purpose register (specified by the scalar register B field) in the scalar arithmetic unit provides an address,

(2) all elements of the parallel operand are simultaneously read from memory in a single CPU cycle regardless of the memory address,

(3) data format conversion (parallel 8b signed or unsigned operands are converted to parallel 16b signed operands, or parallel 16b operands are unchanged) is done upon the parallel operand read from memory in the parallel processing unit,

(4) a one- or two-dimensional parallel operand from registers and the converted parallel operand from memory are processed, with the results being stored in registers in the parallel processing unit, and

(5) the next address is computed by the scalar arithmetic unit and the general purpose register is updated.

In the execution of a single 32b instruction when a parallel operand is written to memory:

(1) A general purpose register (specified by the scalar register B field) in the scalar arithmetic unit provides an address, (2) a 16- or 32b parallel operand in registers is rotated, if desired, in each parallel arithmetic unit,

(3) if the data type is byte, the 16b operands in the parallel arithmetic units are saturated (signed or unsigned) or truncated to 8 bits as they are sent to memory,

(4) all elements of the parallel operand are simultaneously written to memory in a single CPU cycle regardless of the memory address, including packed and alternating memory formats, and

In the execution of a single 32b instruction when no memory access for parallel operands is required:

(1) One or two parallel operands in registers are processed in each parallel arithmetic unit with the result being written to a register in each parallel arithmetic unit, and

(2) the scalar arithmetic unit is available for other tasks such as computing loop counts.

H. Summary of Parallel Data Types Supported

The representation of data is important in any application. The way that image data is represented is especially important because so much of it must be acquired, buffered and processed quickly to handle live images, especially video images, in real time.

The A436's parallel programming model is supported by a wide variety of one- and two- dimensional parallel data types to handle image data very efficiently. In the A436, a one- dimensional parallel operand is structured as one row by M (4, 8 or 16) elements, while a two- dimensional parallel operand is structured as N (usually 8) rows of 4 elements each.

The number of elements in all parallel operands referenced in memory has been doubled or quadrupled compared to the A236 to take full advantage of the 128b fully non-aligned Data Cache in the A436 and the increased number of parallel arithmetic units. In a typical instruction, a one-dimensional parallel operand is fetched from memory and used to operate upon either a one- or two-dimensional parallel operand that is stored in registers.

One-dimensional parallel operands from memory can be loaded into specified groups of parallel arithmetic units using a load-store type architecture, or can be broadcast to all groups of parallel arithmetic units for computations. The grouping of the parallel arithmetic units varies with the type of parallel operand. An enable bit in each parallel arithmetic unit enables parallel conditional operations to be performed by enabling/disabling any combination of parallel arithmetic units.

The seven 1-D parallel data types in the A436, which are natively supported by the hardware, i.e., at the instruction set level, for memory access (read and write) are:

1) quad word interleaved signed (_q is) - four 16b words in alternating locations in memory; all words used in the parallel processing unit are 16 bits and 2's complement

2) octal byte interleaved signed |unsigned (_obis | u) - eight bytes in alternating locations in memory; all bytes used in the parallel processing unit are 8 bits and signed (2's complement) or unsigned; replaces quad byte interleaved signed|unsigned (_qbis | u) in the A236

3) octal word packed signed (_owps) - eight 16b words in adjacent locations in memory; replaces quad word packed signed (_qwps) in the A236

4) sixteen byte packed signed|unsigned (_sbps | u) - sixteen bytes in adjacent locations in memory; replaces quad byte packed signed|unsigned (_qbps | u) in the A236

5) sixteen byte compressed signed (_sbcs) - sixteen bytes in adjacent locations in memory, accessed like octal word packed signed but each parallel arithmetic unit functions as two 8b arithmetic units; replaces octal byte packed unsigned (_obpu) in the A236

These parallel data types are placed in memory as follows (X = don't care), Ptr addresses the lowest byte, as shown in Figure 4.

The following additional 1-D parallel data types will be natively supported by my parallel- enhanced C compiler; they require multiple instructions to load or store in the A436, but can be processed in a single C statement: 1) octal double- word packed signed (_odps) - eight 32b signed values; used to represent products and sums-of-products that are stored in accumulators, in adjacent memory locations

2) sixteen double-word packed signed (_odps) - sixteen 32b signed values; used to represent products and sums-of-products that are stored in accumulators, in adjacent memory locations

3) thirty-two double-word packed signed (_tdps) - thirty-two 32b signed values; used to represent products and sums-of-products that are stored in accumulators, in adjacent memory locations

4) sixty-four double-word packed signed (_f dps) - sixty-four 32b signed values; used to represent products and sums-of-products that are stored in accumulators, in adjacent memory locations

5) thirty-two byte packed signed (_tbps) - thirty-two 8b signed values in adjacent memory locations

6) thirty-two byte packed unsigned (_tbpu) - thirty-two 8b unsigned values in adjacent memory locations

7) thirty-two byte interleaved signed (_tbis) - thirty-two 8b signed values in alternating memory locations

8) thirty-two byte interleaved unsigned (_tbiu) - thirty-two 8b unsigned values in alternating memory locations

9) sixty-four byte packed signed (_fbps) - sixty-four 8b signed values in adjacent memory locations

10) sixty-four byte packed unsigned (_fbpu) - sixty-four 8b unsigned values in adjacent memory locations

11) sixty-four byte interleaved signed (_fbis) - sixty-four 8b signed values in alternating memory locations 12) sixty-four byte interleaved unsigned (_fbiu) - sixty-four 8b unsigned values in alternating memory locations 13)thirty-two word packed signed (_twps) - thirty-two 16b signed values in adjacent memory locations

14) thirty-two word interleaved signed (_twis) - thirty-two 16b signed values in alternating memory locations

15) sixty-four word packed signed (_fwps) - sixty-four 16b signed values in adjacent memory locations

16) sixty-four word interleaved signed (_f is) - sixty-four 16b signed values in alternating memory locations

The following two-dimensional parallel data types are natively supported by the A436 hardware, i.e., at the instruction set level, for use with multiply (matrix-vector multiply, convolution, convolution-pair) and the motion estimation instructions that process two-dimensional data structures:

1) octal quad word packed signed (_oq ps) - for matrix-vector multiplication and convolution

2) octal quad word interleaved signed (_oqwis) - for matrix- vector multiplication and convolution

3) octal quad byte packed signed (_oqbps) - for matrix- vector multiplication and convolution

4) octal quad byte packed unsigned (_oqbpu) - for matrix-vector multiplication and convolution

5) octal quad byte interleaved signed (_oqbis) - for matrix-vector multiplication and convolution

6) octal quad byte interleaved unsigned (_oqbiu) - for matrix-vector multiplication and convolution

7) octal octal byte packed signed (_oobps) - for motion estimation 8) octal octal byte packed unsigned (_oobpu) - for motion estimation

The use of the modifiers with 2-D data structures is as follows:

1) The first modifier, e.g., "octal", refers to the fact that each instruction operates upon a 2-D data structure that has eight rows (since the parallel processing unit has eight rows) and uses a common input, which is usually read from memory and is specified by the remaining modifiers.

2) The second modifier, e.g., "quad" or "octal", specifies the number of operands in each row of the data structure, four or eight, respectively.

3) The third modifier, "byte" or "word", specifies the size of each input operand in memory, 8 bits or 16 bits, respectively.

4) The fourth modifier, "packed" or "interleaved", describes the physical relationship of operands to one another in memory; packed means they are adjacent to one another in memory and interleaved means they alternate in memory with like-size operands that are ignored.

5) The fifth modifier, "signed" or "unsigned", specifies the coding of parallel input operands in memory; words are always signed (2's complement), bytes can be signed (2's complement) or unsigned and are converted to 16b operands by being extended with eight sign bits or zeroes, respectively.

When the second modifier is "quad" and the operand size is "byte" or "word", a total of 32, 16b coefficients, which are stored in the windowed registers in the parallel arithmetic units, are used by each instruction if the full capability of the instruction is used. When the second modifier is "octal" and the operand size is "byte", a total of 64, 8b values, are used each instruction if the full capability of the instruction is used

I. Figure 5 shows a block diagram of the A436 Video DSP (VDSP) chip 10 in accordance with the presently preferred embodiment.

As shown in Figure 5, in a typical application the program is loaded into external SDRAM via a Host Parallel DMA Port 15, and data that has been processed/compressed by the A436 chip 10 is passed from external SDRAM to an external processor under control of the Host Parallel DMA Port 15, which is programmed to operate in packet mode. Then 8b or 16b, progressive scan, or interlaced or non-interlaced, image/video data is received by the Left Parallel DMA Port 14 and stored in external SDRAM in a circular, multi-frame input buffer that is controlled by the Left Parallel DMA Port 14. This is done in the background continuously without any cpu intervention. The Left Parallel DMA Port 14 is programmed to operate in video (or stream) mode. An Instruction Unit 16 receives an end-of-frame interrupt and reads the most recently received frame of data from the circular input buffer in external SDRAM, processes it and stores it in another circular buffer, the output buffer, in external SDRAM. Data from the circular output buffer in SDRAM is read by a Right Parallel DMA Port 18 and sent to an output device for display. This is done in the background continuously without any cpu intervention. The Right Parallel DMA Port 18 is programmed to operate in video (or stream) mode.

J. Block Diagram of Basic 32b Instruction Word in A436 Chip 10

Referring also to Figure 6, instructions are of the form A OP B — » B. The vector instruction executes one CPU cycle after the scalar instruction. The scalar instruction typically provides a memory address to reference a parallel operand that is used the next cycle by the vector processors/ parallel arithmetic units 20. Some instructions decode additional fields in the instruction word, particularly the vector opcode and scalar opcode.

K. Summary of Instructions, Registers and Condition Codes

The format and operation of a typical 32b instruction, which executes at the rate of one per CPU clock, is: s_add srl , sr2 ; v_add_sbps vm,vr0

Scalar register 2 (sr2) before addition is the address for vector memory (via) read, srl (the address offset) is added to sr2 to form the next address, a sixteen byte packed signed (_sbps) operand is read from memory in a single cycle regardless of address, each of the sixteen, 8b signed values is sign-extended to 16b, and, for the parallel arithmetic units that are enabled, the sign-extended 0^th memory operand is added to vrO in parallel arithmetic units 0 and 16, the signed-extended 1^st memory operand is added to vrO in parallel arithmetic units 1 and 17, etc. for all active units of the 32 parallel arithmetic units 20.

Scalar (s) (instruction b3..0) and Vector (v) (instruction 7..14) ALU Functions, shown in hex

0: A AND B ( and) 4 : not A AND B 8: A plus CFF OxC: A minus 1

( andns) ( addsc) (_dec)

1: AXORB (_xor) 5 : A XNOR B 9: A plus B plus CFF OxD: A plus B

( xnor) ( addc) (_add)

2: A ORB ( or) 6 : n o t A OR B OxA: A plus notB plus OxE: A minus B

( orns) CFF ( subsdc) ( subsd)

3: A ( move) 7: not A ( moven) OxB: not A plus B plus OxF: B minus A CFF ( subdsc) ( subds )

Scalar Re isters instruction b!3..9 for A (read) and b8..4 for B read/write , shown in hex

Vector Registers (instruction b27..23 for A (read) and b22„18 for B (read/write)), shown in hex

Condition Codes for Scalar and Vector, shown in hex

Note: Special cases are not shown in any of the above groups.

L. Block Diagram of Scalar Arithmetic Unit in A436 Chip 10

As shown in Figure 7, a single (or scalar) operand from memory or a register, and a second operand from a register, are passed through the ALU and stored in a register, updating the second operand. Or, an operand from a register is passed through the barrel shifter or saturation logic and stored in memory. The memory address is sent to the Data Cache to address a scalar or parallel operand that is used the next CPU cycle.

M. Block Diagram of Parallel Processing Unit 20 in A436 Chip 10

As shown in Figure 8, each one of the 32 vector processors (VPn's), also known as parallel arithmetic units 22, can access almost any 8b or 16b in a parallel operand as large as 16 bytes. The choice of which operand is accessed by a given vector processor is determined by the instruction being executed. All vector processors in a row of the Parallel Processing Unit 20 cooperate on motion estimation / pattern matching and sums of products. The scalar arithmetic unit (not shown in Figure 8) can access any of the motion estimation (ME) coprocessors 24, also referred to as pattern matcher coprocessors, or parallel arithmetic units 22 via the Scalar I/O Bus. N. Block Diagram of One of 32 Parallel Arithmetic Units 22 in A436 Chip 10

As shown in Figure 9, an operand from memory or a register, and an operand from a register are passed through the ALU or the multiplier. If passed through the ALU, the result is stored in a general-purpose register; if passed through the multiplier, the result is stored in a 32b accumulator. Or an operand in a register is passed through the barrel shifter or the saturation logic and sent to memory. Four parallel arithmetic units 22 in a row of the Parallel Processing Unit 20 cooperate for motion estimation and sums of products.

A436 Instruction Set

A. Summary and Scalability

This section describes the differences between the instruction sets for the A236, which uses my third generation Ax36 core, and the A436, which uses my fourth generation Ax36 core. It also summarizes changes to the hardware, including internal I/O devices, that affect the programming of the A436.

The A436 has a highly modular design. It is designed in a hardware description language in a highly modular fashion. As a result the A436 can readily be implemented with a variety of chip fabrication technologies. In addition, since the A436 is a parallel processor, the amount of parallelism and thus the amount of processing power can be scaled to provide a wide range of capabilities and thus price points in the Ax36 family.

Programs that are written using the parallel data types I provide in my ANSI-standard, parallel- enhanced version of C can easily be recompiled to run on the various members of my Ax36 family. For a given fabrication technology, applications that require less performance than provided by the A436 can take advantage of scaled-down and reduced pin-count members of the Ax36 family to provide even lower price points.

The key to being able to develop programs in C (or, in principle, in other languages, too) and compile them for use with diverse members of my Ax36 family is that the local parallelism inherent in an application must be expressed using my parallel enhancements to C. By local parallelism I mean the amount of parallelism that is found in a relatively small number, typically 64, of pixels that are near one another in a 1-D or 2-D region. This can be done in a general purpose fashion using the parallel data types and parallel programming model that I provide in my parallel-enhanced C compiler. The full set of parallel data types I provide in my C compiler does not necessarily match the amount of parallelism provided in any particular member of my Ax36 family.

Once this parallelism is expressed by the software engineer who is developing an application, library or code module, my parallel-enhanced C compiler can produce efficient code that executes quickly and efficiently for the various members of my Ax36 family. This is done even when the amount of parallelism provided in a given Ax36 chip is different from the amount of parallelism with which the program is written.

This method of developing programs using an idealized model of parallelism and then targeting a specific member of the Ax36 family for implementation is especially useful for developing code for an Ax36 chip that has a large amount of parallelism, and then reusing that code on an Ax36 chip that has less parallelism. This is particularly useful as clock speeds increase over time, enabling an Ax36 chip with less parallelism but higher clock speed to replace a physically larger and more expensive Ax36 chip that has more parallelism. As my family of Ax36 chips grows, parameters in my C compiler will enable one to specify the Ax36 chip for which code is to be produced. These parameters in turn will specify the amount and nature of parallelism provided by a given Ax36 chip so that efficient code can be produced for it automatically by the compiler.

B. Changes to A236 Instruction Set Architecture

1. Number of Parallel Arithmetic Units Increased Eight-fold

The Parallel Processing Unit 20 for the A436 chip 10 contains eight rows of four parallel arithmetic units 22 each, thus forming an n row by m column array of parallel arithmetic units 22., where n is greater than one. Previously there was only one row. Each set of four parallel arithmetic units 22 in a row is connected to a pattern matcher coprocessor, also referred to herein as a motion estimation coprocessor 24.

Each instruction that accesses memory, except for the new multiplication instructions, reads or writes the largest number of parallel operands possible. Instructions that previously referenced four operands now reference either four, eight or sixteen, to make the best use of the increased width of the Data Cache 11 (see Figure 5). The A436 chip 10 behaves like the A236 when only the parallel arithmetic units 22 in row 0 of the parallel processing unit 20 are enabled by their respective vector processor enable bits. Multiple sizes of operands all result in the execution of the same opcode.

Saturation of data is performed according to whether it is signed or unsigned when parallel byte operands are written from the parallel arithmetic units 22, which have 16b precision, to memory.

2. Number of Motion Estimation / Pattern Matching Coprocessors 24 Increased Eight-fold

The A436 chip 10 can process 64 pixels per CPU clock to perform motion estimation and pattern-matching operations, providing extremely high performance for the most computationally demanding part of video compression. Each one of the eight rows of the Parallel Processing Unit 20 has its own 8-pixel, motion estimation / pattern matching coprocessor 24, so there are now eight motion estimation / pattern matching coprocessors 24 instead of one. Each one produces its own set of results.

Each one of the coprocessors 24 computes the sum of the absolute values of the differences of 8 pairs of pixels. These sums are accumulated to implement large match windows. Each instruction, one set of pixels is read from memory, the other set of pixels is read from registers, and computations are performed upon them. There are 32, 16-bit general purpose registers in each one of the 32 parallel arithmetic units 22, so large numbers of pixels can be stored. This enables large match windows or many sets of match targets to be processed quickly.

Motion estimation and pattern matching is handled by the PixDist and PixBest instructions.

Two classes of these instructions are provided, Staggered and Linear. The Staggered instructions are typically used for simultaneously testing eight partially overlapping locations in memory for the same eight pixels. The octal octal byte packed signed|unsigned parallel data types are supported. The Linear instructions are typically used for simultaneously testing the same eight pixels in memory for eight different sets of pixels. The octal octal byte packed | interleaved signed|unsigned parallel data types are supported. For a given position of a search window, the PixDist instruction is used for all match operations except the last. The PixBest instruction terminates one set of match operations and updates the registers that indicate the best matches found and their locations.

The Scalar IO address for the original coprocessor has been changed so that all eight coprocessors 24 can be addressed in a common block. In addition, the number of bits in the PixBest register is increased from 8 to 12 to handle larger search distances..

3. Scalar Arithmetic Unit 26 (Figure 7)

All 24b data paths except for the program counter and special address registers are increased to 32 bits including immediate data from 64b extended instructions. An immediate operand can be used as a memory address by referencing it with the scalar register B address, even though an immediate operand is read-only. Special address registers are increased to 26 bits.

4. 16b Dual scalar Instructions Added, Jump Targets Changed to Two Bytes and Relative Jump Addressing Added

The A436 chip 10 supports 16b dual scalar instructions in addition to 32b scalar/parallel instructions. When no vector operations are required in a series of instructions, the 16b instructions are useful for micro-controller functions to reduce program size.

Programs are written without regard to dual scalar instructions. A pseudo-op in the program instructs the software tools to attempt to combine pairs of scalar instructions into their short form. When this can be done, the software tools automatically combine a pair of scalar instructions into a single 32b word and give it the opcode modifier for dual scalar. When there is a series of scalar instructions that do not have any vector or immediate-data portions, the tools group the scalar instructions so that all pairs of short scalar instructions are formed on 4-byte address boundaries. A 32b scalar/vector instruction with a vector NOP is used at the beginning and/or end of the sequence when no pair can be formed. Transfers-of-control are to addresses on any four-byte boundary, or can be two-byte boundaries if the target is in dual scalar format. Jump and Call instructions contain the transfer-of-control address, which is two-byte aligned (previously four-byte aligned). There are 26 bits (the lsb is always 0) for the address giving a maximum program size of 64 MB. Conditional jump instructions that specify an address as a 21b immediate operand allow either an absolute or relative transfer of control. Call instructions that specify a 25b immediate operand allow either an absolute or relative transfer of control. Jump and Call instructions also allow the use of a register (including a 32b immediate operand) to specify the absolute or relative address. Interrupt vectors specify a full, 26b address (lsb = 0) for the interrupt service routine.

The presence of two short scalar instructions in a single 32b word is signaled by the opcode modifier b31 .28 having a value of 9. The two short scalar instructions execute as though they each had 32 bits, with a vector portion that specifies a vector NOP. All 32b instructions are placed on 4-byte address boundaries, as before.

Each short scalar instruction actually has only 14 bits. The combination of two, 14b instructions plus a 4b opcode modifier that indicates the presence of a pair of short scalar instructions, gives a total of 32 bits, for the equivalent of 16 bits per short scalar instruction. Thus the lsb of the PC selects the lower or upper set of 14 bits in a 32b instruction word, rather than selecting a 16b word.

The format for dual scalar instructions is:

5. Memory Address Size Increased and Allocation of Memory to Program and Data

The memory address space is 64 MB, which uses a 26b address. This affects the Instruction Cache 28, Data Cache 11, Memory Interface 12 and Parallel DMA Ports 14, 15 and 18 (see Figure 5).

In A436 vl, 16- and 32b wide SDRAMs in 16- and 64-Mbit configurations are supported, with a maximum of two banks. This allows up to 32 MB of memory to be used.

A single, linearly-addressed space is used for both program and data. The only rules on the usage of memory are:

1) the program counter is cleared by reset, so the program must start at address 0;

2) the push-down stack is stored in memory and can be of any size;

3) each interrupt service routine is addressed by an interrupt vector that is fully programmable in the interrupt controller and can be located anywhere in memory; no common jump table is required; 4) jump and call targets are four-byte aligned unless the target is in dual scalar format, in which case the target can be two-byte aligned;

5) data segments and program segments must be 64-byte aligned and do not share 64- byte pages because there is at present no mechanism to determine whether or not the same 64-byte page of memory has been loaded into both the Data Cache 11 and the Instruction Cache 28;

6) the parallel DMA ports 14, 15 18 can locate their buffers anywhere in memory, starting on 64-byte boundaries, with the locations being specified by control registers that are loaded by the program; the amount of storage that is allocated to each scan line of an image should not be a power of two but is ideally a (power of two) +/- 64 bytes to optimize the performance of the Data Cache 11;

7) there is no cache snooping; the parallel DMA ports directly read from or write to SDRAM regardless of any like-addressed data in the Data Cache 11 or Instruction Cache 28; and

8) the Data Cache 11 is a write-back cache not a write-through cache, and no mechanism is provided at present for a DMA port to obtain its data from the Data Cache rather than SDRAM, so the program writes back any updated data that is in the Data Cache 11 and has not been moved back to SDRAM.

6. Instruction Cache Improved

The capacity of the Instruction Cache 28 is 4 KB. Its width on the instruction unit side is 64b. The number of cache "ways" is two. The Instruction Cache 28 is read-only by the instruction unit 16. All memory operands with the exception of instructions are handled by the Data Cache 11. It can handle programs that are located throughout the 64 MB address space of the A436 chip 10.

A fully non-aligned 32b implementation is used. This means that 64b, extended instructions (instructions with 32b immediate operands) can be placed on any 4-byte boundary, even if cache pages are crossed. This eliminates the need for NOPs to pad the alignment of 64b extended instructions to 8-byte boundaries. The instruction cache controller can handle a jump to an instruction that crosses two cache pages, even if neither page is initially in the cache, in which case two cache misses would occur.

The Instruction Cache 28 provides the Interrupt Controller 30 the ability to look ahead to the next instruction except when an extended instruction is placed on an 8-byte boundary. The Interrupt Controller 30 uses a "cautious" design to automatically inhibit interrupts to prevent the breaking, or potential breaking, of critical pairs of instructions by an interrupt. The number of cases for which interrupts must be inhibited is reduced compared to the A236.

Since the Instruction Cache 28 is read-only, the entire Instruction Cache 28 can be deallocated with a single instruction. The s_f lush_#n instruction is replaced by s_f lush, without any operands. Three NOPs are programmed immediately after a s_f lush instruction to give the

Instruction Cache 28 time to recover.

s_flush, instruction code: b31..28 (opcode modifier) = OxA, b3. 0 (scalar opcode) = OxD

7. Data Cache Improved and Flush Instructions Changed

The width of the Data Cache 11 is increased to 128 bits to be able to handle larger parallel operands. The capacity of the Data Cache 11 is increased 8-fold to 8 KB. The number of "cache ways" remains two.

A non-aligned parallel operand is an N-byte operand that is not placed on an N-byte address boundary. The Data Cache 11 fully supports non-aligned scalar and parallel operands in a single cycle. Parallel operands as long as 16 adjacent bytes can be placed on any memory byte address, and be accessed and processed in a single CPU cycle, even when the parallel operand spans two cache pages. If a parallel operand is not in cache, the cache controller can handle an operand that crosses two cache pages, even though two page faults occur to access that one operand.

All memory operands with the exception of instructions are handled by the Data Cache 11.

Since images have highly repetitive data structures composed of many lines of image data of identical length, programmers must ensure that the Parallel IO Ports 14, 15, 18 place each line on a memory address that is not a power of two to avoid erroneous results in the Data Cache 11. A line buffer size that is 64 bytes more or less than a power of two is ideal.

The v_flush_#n instruction of the earlier A236 is replaced by v_ riteback sra, srb. The contents (0..127) of srb select which page in the Data Cache 11 is to be saved to DRAM if it is "dirty." Its page is de-allocated in any case. The contents of the sra operand should be "1" to count through the pages. The ALU operation is "A plus B".

v_ riteback sra, srb, instruction code: b31 .28 (opcode modifier) = OxA, b3..0 (scalar opcode)= OxC

The operating modes for the Data Cache 11 are:

1) memory read - save a dirty page to SDRAM if necessary before allocating a new page and reading the new page from SDRAM, then read the operand from cache and send it to the processor;

2) memory write - save a dirty page to SDRAM if necessary before allocating a new page and reading it from SDRAM, then the processor writes an operand to the cache;

3) write without read first - same as memory write but no page is read from SDRAM. This can be used to advantage for filling display buffers, where every pixel gets changed, so it is a waste of time to do a read first, and for temporary buffers or "scratch pads";

4) write back - the page indexed (not addressed) by srb is written back to SDRAM only if is dirty. The page is then de-allocated whether it is dirty or not.

8. Summary of Scalar Registers

The scalar registers 26A are located in the scalar arithmetic unit 26, as are the program counter 26B, the stack pointer 26C and the ALU 26D (see Figures 6 and 7). Several registers (Q and

SPSW) in the A236 have been deleted and others have been assigned new register addresses.

The scalar registers 26A are as follows for both scalar register A and scalar register B: 0xl7..0 - srl7 . . 0, general purpose registers (as in A236 except that registers 0x16 and 0x17 are added).

0x18 - imm (read-only), immediate operand from instruction register. Selection as scalar register B in s_move sra, imm gives a scalar nop (s_nop, neither the scalar registers nor scalar status bits are changed and no 64b instruction is created unless sra is imm). s_move imm, imm, which gives s__nop because srb is imm, loads imm into the memory address and the memory address register. A one-cycle interrupt inhibit occurs whenever sra and srb are both imm since it is not at present possible to look ahead to the next instruction to determine whether or not the value is used as a memory address by the next instruction.

0x19 - smemδ, gives 8b memory operands. Gives a memory read when selected as scalar register A and the 8b operand is sign-extended to 32 bits. Gives a memory write when selected as scalar register B; only b7..0 of the 32b source are written. Memory can be read or written, both not both, in a single instruction, thus the selection of memX as both scalar registers A and B in the same instruction is invalid. Selection as scalar register B when one of the sources in the ALU 26D operation is scalar register B is invalid.

0x1 A - s emlβ, gives 16b memory operands. Gives a memory read when selected as scalar register A and the 16b operand is sign-extended to 32 bits. Gives a memory write when selected as scalar register B; only 5..0 of the 32b source are written. The operand can be on any byte address, even if it spans two cache pages, although performance may be improved if it is on a 2-byte boundary since it would not cross a cache page boundary. Same restrictions on use as for smemδ.

OxlB - smem32 (preferred) or smem, gives 32b memory operands. Gives a memory read when selected as scalar register A. Gives a memory write when selected as scalar register B. The operand can be on any byte address, even if it spans two cache pages, although performance may be improved if it is on a 4-byte boundary. Same restrictions on use as for smemδ .

OxlC ξ£ loop, loop counter, automatically decrements by one on Jump/Call instructions when the "loop" condition code is selected. The "loop" condition code replaces the former scalar condition code (sec) of "never" ("never jump", code OxE). The value of the loop counter is tested prior to the decrement. The down-counter is 16 bits in 5..0. Bits 31..16 are a general-purpose register.

Ox ID - ap, address pointer. It is used for making temporary pushdown stacks for the storage of scalar operands in a single instruction. It is automatically decremented by the size of the memory operand, 1, 2 or 4 bytes, before an operand is stored with the push_a instruction, and incremented by the size of the memory operand after an operand is read with the pop_a instruction. The 5 lsbs also provide an optional rotate code to the barrel shifter 26E in the scalar arithmetic unit 26, and the 10 lsbs provide an optional programmable I/O address for the I/O instructions. The address pointer is located in bits 25..0 of the register. Bits 31..26 are a general- purpose register.

Ox IE - sp (formerly ssp), stack pointer 26C. It is used for making a system-level pushdown stack. It handles the jobs of the former A236 scalar stack pointer, vector stack pointer (no longer needed because all memory operands are now stored in the Data Cache 11) and interrupt stack pointer (no longer needed because the Data Cache 11 is fully non-aligned). It is automatically decremented by the size of the memory operand, 1, 2 or 4 bytes, before an operand is stored with the push and call instructions, and incremented by the size of the memory operand after an operand is read with the pop and ret instructions. The address pointer is located in bits 25..0 of the register. Bits 31..26 are a general-purpose register.

OxlF - pc, program counter and status word. There is a 26b program address (b25..0) where bit 0 is always 0, hence bit 0 does not need to be stored or computed. Jump | Call targets must be on a 4-byte boundary unless the target instruction is a dual scalar, in which case a 2- byte boundary can be used. Status flags, window base and interrupt enable are also stored in the register so that an interrupt requires the saving/restoring of only one word, pc . Writing to this register as a general purpose register affects only the window register base and flags so that it can be loaded at will without changing the program address or interrupt enable.

Program Counter 26B Bit

31 30 29 28 27 26 25 24 23..1 0

Old X X X X X X X X address b23..2 of next 32b instruction, bit 1 = X X

New w w c n z of address b25..1 of next 16b or 32b instruction ie bl bO The organization is (read-only bits are loaded or changed only by certain instructions): bit 0 (read-only) - ie, interrupt enable, set by enable interrupts ei and cleared by disable interrupts di, must be explicitly cleared with di as the second or third instruction in an interrupt service routine and, if operation of the interrupt system is desired, must be set by ei as the last instruction at the end of an interrupt service routine, in the deferred position for ret

bits 25..1 (read-only) - address of next instruction to be executed. In the instruction pipeline it is called the Scalar PC, which is the return-address required by callx and hardware interrupt. This value can be loaded only by transfer-of-control and pop instructions, and the interrupt system. Otherwise, using the pc as a destination as specified by the scalar register B address has no effect on these bits. bit 26 (read-write) - of, scalar ALU overflow bit, updated by arithmetic instructions bit 27 (read-write) - z, scalar ALU zero bit, updated by boolean and arithmetic instructions bit 28 (read-write) - n, scalar ALU negative (sign) bit, updated by boolean and arithmetic instructions bit 29 (read-write) - c, scalar ALU carry bit, updated by arithmetic instructions bits 31..30 (read-write) - wbl..O, window register base, selects a 16-register window into the set of 32 windowed registers (WR31 . . 0) in the parallel arithmetic units 22 for access as VR31 . . 16 by vector instructions in the "normal" format. When wbl ..O = 00b,

WR15..0 are selected. When wbl..O = 01b, WR23..8 are selected. When wbl.,0 = 10b, WR31..16 are selected. When wbl.,0 = l ib, WR7..0,31..24 are selected. However, v_mvm_x, v_mvmadd_x, v_conv_x, v_convadd_x, v_conv2_x, v_conv2 add_x, v_PixDist_x and v_PiχBest_x can access any of the 32 windowed registers for one of their operands, but cannot access the special registers.

9. Summary of Vector Registers Each parallel arithmetic unit 22 (see Figure 9) has its own local set of "vector registers". The number of windowed registers 22 A per parallel arithmetic unit 22 is reduced from 64 to 32 (but there are eight times as many parallel arithmetic units 22). The windowed registers 22A can now be referred to by WR31 . . 0 for some instructions. Four additional general-purpose fixed (non- windowed) registers 22B, FR3 . . 0, are added.

The v_mvm_x, v_mvmadd_x, v_conv_x, v_convadd_x, v_conv2_x, v_conv2add_x, v_PixDist_x and v_PixBest_x instructions bypass the windowing mechanism to access all of the windowed registers 22A. These multiply instructions can also select any one of the two 32b accumulators in any one of the four parallel arithmetic units 22 in a row to save the result. These eight 32b accumulators are referred to by AC7 . . ACO.

The complete listing of the vector register A and B addresses for the A436 chip 10 is: 0: FRO, fixed register 0

1 : FR1, fixed register 1

2: FR2, fixed register 2

3: FR3, fixed register 3

4: accumOL, also known as accumL, accumulator 0 lsbs, is 16 lsbs of ACO in VP0, AC2 in VP2, AC4 in VP2 and AC6 in VP3

5: accumOM, also known as accumM, accumulator 0 msbs, is 16 msbs of ACO in VP0, AC2 in VP2, AC4 in VP2 and AC6 in VP3

6: accumlL, accumulator 1 lsbs, is 16 lsbs of AC1 in VP0, AC3 in VP2, AC5 in VP2 and AC7 in VP3

7: accumlM, accumulator 1 msbs, is 16 msbs of AC1 in VP0, AC3 in VP2, AC5 in VP2 and AC7 in VP3

8: SPB, scalar processor broadcast register 22F. It is a general purpose read/write register in each parallel arithmetic unit 22 that can also be read and written by the scalar arithmetic unit via the Scalar IO Bus, but software should ensure that no conflicts in access occur. 9: vindex, vector index register. Only the addressing of the registers in the windowed register bank (WR31 . . 0) is affected. The 5 lsbs of the vector index register can be used to specify the number of bits to rotate right with the barrel shifter using the v_ror_#0 instruction, or to select the windowed register accessed by vim.

OxA: mask (read-only), no change, gives 16b word of sign bit from vpsw except in sixteen byte compressed signed mode, which gives 8 lsbs of sign bit 7, and 8 msbs of sign bit 15; specifying mask as the Vector Register B for v_move gives a vector NOP (neither the vector registers nor status bits are changed)

OxB: vpsw, vector processor status word, no change to contents

OxC: vm, also known as vmn, vector memory normal access, no change to function

OxD: vmow, vector memory overwrite, when used as "A" address is same as vector memory normal access; when used as "B" address, writes to memory without reading first; for filling display buffers and block moves

OxE: vmio, vector memory IO, unused / illegal

OxF: vim, vector index memory, used with the vector index register vindex . When vim is specified as the vector register "A" address, the contents of vindex are used as the address for accessing the windowed registers. When vim is specified as the vector register "B" address, the contents of vindex are used as the address for accessing the windowed registers

22A. Note that vector register addressing is pipelined, resulting in a one-cycle delay between the loading of vindex and its use as an address. Thus, if vindex is loaded in one cycle, the new value cannot be used in the next cycle, but can be used in the cycle after that.

0xlF..0xl0: VRn, n = 15d..0, there are 32 windowed registers (WR31 . . 0) per parallel arithmetic unit 22. VRn refers to a register in the selected set of windowed registers. The register window is selected by the window base bits wbl..O in pc. See the description of pc.

10. Multibit Scalar and Vector Rotate Improved (s_ror , v ror ) The existing rotate instructions are improved realtive to the earlier A236. When a shift of 0 is specified for s_ror_#n, the contents of the address pointer ap specify the amount of the shift in the scalar arithmetic unit 26. When a shift of 0 is specified for v_ror_#n, the contents of the index register in each parallel arithmetic unit 22 specifies the amount of the shift in the parallel arithmetic unit 22. This enables the shift to be changed easily, enables larger shifts to be done and enables different parallel arithmetic units 22 to have different shifts.

The barrel shifter 26E in the scalar arithmetic unit 26 is increased to 32 bits. Rotates from 0 to 31 bits can be programmed.

The barrel shifter 22C in each parallel arithmetic unit 22 is increased to 32 bits. A rotate over only 16 bits is made unless accumOL or accuml is selected by the vector register A address.

When either of these two registers is selected, a rotate over 32 bits is done. If accumOL is selected, the 16 msbs going into the barrel shifter 22C come from accumOM. If accumlL is selected, the 16 msbs going into the shifter 22C come from accumlM. The input into the 16 lsbs of the shifter 22C come from the register selected by the vector register A address.

The rotate instructions are as follows, where the parameter "n" is in the range OxF. O. nop, instruction code: b31..28 (opcode modifier) = 0x8, b22..18 (Vector Register B) = mask, b8..4 (Scalar Register B) = imm, all other fields are don't care except Scalar Register A cannot be coded as memX or imm .

s_ror_#n sra, srb, instruction code: b31..28 (opcode modifier) = 0x8, b3 .0 (scalar opcode) = n = number of bits to rotate right (from msb toward lsb). If n = 0, ap (address pointer) register b4..0 are used as the shift code. Performs scalar nop (s_nop) if Scalar Register B = imm. Scalar Register A cannot be memX .

v_ror_#n vra, vrb, instruction code: b31..28 (opcode modifier) = 0x8, M7..14

(vector opcode) = n= number of bits to rotate right (from msb toward lsb). If n = 0, vindex

(index) register b4..0 in each parallel arithmetic unit are used as the shift code for that parallel arithmetic unit. Performs vector nop (v_nop) if Vector Register B = mask. 11. Scalar Q Instructions Replaced by New Parallel Data Type

A236 opcode group 5, quad word (packed) signed with scalar Q, is replaced by quad word interleaved signed in the A436 chip 10. The scalar Q functions and register are deleted.

12. Conditional Execution: s_test_scc , v_test_vcc and v_testset_vcc

A236 opcode group OxB, vector test and set, is expanded to provide s_test_scc, scalar test and jump, and v_test_vcc , vector test. The A236 instruction, v_test_vcc, is renamed v_testset_vcc to reflect its setting of the vector processor enable bits.

The software tools ensure that there are no conflicts driving the Data Cache Data Bus. When v_testset_vcc is performed, the tools ensure that next instruction is not v_move vr a , vmx . This is because, in this group of instructions, the choice of which parallel arithmetic units 22 drive the Data Cache Data Bus depends upon which vector processor enable bits are set. There must be at least one cycle between the update of the vector processor enable bits and the use of them for driving the Data Cache Data Bus so that conflict resolution logic has time to disable the driving of the bus by any conflicting parallel arithmetic units 22. In the event of a conflict, only the parallel arithmetic unit 22 in the lowest numbered row for a given byte being driven to memory is allowed to drive the bus.

If s_test_scc is performed and the test condition is satisfied, the next scalar instruction (can be in 32b or dual scalar format) is nullified for this one test only by dynamically replacing it with a s_nop in the instruction pipeline, otherwise the next instruction is executed normally.

If v_test_vcc is performed and the test condition is satisfied by a given parallel arithmetic unit 22, the next instruction to that particular parallel arithmetic unit is nullified for this one test only by dynamically placing a v_nop in the instruction stream to that particular parallel arithmetic unit 22, otherwise the next instruction is executed normally. If v_testset_vcc is performed and the test condition is satisfied by a given parallel arithmetic unit 22, the vector processor enable bit for that parallel arithmetic unit remains set, otherwise it is cleared. Once cleared, a parallel arithmetic unit 22 treats most instructions as v_nop and its vector processor enable bit remains clear until either the vector processor enable bits are loaded by the scalar arithmetic unit or the v_testset_on instruction is executed.

The rules for the changing of the vector processor enable bits are:

1) all vector processor enable bits are cleared by reset

2) v_testset_on unconditionally sets all vector processor enable bits

3) v_testset_of f unconditionally clears all vector processor enable bits

4) in xvpe, srb unconditionally copies all vector processor enable bits into srb

5) out sra , xvpe unconditionally loads all vector processor enable bits from sra

6) v_testset_vcc, where vcc is not on or off, clears the vector processor enable bit in a vector processor that fails the test and whose enable bit is set; no change is made in a vector processor whose enable bit is clear

Interrupts are inhibited for one cycle to protect the modification of the next instruction.

The standard scalar and vector test conditions are used. When neither v_test_vcc nor v_testset_vcc is done, the contents of the parallel arithmetic units 22 are unchanged by the s_test_scc instruction.

The formats for the instructions, where vcc = vector condition code and sec = scalar condition code, are:

The condition codes are:

13. Table of Major Instruction Groups

It should be noted that for the opcode groups that specify a parallel data type, with the exception of sixty-four byte compressed signed, the selection of a particular parallel data type only affects the handling of memory operands by the parallel processing unit 20. Otherwise, when no memory operation is performed by the parallel processing unit 20, the default parallel data type is thirty-two word packed signed, _twps, in which case all active parallel arithmetic units operate upon signed 16b data in their registers.

14. Changes to Instruction Codes

a. Table of Scalar ALU 26D Instructions

The functions of some of the Scalar ALU 26D opcodes depend upon the choice of the destination, which is specified by Scalar Register B. In general, when an operand is written to memory only the ALU functions that do not reference B are useful. The Scalar Register B address is decoded along with the Scalar ALU opcode to determine the function performed, imm is included with memX to simplify decoding although no data is stored when imm is the destination.

There is no change in the Scalar ALU 26D functions when an operand is written to a writeable physical register instead of memX. A scalar NOP is performed for s_move sra, imm, in which case neither the scalar registers nor the scalar status bits are changed.

The Scalar ALU 26D functions are as follows. These functions apply wherever the standard format scalar instructions are used. Note that the pop instructions reverse the roles of the Scalar

Register A and Scalar Register B fields, in which case the "A" operand is the destination (a register) and the "B" operand is the source (memory). CFF = carry flipflop. In general, ALU opcodes 7..0 (logic functions) update only the sign (N) and zero (Z) flags, while ALU opcodes OxF..8 (arithmetic functions) affect all of the flags (C, N, Z, O).

Saturation operates as follows (only the zero and sign flags are changed, based upon the 32b result that is formed before being truncated to the specified number of bits):

1) smemδ - If data is saturated unsigned, the 32b value OxFFFFFFOO is formed for any 32b negative value and the 32b value 0x7FFFFFFF is formed for any 32b positive value greater than OxFF. If data is saturated signed, the 32b value 0xFFFFFF80 is formed for any 32b value more negative than 0xFFFFFF80 and the 32b value 0x7F is formed for any 32b value greater than 0x7F. Any value in between is used without modification. The 32b value is then truncated to 8b and stored.

2) smemlό - If data is saturated unsigned, the 32b value OxFFFFOOOO is formed for any 32b negative value and the 32b value 0x7FFFFFFF is formed for any 32b positive value greater than OxFFFF. If data is saturated signed, the 32b value 0xFFFF8000 is formed for any 32b value more negative than 0xFFFF8000 and the 32b value 0x7FFF is formed for any 32b value greater than 0x7FFF. Any value in between is used without modification. The 32b value is then truncated to 16b and stored.

3) smem32 - no saturation is performed, the 32b operand is written without modification.

b. Operation of Push, Pop, Call and Return Instructions

A pushdown stack is used: the stack pointer 26C sp is decremented before an operand is stored on the stack. Inversely, an operand is read from the stack then the stack pointer 26C is incremented. Thus, the stack pointer 26C addresses the last operand stored, if any.

The stack-oriented instructions operate as follows:

c. Table of Vector ALU 22C Instructions

The functions of some of the Vector ALU 22C opcodes depend upon the choice of the destination, which is specified by Vector Register B. In general, when an operand is written to memory only the ALU functions that do not reference B are useful. The Vector Register B address is decoded along with the Vector ALU 22C opcode to determine the function performed.

There is no change in the Vector ALU 22C functions when Vector Register B specifies that an operand is written to a writeable register. A vector NOP is performed for v_move vra, mask, in which case neither the vector registers 22A, 22B nor the vector status bits are changed. The Vector ALU 22C functions are as follows when an operand is written to memory. They apply wherever the standard format vector instructions are used (opcode modifiers = all except 8, OxA, OxB and OxC, and some of them, too). In general, use of ALU opcodes 7. 0 update only the sign and zero flags, while use of ALU opcodes OxF..8 affect all of the flags. When saturation is performed, a full 16b result is formed for the purpose of setting the flags, then the result is truncated to 8b for storage.

d. Table of v mul Instructions

e. Table of v mul add Instructions

f Table of Cache and IO Instructions

g. Table of Jump and Call Instructions

See also Operation of Push, Pop, Call and Return Instructions.

In the A436 chip 10, any instruction immediately following a jump, call or ret is executed unconditionally. The instruction may be of any length (2, 4 or 8 bytes).

The immediate data for jumpix and callix is formed as follows for jump and call instructions:

1) absolute address: value (unsigned) = (J^umP address / 2), imm2l limits addresses to the lower 4 MB of memory, imm25 allows direct use of the full address space

2) relative address: value (signed, 2's complement) = (displacement / 2) between PC and desired address, the value of PC used for the calculation is the address of the second instruction after the j ump or call, for imm21 the displacement can be approx. plus/minus 2 MB, for imm25 the displacement can be approx. plus/minus 32 MB.

The register contents for jumpxy and callxy are full 32b length with lsb = 1 byte (the lsb is ignored because it is always 0) so transfers anywhere in memory can be made on either an absolute or relative basis, imm can also be used to specify a relative or absolute address.

15. Interrupt System and Interrupt Inhibits

All internal devices are able to interrupt the processor. Interrupt latency is extremely short, less than one microsecond, assuming that the interrupt system is enabled. The enable interrupt ei instruction turns on the interrupt system and the disable interrupt di instruction turns it off. Each internal device has several registers that must be initialized for an interrupt to occur. Each device has a control register that specifies the conditions under which a device may interrupt the processor, and enables one or more of those conditions to cause an interrupt. Software has full control over the bits in the control register so that software can completely emulate a hardware interrupt. This is useful for initializing and testing the interrupt system.

Each internal device has its own interrupt vector and priority, both of which are fully software programmable and stored in the Interrupt Controller in the A436. The interrupt vector sends the program directly to the interrupt service routine for a particular device without having to go through a branch table. This reduces interrupt latency. A single interrupt vector is assigned to each internal device, regardless of the number of conditions in that device that can cause an interrupt. The interrupt vector is full length, so transfers anywhere in memory can be made.

Interrupt inhibits are provided so that no critical sequences of instructions are broken by an interrupt. In addition, the Interrupt Controller allows all pending memory transfers to complete without conflict from the interrupt service routines. A single NOP is inserted by the interrupt controller to ensure that any pending memory requests are completed before the interrupt service routine starts to run.

The interrupt service routines are responsible for saving and restoring the state of the machine. The first instruction in every interrupt service routine must be push pc , smem32 to save the return address. The last instruction in an interrupt service routine must be ret.

Several status bits that were formerly in the scalar processor status word are now in pc so that a single push saves the return address and the state of the scalar arithmetic unit 26. Only a single pop, which is performed by ret, is required to restore it. The interrupt enable bit that is saved by the initial push pc , smem32 is set if interrupts had been enabled until that time, thus the second or third instruction in an interrupt service routine must disable interrupts so that the condition that caused the interrupt can be cleared, otherwise the current interrupt will interrupt itself indefinitely. Some internal devices have multiple conditions that can cause an interrupt. It is the responsibility of the interrupt service routine for that device to process the various conditions.

The Interrupt Controller 30 (Fig. 5) is level-sensitive not edge- sensitive to interrupt requests from the internal devices to ensure that no interrupts are lost. The interrupt service routine must write to the appropriate control register in the device to clear the bit that caused the interrupt before re-enabling the interrupt system. The interrupt enable bit is cleared automatically by the hardware when interrupt processing begins.

A priority register is provided in the Interrupt Controller 30 so that software can load it with the minimum priority that can cause an interrupt. This allows higher priority interrupts to interrupt lower priority interrupt service routines that are in progress.

Single-cycle interrupt inhibits are automatically provided by the hardware when the following conditions occur:

1) jumps and subroutine calls, jumpxy and callxy

2) returns from interrupts and subroutines, ret, and enable interrupts, ei

3) the next instruction is s_xxx smemX, srb, which reads a scalar memory operand (smemδ, smemlδ or smem32), or s_xxx sra, smemX, which writes a scalar memory operand (smemδ, smemlδ or smem32)

4) the instruction (ICache b31..0) being decoded is a 64b extended instruction (which blocks instruction look-ahead), with both scalar register A and scalar register B being the immediate operand (imm)

5) s_test_scc since it can change the operation of the next 16b or 32b instruction immediately after it

6) v_test_vcc since it can change the operation of the next vector instruction immediately after it

Condition 3 is indicated by any instruction at the decode-next stage (ICache b63..32) that references a scalar memory operand (smem8, smemlβ or-+ smem32). The condition is: {[b31..28 (opcode modifier) = any value except Ch (call and jump)] AND [b 13..9 (scalar register A) = smemδ, smemlδ or smem32]} OR

{[b31..28 (opcode modifier) = any value except Ch (call and jump)] AND [b8..4 (scalar register B) = smemδ, smemlδ or smem32]}

This guarantees that this instruction and the one before it always execute without being split. Note that this instruction is in the "decode-next" stage of the instruction pipeline.

Condition 4 is indicated by any instruction at the decode stage (ICache b31..0) that has the immediate operand (imm) as scalar register A. The condition is:

{[b31..28 (opcode modifier) = any value except OxC (call and jump) or 9 (dual scalar) or OxB (parallel I scalar test)] AND [M3..9 (scalar register A) = imm]} AND [b8..4 (scalar register

B) = imm]}

This guarantees that this instruction and the one AFTER it always execute without being split. This protects a subsequent scalar memory operation. Note that this instruction is in the "decode" stage of the instruction pipeline.

Conditions 5 and 6 are indicated by any instruction at the decode-next stage (ICache b63 .32) that has instruction code: b31..28 (opcode modifier) = OxB (parallel | scalar test). This guarantees that this instruction and the one after it (the instruction that is being executed conditionally) always execute without being split.

The A436 Instruction Cache 28 allows a look-ahead to the next instruction, except when a 64b extended instruction is used. When such an instruction occurs and the immediate operand imm is referenced as scalar register A and ALU = s_move, it is conservatively assumed that the instruction generates a memory address that is used to reference a scalar memory operand. The combination of look-ahead and this assumption allow the Interrupt Controller 30 to ensure that scalar address / scalar memory reference instruction pairs are never broken. Notes:

1) Interrupts are inhibited by the Interrupt Controller during the very first three instructions of interrupt service routines (just in case for (1) push pc, (2) push PSW and (3) disable interrupt, to prevent interrupt service routine from being interrupted by the same pending interrupt request).

2) The second or third instruction in the interrupt service routine must be di (disable interrupt).

16. Handling of Predictable Pipeline Delays and Dependencies

The C compiler and assembly language programmer are responsible for ensuring that predictable events such as transfer of control, i.e., deferred jumps, calls and returns, are handled correctly. If no useful instructions can be placed immediately after a transfer of control, then one NOP (formerly one or two in the A236, depending on the instruction) must be used.

The C compiler and Assembler are responsible for detecting and avoiding potentially illegal combinations of memory references in a single instruction, such as simultaneous and conflicting scalar and vector memory references. Previously, the A236 hardware detected certain memory access conditions and stalled the hardware if necessary. More such conditions occur in the A436 than the A236 because all memory operations are handled by the Data Cache 11, the implementation of the stack operations is improved, and a stack-like address pointer register ap has been added.

In addition to transfers of control, the C compiler and assembly language programmer are responsible for handling pipeline delays in instruction execution. There are only a few of them because the A436 chip 10 uses a shallow instruction pipeline. These delays are:

1) parallel arithmetic units 22 operate one cycle behind th scalar arithmetic unit 26;

2) multipliers 22D in the parallel arithmetic units 22 are pipelined;

3) the motion estimation coprocessors 24 are pipelined;

4) the use of scalar processor broadcast registers 22F when referenced by the parallel arithmetic units 22; 5) placement of value in vindex and use for accessing windowed registers 22 A;

6) change of vector processor enable bits and use to control driving of the Data Cache 11 Data Bus; and

7) the sp and ap registers are not be explicitly modified at the same time they are being used by the push and call instructions, which modify them early ~ during the decode stage of the instruction pipeline, not the scalar execution stage.

Interprocessor communication is improved by the addition of a scalar (processor) broadcast register 22F in each parallel arithmetic unit 22. As a result, interrupt inhibits to protect interprocessor communication are no longer required since data placed in the scalar processor broadcast registers 22F will stay there indefinitely, like in any other addressable register.

The output of the Address Mux 26F in the scalar arithmetic unit 26 is always loaded into the address register for potential use the next instruction cycle. The address is held, along with the rest of the instruction pipeline, in the event of an ICache or DCache miss. Interrupts are inhibited for one cycle whenever an address computed by one instruction word may be required by the next instruction word to access memory for a scalar operand. It is not necessary to inhibit interrupts to access memory for a parallel operand since address calculation and memory reference are in the same instruction word.

Access to the Data Cache 11 can come from several stages in the Instruction Pipeline. The Assembler and C compiler are responsible for avoiding all potential conflicts and the hardware makes no attempts to resolve them.

Access to the Data Cache 11 occurs in the instruction pipeline as follows, with memory addressing as follows:

(A) fetch stage: no Data Cache 11 memory access initiated decode stage: (memory read), memory read for: (1) pop with address from sp, (2) pop_a with address from ap and (3) ret with address from sp; the memory read is done at the end of this cycle so the scalar operand read from memory can be used next cycle by the same instruction word; memory read for s_xxx smemX , srb, with address from imm if previous instruction word was s_move imm, srb, otherwise address is from scalar register B, which assumes that the previous instruction word was s_xxx sra , srb (sra not imm); the memory read is done at the end of this cycle so the scalar operand read from memory can be used next cycle by the same instruction word;

(B) scalar execution stage: (memory read or write), memory write for: (1) push with address from sp, (2) push_a with address from ap, (3) callxy_scc with address from sp, and (4) s_xxx sra , smemX with address from memory address register, which was loaded the previous cycle by the previous instruction word; the memory write is done at the end of this cycle; memory read for vector access, v_xxx vmX, vrb and v_load vm, vrb, with address from imm if paired (in same instruction word) with s_move imm, srb otherwise address is from scalar register B, which assumes that the paired instruction (in same instruction word) is s_xxx sra , srb (sra not imm); memory read is done at the end of this cycle so the parallel operand read from memory can be used next cycle;

(C) 1st vector execution stage: (memory write),

memory write for vector access, v_xxx vra , vmX and v_storeX vra , vm, with address from memory address register, which was loaded the previous cycle by the same instruction word; memory write is done at the end of this cycle;

2nd vector execution stage: no Data Cache 11 memory access initiated.

17. Inter-Processor Communication

The scalar arithmetic unit 26 can broadcast data to multiple parallel arithmetic units 22. This is useful for providing the same variables for processing and coordinating the operation of multiple parallel arithmetic units 22. Communication between the scalar arithmetic unit 26 and the parallel arithmetic units 22 can be handled via either memory or register-to-register transfers. Register-to-register transfers are the fastest because they avoid the possibility of Data Cache 11 misses. Register-to-register transfers are handled using the I/O instructions.

A new register, the scalar processor broadcast register 22F, is added to each parallel arithmetic unit 22. It operates as a general-purpose register when referenced by a parallel arithmetic unit 22. It can also be read and written by the scalar arithmetic unit 26 via the Scalar IO Bus. It can be loaded by the scalar arithmetic unit 26 regardless of the state of the vector processor enable bit of the parallel arithmetic unit 22 it is located in.

The scalar processor broadcast register 22F in each parallel arithmetic unit 22 can be loaded from the Scalar IO Bus in these ways by the scalar arithmetic unit:

1) the scalar processor register 22F in any single parallel arithmetic unit 22 can be selected and loaded;

2) all of the scalar processor broadcast registers 22F in any one row of the Parallel Processing Unit 20 can be selected and loaded with the same value;

3) all of the scalar processor broadcast registers 22F in any even pair of rows of the Parallel Processing Unit 20 can be selected and loaded with the same value;

4) all of the scalar processor broadcast registers 22F in the first or second four rows of the Parallel Processing Unit 20 can be selected and loaded with the same value;

5) all of the scalar processor broadcast registers 22F in any one column of the Parallel Processing Unit 20 can be selected and loaded with the same value;

6) all of the scalar processor broadcast registers 22F in any even pair of columns of the Parallel Processing Unit 20 can be selected and loaded with the same value; or

7) all of the scalar processor broadcast registers 22F in the Parallel Processing Unit 20 can be loaded with the same value. 18. Vector Processor Enable, Vector Zero and Processor ID Registers

The vector processor enable bits are important in the A436 chip 10 because the A436 chip 10 has 32 parallel arithmetic units 22 instead of the four parallel units found inthe earlier A236.. In addition, it is useful to distribute the vector zero flags to neighboring parallel arithmetic units 22 to coordinate the processing of multiple adjacent bytes of data, such as for chroma keying.

When an operation such as:

v_move_owps vra, vm

is performed, the vector processor enable bits determine which parallel arithmetic units 22 send data to memory. The programmer must ensure that no more than one parallel arithmetic unit 22 that is driving a given byte to memory has its vector processor enable bit set. Otherwise, conflict resolution logic only allows the parallel arithmetic unit 22 whose vector processor enable bit is set in the lowest-numbered row to drive a given byte to memory.

When an operation such as (note that the sequence of operands is different than above):

v_move_owp s vm , vrb

is performed, all parallel arithmetic units 22 whose vector processor enable bits are set will receive the memory data and store it in their respective vrb registers. The A436 chip 10, with its

32 parallel arithmetic units 22, can be made to operate like an A236 with its one row of four parallel arithmetic units 22 by setting the vector processor enable bits in only one row, row zero, of the Parallel Processing Unit 20.

The Vector Processor Enable Register and Vector Processor Zero Register are added to the System Device that is controlled by the Scalar IO Bus 25 (see Fig. 8). However, they do not have any dedicated storage of their own. The Vector Processor Enable Register reads and unconditionally writes the vector processor enable bit in each parallel arithmetic unit 22. It can also read all of the vector processor enable bits so the state of all vector processor enable bits can be saved or restored in a single operation. The Vector Processor Zero Register reads (only) the vector processor zero bit in each parallel arithmetic unit 22.

The System Device provides two status bits to the instruction decoder 16 for conditional branches. One bit (vad) is the NOR of all 32 vector processor enable bits, which is true if all vector processor enable bits are zero, i.e., if all vector processors/parallel arithmetic units 22 are disabled. The other (vaz) is the AND of ((NOT vector processor "N" enable) OR vector processor "N" zero) for N = 31..0, which is true if all active (enabled) vector processors/parallel arithmetic units 22 have their zero flags set. The latter is useful for simultaneously detecting that up to 32 entries in a 2D DCT matrix are zero to indicate "end of block", rather than checking all of the entries serially.

A Processor Identification (PID) register is also provided.. It is read-only and its contents are set at the time of chip manufacture.

19. I/O Instructions

The s_io instruction in the A236 is replaced by the in ioreg, srb and out sra, ioreg instructions. These instructions simplify instruction decoding since the I/O address, ioreg, is always located in the same bits (vector register A field = ioreg b9..5 and vector register B field = ioreg b4..0), and the direction of the transfer is specified by the choice of instruction. Flags are modified as for s_move.

The names of all I/O or "extended" registers are prefixed by an "x".

In addition, new capability is added to the IO instructions to allow programming of the I/O address. When ioreg = 0x3FF, i.e., when the 10b I/O address is at its maximum value, the 10 lsbs of the ap register are used as the IO address.

The instruction coding is:

20. Full-duplex Audio Port 32

A full duplex, stereo, programmable audio port 32 is added, as is a debug port 34 and a bit- programmable I/O port 36, to complement the serial bus port 38 and the UART (RS232) port 40. The audio port 32 can simultaneously send and receive digital audio serially. It can handle monaural or stereo data. Each data sample can have 4, 8 or 16 bits and can be in little-endian or big-endian format.

The audio port 32 has interrupt capability and uses programmed data transfers to service it. It provides two 8-byte (128b) buffers, one for input and one for output, for simultaneously sending and receiving audio samples. It is compatible with audio codecs such as the Crystal CS4215.

21. Bit-Programmable IO Port 36

The bit-programmable IO port 36 has 8 programmable bits and interrupt capability. Through control of the output circuit for each bit, each bit can be an input, an output or both. Interrupts can be programmed on neither, either or both edges of an input.

22. Parallel DMA Ports Improved The changes to the parallel ports are:

A new mode bit is added to a control register in each port so that either 8b or 16b data can be transferred to/from an external device. The number of address bits used in the Parallel DMA Ports is increased to 26 to provide 64 MB addressing. This reduces the number of bits in the "user byte" to six.

The number of bits in each of the frame and frame count registers in the Frame Count Register is increased from 4 to 8 to handle up to 256 frames or fields in a circular buffer.

A new control bit is added that determines whether or not data is stored/read when blanking signals are active. This affects whether or not data can be captured/output during the blanking interval.

23. Debug Device

Four built-in "logic analyzers" are provided to create an interrupt when a particular instruction, instruction address, memory operand or memory address is detected. They are controlled by extended registers. A bit-programmable mask register is provided for each test condition to select the bits to be tested, enabling ranges of values to be tested. When enabled, all four conditions are tested simultaneously and continuously. When any match is detected, information from the previous, current and next CPU cycles are stored to assist in debugging, and a nonmaskable interrupt occurs.

24. Reset

Only a few registers are cleared by Reset. All other registers, including extended registers for IO control, are initialized by software before they are used.

The following registers are cleared by Reset:

1) program counter 26B

2) valid bits (only) in Data Cache 11 and Instruction Cache 28

3) interrupt enable

4) "active" flags in Left and Right parallel DMA ports 14 and 18

5) vector processor enables 25. Power Management and Clocking

Referring to Figure 5, typically an external oscillator or 33 MHz third-overtone crystal 42 is used. An internal PLL (phase locked loop) 44 multiplies the frequency by 1, 2, 3 or 4 to provide the memory clock (SDRAM Clk). The resulting memory clock is divided by a power of 2 to obtain the CPU clock. The SDRAM configuration and memory clock parameters are set on the Host Port when the A436 Chip 10 is reset. These values and others are stored in an extended register in the memory interface and can be changed under software control to provide power management.

The operating modes are:

active - memory clock = 4 x input clock = 133 MHz and CPU clock = memory clock 1 2 - 66.5

MHz idle - memory clock = 33 MHz and CPU clock = 1 to 1/1024 the speed of the memory clock; this mode is typically used while waiting between image captures in a camera, and images may be captured and displayed in this mode because memory bandwidth is sufficient if the SDRAM is active; active mode can be entered in a few instruction cycles sleep - the clock PLL 44 is turned off, setting the power dissipation of the A436 chip 10 essentially to zero since static CMOS logic is used. An external signal is required to activate the

PLL 44 and about 1 ms is required for the crystal oscillator 42 and PLL 44 to resume normal operation.

In addition, the external SDRAM can be controlled. Clock Enable to the external SDRAM may be disabled to greatly reduce memory power. A software procedure is used to properly disable Clock Enable to begin saving power, and to re-enable Clock Enable when memory use is required again. The contents of the SDRAM will be maintained by automatic refresh in the SDRAM, but the contents cannot be accessed while Clock Enable is false.

C. Table of Sizes of Memory Operands Showing Coding of Register Fields

A worst case length of Size = 4 bytes for scalar memory accesses and Size = 16 bytes for vector memory references may be assumed by the Data Cache 11 Controller to determine whether or not a parallel operand spans two cache pages. The Data Cache Controller adds (Size - 1) to the address of each operand to get the address of the highest byte of the operand. A 64-byte Data Cache page boundary is crossed when the address 6 lsbs roll over from 0x3F.

The "Size in Bytes, Basic / Mem." column gives the number of bytes processed by a given parallel arithmetic unit and the total number of memory bytes accessed in the Data Cache. A single value is given if the two values are the same. The memory size is larger than the basic size for interleaved data formats, because only every other byte or word is used, and for convolution and motion estimation instructions, which provide multiple sets of staggered operands to the parallel arithmetic units.

NA = not applicable. X = don't care (the information in parenthesis specifies the use of the field).

It should be noted that the new load instructions have the vector register B value in its usual (A236) place. However, the new store instructions reverse the position of the vector register A and B fields so that there is a common format for the instruction unit to decode the row- and column-select codes. Thus, while the sequence of assembler mnemonics is unchanged, the binary pattern for the use of the vector register A and B fields is reversed for some of the instructions.

For the parallel processing unit 20 (formats of the register fields, with allowable memory references, are also shown, where register 3 gives vm and the default register formats are VRx):

or next or prev.

D. New Instructions

1. New One-dimensional Parallel Data Types

The new one-dimensional parallel data types are:

a) quad word interleaved signed (_qwis) - four alternating 16b signed words in memory b) octal word packed signed (_owps) - eight successive 16b signed words in memory c) octal byte interleaved signed (_obis) - eight alternating 8b signed bytes in memory d) octal byte interleaved unsigned (_obiu) - eight alternating 8b unsigned bytes in memory e) sixteen byte packed signed (_sbps) - sixteen successive 8b signed bytes in memory f) sixteen byte packed unsigned (_sbpu) - sixteen successive 8b unsigned bytes in memory g) sixteen byte compressed signed (_sbcs) - sixteen successive 8b bytes in memory that are passed in pairs to the parallel arithmetic units as though the data format were octal word packed signed h) thirty-two word packed signed (_twps) - thirty-two 16b words in registers, is the format for register-register (non-memory) operations in the parallel processing unit, cannot be moved to/from memory in a single cycle hence the "packed" modifier is not significant

The symbols, such as "_qwis", are used by the assembly-language programmer and C compiler during the code generation stage to specify the parallel data type desired, and by the assembler to select the proper instruction to process it.

2. Summary

The new classes of instructions are:

1) matrix- vector multiply - compute eight four-point sums-of-products simultaneously; when memory is referenced, one set of four operands is used for all computations

2) convolution - compute eight four-point sums-of-products simultaneously; operands from memory are fed in a staggered fashion to the rows of the parallel processing unit for computation

3) convolution-pair - compute eight four-point sums-of-products simultaneously; operands from memory are fed in a staggered fashion to each pair of rows of the parallel processing unit for computation; used for wavelets 4) pixel distance / motion estimation; operands from memory are fed in a staggered fashion to the rows of the parallel processing unit for computation

5) row- and column-specific load and store, provides matrix transpose capability

3. Table of New Multiply, Convolution and Motion Estimation Instruction Codes

For the convenience of the programmer, some of the instructions can be coded with either two or three operands. When only two operands are coded, the statement, "uses register 3", means that the FRy field is coded as "3". However, regardless of whether two or three operands are listed in the assembly code, the actual source of the .data is vm (memory), not

Fixed Register 3, when register 3 is selected.

The new instruction codes and their mnemonics are:

4. Multiplication Instructions (v_mvm_x , v_conv_x, v_conv2_x)

Three types of new instructions perform 32 multiplications and additions simultaneously:

1) matrix- vector multiply - compute eight four-point sums-of-products simultaneously from a single set of four operands from memory 2) convolution - compute eight four-point sums-of-products simultaneously from eight overlapping sets (four operands per set) of operands from memory

3) convolution-pair - compute eight four-point sums-of-products simultaneously from eight overlapping sets (four operands per set) of operands from memory, where pairs of convolutions use the same operands from memory; used for wavelets.

Two instructions are provided for each data type supported. One (v_mvm_x, v_conv_x and v_conv2_x) performs a sum-of-products, which forces a zero into the final adder rather than feeding the selected accumulator back into the adder. It is used as the first instruction in a series of multiply-adds. The other instruction (v_mvmadd_x, v_convadd_x and v_conv2add_x) performs a sum-of-products with accumulation, which feeds the selected accumulator back into the final adder. It is used for the second and subsequent instructions in a series of multiply-adds.

Use of the v__mul_xxx and v_muladd_xxx instructions is made where each multiplication and addition is done within a single parallel arithmetic unit 22. The instructions can only load ACO and do not bypass the register windows 22 A .

4. a Change in Use of Vector Register Addresses

The use of the vector register A address is modified from its usual form to select one-of-four "A" operands as an input to the multipliers and to select one-of-eight accumulators as a destination for the sum of products. Thus the instructions have three parameters, vra, vrb and ac, rather than the normal two, vra and vrb.

For these instructions the use of vector register A address is structured as: bl ..0 - operand selection 0 - FRO

1 - FR1 2 - FR2

3 - vm, memory (register address OxC), this is the default since the proper distribution of operands from memory to the parallel arithmetic units 22 requires that the operands be read from memory. b4.2 - accumulator selection

0 - ACO, 32b accumulator 0 (accumO) in VPO

1 - AC1, 32b accumulator 1 (accuml) in VPO

2 - AC 2, 32b accumulator 0 (accumO) in VPl

3 - AC3, 32b accumulator 1 (accuml) in VPl

4 - AC4, 32b accumulator 0 (accumO) in VP2

5 - AC5, 32b accumulator 1 (accuml) in VP2

6 - AC6, 32b accumulator 0 (accumO) in VP3

7 - AC7, 32b accumulator 1 (accuml) in VP3

The use of vector register B address is modified from its usual form to provide a linear 1-of- 32 selection of the registers in the windowed register bank 22 A. No special purpose or fixed registers can be accessed by the B address for these instructions.

For these instructions the Instruction Unit 16 forces the vector ALUs to do an "F = A" operation, the same as is done for a normal multiply instruction, e.g., v__mul_qwps. No updating of the windowed registers 22 A is performed. Only the selected accumulator (accumO or accuml) in one of the four parallel arithmetic units 22 in each row is updated.

The Scalar Register A, Scalar Register B and Scalar Opcode fields behave normally.

4,b Matrix- Vector Multiply Instructions

Four 8- or 16b operands are fetched from memory or thirty-two 16b operands are fetched from registers. When the operands are fetched from memory and 8b operands are used, each is converted to a 16b value by either sign-extension or padding with leading zeroes, depending upon whether the data type is signed or unsigned. Each operand is fed to all of the parallel arithmetic units 22 in a given column of the Parallel Processing Unit 20.

A four-point sum-of-products is computed in each row of the Parallel Processing Unit 20 across the four parallel arithmetic units 22 and stored in one of the two accumulators in one of the four parallel arithmetic units 22 in that row.

The Scalar Register A, Scalar Register B and Scalar Opcode fields behave normally. The uses of the Vector Register A, Vector Register B and Vector Opcode fields are modified.

The six types of 1-D parallel data structures of operands that are handled are:

<—

4- Ptr

<^•••• Ptr

Ptr

The 2-D parallel data types supported are: a) octal quad word packed signed (_oqwps) - for matrix- vector multiplication

b) octal quad word interleaved signed (_oqwis) - for matrix- vector multiplication

c) octal quad byte packed signed (_oqbps) - for matrix-vector multiplication

d) octal quad byte packed unsigned (_oqbpu) - for matrix-vector multiplication

e) octal quad byte interleaved signed (_oqbis) - for matrix- vector multiplication

f) octal quad byte interleaved unsigned ( oqbiu) - for matrix- vector multiplication

g) octal octal byte packed signed (_oobps) - for motion estimation

h) octal octal byte packed unsigned (_oobpu) - for motion estimation

i) octal octal byte interleaved signed (_oobis) - for motion estimation

j) octal octal byte interleaved unsigned (_oobiu) - for motion estimation

The "octal" modifier refers to the fact that each instruction operates upon a 2-D data structure with eight rows that share a common input, which is specified by the remainder of the name of the parallel data type. The "quad" modifier refers to the fact that each row of the data structure has four columns, i.e., four operands. Thus a total of 32, 16b coefficients, stored in the windowed registers 22A in the parallel arithmetic units 22, is required to fully utilize each instruction.

The one type of 2-D parallel data structure, which has 16b signed coefficients, that is handled by each instruction is:

Ptr

When octal quad byte packed data is used, four 8b operands from memory are fed to the parallel arithmetic units 22 (VP3..VP0) as follows, where M = S for sign-extension for signed operands or 0 for unsigned operands (some values are highlighted to emphasize the pattern):

When octal quad byte interleaved data is used, four 8b operands from memory are fed to the parallel arithmetic units (VP3..VP0) as follows, where M = S for sign-extension for signed operands or 0 for unsigned operands (some values are highlighted to emphasize the pattern):

When octal quad packed word data is used, four 16b operands from memory are fed to the parallel arithmetic units (VP3..VP0) as follows (some values are highlighted to emphasize the pattern):

When octal quad word interleaved data is used, four 16b operands from memory are fed to the parallel arithmetic units (VP3..VP0) as follows (some values are highlighted to emphasize the pattern):

4,c Convolution Instructions

Eleven 8b operands are fetched from memory. Each operand is converted to a 16b variable by either sign-extension or padding with leading zeroes, depending upon whether the data type is signed or unsigned. The operands are fed to the parallel arithmetic units in a staggered fashion to implement eight overlapping convolutions. Each convolution is stepped 1 byte from the one in the next lower row of the Parallel Processing Unit.

no A four-point sum-of-products is computed in each row of the Parallel Processing Unit 20 across the four parallel arithmetic units 22 and stored in one of the two accumulators in one of the four parallel arithmetic units 22 in that row.

The two types of 1-D parallel data structures of operands that are handled are (only the operands used by row 0 of the parallel processing unit are shown):

<- Ptr

The new two-dimensional parallel data types supported are: a) octal quad byte packed signed (_oqbps)

b) octal quad byte packed unsigned (_oqbpu)

4-

Eleven 8b operands from memory are fed to the parallel arithmetic units 22 (VP3..VP0) as follows, where M = S for sign-extension for signed operands or 0 for unsigned operands (some values are highlighted to emphasize the pattern):

4.d Convolution-Pair Instructions

These instructions are intended for use with, for example, wavelets, which need pairs of convolutions, or with convolutions that require 16b memory operands. Note that the maximum total size of memory operands that can be fetched in one cycle is sixteen bytes, which is the width of the Data Cache 11 and limits the number of convolutions that can be implemented by a single instruction when 16b memory operands are used.

Seven 8- or 16b operands, or thirteen 8b operands (not all used) are fetched from memory. Each 8b operand that is used is converted to a 16b variable by either sign-extension or padding with leading zeroes, depending upon whether the data type is signed or unsigned. Operands are fed to the parallel arithmetic units in a staggered fashion to implement eight overlapping convolutions. Each pair of convolutions is stepped 1 byte from the one in the next lower pair of rows of the Parallel Processing Unit 20.

A four-point sum-of-products is computed in each row of the Parallel Processing Unit 20 across the four parallel arithmetic units and stored in one of the two accumulators in one of the four parallel arithmetic units 22 in that row.

The Scalar Register A, Scalar Register B and Scalar Opcode fields behave normally The uses of the Vector Register A, Vector Register B and Vector Opcode fields are modified.

The five types of 1-D parallel data structures of operands that are handled are (only the operands for the first row are shown):

<— Prr

Ptr

The new two-dimensional parallel data types supported are: a) octal quad word packed signed (_oqwps)

b) octal quad byte packed signed (_oqbps)

c) octal quad byte packed unsigned (_oqbpu)

d) octal quad byte interleaved signed (_oqbis)

e) octal quad byte interleaved unsigned (_oqbiu) The "octal" modifier refers to the fact that each instruction operates upon a 2-D data structure with eight rows that share a common input, which is specified by the remainder of the name of the parallel data type. The "quad" modifier refers to the fact that each row of the data structure has four columns, i.e., four operands. Thus a total of 32, 16b coefficients, stored in the windowed registers in the parallel arithmetic units, is required to fully utilize each instruction.

When octal quad byte packed data is used, seven 8b operands from memory are fed to the parallel arithmetic units (VP3..VP0) as follows, where M = S for sign-extension for signed operands or 0 for unsigned operands (some values are highlighted to emphasize the pattern):

When octal quad byte interleaved data is used, seven 8b operands from memory out of a field of 13 bytes are fed to the parallel arithmetic units (VP3 VPO) as follows, where M = S for sign-extension for signed operands or 0 for unsigned operands (some values are highlighted to emphasize the pattern)

When octal quad word packed data is used, seven 16b operands from memory are fed to the parallel arithmetic units (VP3..VP0) as follows (some values are highlighted to emphasize the pattern):

5. New Motion Estimation Instructions (v_PixDi s tStaggered_x , v_PixDistLinear_x, v_PixBestStaggered_x and v_PixBestLinear_x)

The new, as compared to the A236 chip, v_PixDistStaggered_x and v PixBestS tagger ed_x instructions enable eight overlapping, 8-pixel, pixel distance calculations to be implemented simultaneously, processing a total of 64 pixels per instruction. These instructions enable the same pattern to be checked in eight overlapping locations simultaneously.

The new v_PixDistLinear_x and v_PixBest inear_x instructions enable eight 8- pixel, pixel distance calculations to be implemented simultaneously with the same eight memory operands, also processing a total of 64 pixels per instruction. These instructions enable eight patterns to be checked in the same location simultaneously.

In either case, with a 100 MHz CPU clock, data is processed at the rate 6.4 billion pixels per second, which takes the equivalent of more than 50,000 RISC MIPS.

Since multiple motion estimation calculations are performed simultaneously, a 16 x 16 window of pixels can be processed, on average, every 4 (four) CPU cycles. This is 64 times faster, on a CPU cycle-count basis, than conventional processors that are intended for image compression applications.

All of the parallel arithmetic units 22 in each row of the Parallel Processing Unit 20 cooperate to perform the motion estimation calculation. Thus the total number of registers available in each row of the Parallel Processing Unit 20 for storing search targets is 4 PAU's/row x 32 words/PAU = 128 words/row = 256 bytes/row.

These "staggered" instructions route data through the Parallel Processing Unit 20 like the convolution instructions do. The data that is read from memory is fed to successive motion estimation coprocessors 24 in a staggered fashion, advancing from one to the next by one byte to implement eight overlapping pixel distance calculations.

Every instruction, fifteen 8b operands are fetched from memory. The Non-aligned Data Cache 11 enables this fetch to be made at full CPU speed regardless of the address of the operands, i.e., regardless of the alignment of the operands in memory. If all eight motion estimation coprocessors 24 are searching for the same pattern, which is typical, then the address is stepped by 8 bytes every instruction. The v_PixDistY_x instruction is used for the first and most subsequent calculations for a given position of a window. The v_PixBestY_x instruction is used for the last calculation for a given position of a window except that v_PixBestY_x is also used if there is only one calculation for a given position of a window. Use of the v_PixBestY_x instructions updates the Pixel Best registers, which keep track of the best match.

The standard definition of pixel distance is used: i.e., the sum of the absolute values of the differences between sets of pixels. Note that the PixelBest calculation is increased from 8 bits to 12 bits to handle a larger search distance.

5. a Change in Use of Vector Register Addresses

For these instructions the use of vector register A address is structured as: bl..O - operand selection

0 - FRO (register address 0)

1 - FR1 (register address 1)

2 - FR2 (register address 2)

3 - vm, memory (register address OxC), this is the default since the proper distribution of operands from memory to the parallel arithmetic units requires that the operands be read from memory b4..2 - additional opcode bits

The use of vector register B address is modified from its usual form to provide a linear 1-of- 32 selection of the registers in the windowed register bank 22 A. No special purpose or fixed registers can be accessed by the B address.

The Scalar Register A, Scalar Register B and Scalar Opcode fields behave normally. The uses of the Vector Register A, Vector Register B and Vector Opcode fields are modified. 5,b Parallel Data Types Supported

The new parallel data types supported for motion estimation are: a) octal octal byte packed signed ( oobps)

b) octal octal byte packed unsigned (_oobpu)

The "octal" modifier refers to the fact that each instruction operates upon a 2-D data structure with eight rows that share a common input, which is specified by the remainder of the name of the parallel data type. The "octal" modifier refers to the fact that each row of the data structure has eight columns, i.e., eight operands. Thus a total of 64, 8b values, stored in the windowed registers 22 A in the parallel arithmetic units 22 as 32, 16b coefficients, is required to fully utilize each instruction.

The two types of 1-D 8b operands that are handled are (only the operands for the first row are shown):

<- Ptr

The one type of 2-D parallel data structure, which uses 8b search targets, that is handled by each instruction is:

Fifteen 8b operands from memory are fed to the parallel arithmetic units (VP3..VP0), which act as dual 9b (not 8b) processors to handle the sign bit properly, as follows (some values are highlighted to emphasize the pattern):

6. Row- and Column-Specific Load and Store Instructions

Parallel operands can be moved between memory and registers in the parallel arithmetic units 22 in row- and column-order. The instructions to this are in the Load and Store Groups. A matrix transpose operation can be implemented efficiently, moving up to 16 bytes between registers and memory in a single instruction.

Since specific rows and columns are selected by the instruction, it is not necessary to use the vector processor enable bits to select which parallel arithmetic units 22 are selected. However, the vector processor enable bit in a given parallel arithmetic unit should be true for data to be loaded into that unit, or for data from that unit to be stored into memory. 6. a Change in use of Vector Register Addresses

The vector register A and B address fields are reversed in the store instructions to simplify instruction decoding. In addition, the use of the vector register A and B addresses fields are modified for both load and store instructions to allow the specification of which row(s) or column(s) of the Parallel Processing Unit 20 is to be used.

For a memory read, or load, the Instruction Unit 16 forces the parallel arithmetic units 22 to use the "vm" register address as the vector register A address to read memory. The vector register A address field in the instruction word is redefined to select a particular form of row or column access, and to chose a particular row or column of the Parallel Processing Unit 20. The Instruction Unit 16 also forces the parallel arithmetic units 22 to use the vector ALU function "F = A" to move data from memory to any selected register (except memory).

For a memory write, or store, the usage of the vector register A and B address fields is reversed. In addition, the Instruction Unit 16 forces the parallel arithmetic units 22 to use the "vm" register address as the vector memory B address to write to memory. The vector register B address field in the instruction word is redefined to select a particular form of row or column access, and to chose a particular row or column of the Parallel Processing Unit 20. The Instruction Unit 16 also forces the parallel arithmetic units 22 to use the vector ALU function "F = A" to move data from any selected register (except memory) to memory.

6 b Parallel Data Types Supported

When an unsigned byte is loaded into a parallel arithmetic unit 22 from memory, it is given eight leading zeroes to convert it to a 16b value. When a signed byte is read from memory, it is given a sign extension of eight leading bits from bit 7 of the byte to convert it to a 16b two's complement (signed) value.

For memory read, or load, and write, or store, the six parallel data formats supported for row- specific access are: (1) quad word interleaved signed (_qwis) - select one row

(2) octal byte interleaved signed|unsigned (_obis and _obiu) - select two rows, rows N and N+l, where N = even

(3) octal word packed signed (_owps) - select two rows

(4) sixteen byte packed signed|unsigned (_sbps and _sbpu) - select four rows, rows N through N+3, where N = 0 or 4

For memory read, or load, and write, or store, the five parallel data formats supported for column-specific access are:

1) octal byte interleaved signed|unsigned (_obis and _obiu) - select one column

2) octal word packed signed (_owps) - select one column

3) sixteen byte packed signed|unsigned (_sbps and _sbps) - select two columns

The five new types of 1-D parallel data structures of operands that are handled by the Load and Store Instructions are:

Ptr

<- Ptr

- Ptr

7. Table of Load Instructions

The instructions are, where "r" means row and "c" means column:

8. Table of Store Instructions

NOTE: The Load instructions have the vector register B value in its usual place. However, the Store instructions reverse the positions of the vector register A and B fields so that there is a common format for the Instruction Unit 16 to decode the row- and column-select codes. Thus, while the sequence of assembler mnemonics is unchanged, the binary pattern for the use of the vector register A and B fields is reversed for the Store instructions.

The _trunc specification is the default and is optional.

The instructions are, where "r" means row and "c" means column:

Chapter 3 : Memory Access Patterns

A. Connection of Crossbar Data Bus 29 to the Parallel Arithmetic Units 22

Column 0 = all eight VPn,0's Column 1 = all eight VPn,l's Column 2 = all eight VPn,2's Column 3 = all eight VPn,3's

The parallel arithmetic units 22 operate with 16b precision except when motion estimation and sixteen byte compressed signed instructions are performed.

With the exception of motion estimation, on memory read an unsigned byte is converted to a 16b word by placing the byte in the 8 lsbs of the 16b word and padding with 8 leading 0's. A signed byte is converted to a 16b word by placing the byte in the 8 lsbs of the word and sign- extending the byte to 16 bits.

On memory writes using instructions with the "_sat" modifier, the 16b word is saturated according to the type of operand being written and the result is written to memory. If the data type is unsigned, then any negative value is converted to 0 and any positive value greater than OxFF is converted to OxFF. If the data type is signed, any value more negative than 0x80 is converted to 0x80 and any positive value greater than 0x7F is converted to 0x7F. When a write to memory is performed by instructions with either no modifier (the default) or the "_trunc" modifier, a 16b result is simply truncated to 8b.

It is pointed out that the Crossbar Switch 27 may actually be considered to be an "Interconnection Network" 27 and, in fact, forms a first level of a two-level interconnection heirarchy. The Interconnection Network 27 operates, in accordance with an aspect of this invention, to access a set of up to 16 contiguous bytes, starting with the specified starting address of the first byte, from the Data Cache 11. Some sub-set of 16 contiguous bytes may be accessed, on an instruction by instruction basis, in order to save power. The accessed set of 16 bytes, or the accessed subset (e.g., 8 bytes), may cross a cache page. A second level of the two-level interconnection heirarchy can be found in each parallel arithmetic unit 22, and is implemented with bus driver 22G, byte select logic 22H, and MUXes 221 and 22J. This logic selects, on a Data Cache 11 read (MUX 221), one or two bytes out of the field of 16 (or fewer) bytes appearing on the Crossbar Data Bus 29. In this manner, and by example, ability to direct different groups of operands to different members of its parallel processing units, depending upon the instruction. The MUX 22J is employed when writing data back to the Data Cache 11.

It should be noted that this logic need not be present in each of the parallel arithmetic units 22, but could be provided on a per-row basis. Also, and referring to Figure 8, it should be noted that the Crossbar Data Bus 29 need not be provided as four separate buses for fully interconnecting the n x m array of parallel arithmetic units 22, but could be provided as one read bus and one write bus, with suitable multiplexing logic being provided within each row, or even as a single read/write bus.

The foregoing circuitry and method of operation aids in realizing various objects and advantages of this invention. For example, it facilitates providing the technique for accessing multiple operands from the Data Cache 11, where all of the operands can be accessed in a single clock cycle, regardless of the address of the first of the multiple operands and regardless of the placement of the set of parallel operands within one or more cache pages. It also facilitates providing a digital signal processor device having instructions that feed one set of operands to a first group of parallel processing units, a second set of operands that contain a portion but not all of the first set of operands to a second group of parallel processing units, and so on, wherein but one set of operands may be fed to the first two groups of parallel processing units, a second set of operands that contain a portion but not all of the first set of operands are fed to third and fourth groups of parallel processing units, and so on. The disclosed circuitry and method of operation also facilitates providing a technique for writing only selected operands to memory from a set of parallel processing devices 22, where the connection of the parallel processing devices to memory can vary from one instruction to the next.

B. Memory Access Patterns for Load and Store

The following tables show all possible ways that the parallel arithmetic units 22 send data to the Data Cache. Some of the ways are specific to a few instructions, while others are used by multiple instructions. Note: 1) On a memory write that does not select a specific row or column of parallel arithmetic units, the Vector Processor Enable bits in the various parallel arithmetic units determine which parallel arithmetic 22 units drive the Data Cache Data Bus 25. A conflict arbitrator allows only the parallel arithmetic unit 22 in the lowest-numbered row to drive a given byte if multiple parallel arithmetic units are enabled. This statement is repeated with some of the tables, but applies regardless of whether the statement is repeated or not.

2) On a row- or column-memory write, Byte n, Bn, is read from the selected row(s) or column of parallel arithmetic units and sent to the Data Cache.

The following tables show all possible ways that the parallel arithmetic units receive data from the Data Cache 11. Some of the patterns are specific to a few instructions, while others are used by multiple instructions.

On a memory read, Byte n, Bn, is read from the Data Cache 11 and loaded into the one or more rows and columns of parallel arithmetic units 22 specified. When a byte is loaded into a parallel arithmetic unit 22, which has 16b precision, an unsigned operand is given 8 leading 0's, while the sign bit of a signed operand is extended into the 8 highest bits.

Note: Only parallel arithmetic units 22 whose Vector Processor Enable bits are set are loaded. This statement is repeated for emphasis with some of the tables, but always applies regardless of whether it is repeated or not.

1. Quad byte packed signed|unsigned (in A236) v_move_qbps I u vm, vrb is now implemented as Sixteen byte packed signed|unsigned

The quad byte packed signed|unsigned parallel data type in the A236 is now implemented as sixteen byte packed signed|unsigned.

2. Quad byte interleaved signed|unsigned (in A236) v_move_qbis | u vm , vrb is now implemented as Octal byte interleaved signed | unsigned

The quad byte interleaved signed|unsigned parallel data type in the A236 is now implemented as octal byte interleaved signed|unsigned. 3. Quad word packed signed (in A236) v_move_qwps vm, vrb is now implemented as Octal word packed signed

The quad word packed signed parallel data type in the A236 is now implemented as octal word packed signed.

4. Quad word interleaved signed v_move_qwis vm, vrb and v_move_qwis | u vra , vm

All parallel arithmetic units 22 whose vector processor enable bits are set are loaded.

5. Row-specific quad word interleaved signed v_loadr_qwis and v_storer_qwis

Note that a parallel arithmetic unit 22 is loaded only if its vector processor enable bit is set AND its row is selected by the instruction.

6. Octal byte interleaved signed|unsigned v_move_obis | u vm, vrb and v_move_obis | u (_sat | _trunc ) vra , vm

These operations are also performed for quad byte interleaved signed|unsigned.

All parallel arithmetic units 22 whose vector processor enable bits are set are loaded. M = S for signed, 0 for unsigned.

Byte data is saturated when it is stored according to whether it is signed or unsigned, or else it is truncated.

7. Octal word packed signed v_move_owps vm, vrb and v_move_owps vra , vm

These operations are also performed for quad word packed signed.

8. Row-specific octal byte interleaved signed|unsigned v_loadr_obis | u and v_storer_obis | u (_sat | _trunc )

M = S for signed, 0 for unsigned. "None" means that nothing is loaded in that position. Note that a parallel arithmetic unit 22 is loaded only if its vector processor enable bit is set AND its row is selected by the instruction.

This instruction is very useful for matrix transpose. N = even and is selected by the instruction. The instruction coding does not allow any other values to be selected.

Byte data is saturated when it is stored according to whether it is signed or unsigned, or truncated. N = even and is selected by the instruction. The instruction coding does not allow any other values to be selected.

9. Row- specific octal word packed signed v_loadr_owps

10. Sixteen byte packed signed|unsigned v_move_sbps | u vm, vrb and v_move_sbps | u vra , vm (_sat | _trunc )

These operations are also performed for quad byte packed signed|unsigned.

All parallel arithmetic units 22 whose vector processor enable bits are set are loaded. M = S for signed, 0 for unsigned (8 msbs = 0's). N = 0 AND 4.

N = both 0 and 4. Byte data is saturated when it is stored according to whether it is signed or unsigned, or else it is truncated.

11. Row-specific sixteen-byte packed signed|unsigned v_loadr_sbps | u and v_store_sbps | u (_sat | _trunc )

M = S for signed, 0 for unsigned. N = 0 or 4 and is selected by the instruction. The instruction coding does not allow any other values to be selected.

N = 0 or 4 and is selected by the instruction. The instruction coding does not allow any other values to be selected. Byte data is saturated when it is stored according to whether it is signed or unsigned, or else it is truncated.

12. Column-specific octal byte interleaved signed|unsigned v_loadc_obis | u and v_storec_obis | u (_sat | _trunc )

M = S for signed, 0 for unsigned. "None" means that nothing is loaded in that position. Note that a parallel arithmetic unit 22 is loaded only if its vector processor enable bit is set AND its column is selected by the instruction. This instruction is very useful for matrix transpose.

N = 3..0 and is selected by the instruction.

Byte data is saturated when it is stored according to whether it is signed or unsigned, or else it is truncated. This instruction is very useful for matrix transpose.

N = 3..0 and is selected by the instruction.

13. Column-specific octal word packed signed v_loadc_owps and v_storec_owps

"None" means that nothing is loaded in that position. Note that a parallel arithmetic unit 22 is loaded only if its vector processor enable bit is set AND its column is selected by the instruction.

This instruction is very useful for matrix transpose.

N = 3..0 and is selected by the instruction.

14. Column-specific sixteen byte packed signed |unsigned v_loadc_sbps | u and v_storec_sbps | u (_sat | _trunc )

M = S for signed, 0 for unsigned. "None" means that nothing is loaded in that position. Note that a parallel arithmetic unit 22 is loaded only if its vector processor enable bit is set AND its column is selected by the instruction.

This instruction is very useful for matrix transpose.

N = 2 or 0 and is selected by the instruction. The instruction coding does not allow any other values to be selected.

C. Memory Access Patterns for Multiplication and Pixel Distance

1. Convolution Octal quad byte packed signed|unsigned v_conv_oqbps | u

Data is not loaded into the parallel arithmetic units 22, but is provided for use in the computation of a convolution. M = S for signed, 0 for unsigned. Note that the address of the data advances by 1 byte every row. Certain bytes are highlighted to emphasize the fact that different sets of bytes are presented to each row of the Parallel Processing Unit.

2. Convolution-pair Octal quad word packed signed v_conv2_oqwps

Data is not loaded into the parallel arithmetic units 22, but is provided for use in the computation of a convolution. Note that the address of the data advances by 2 bytes every pair of rows Certain bytes are highlighted to emphasize the fact that different sets of bytes are presented to each pair of rows of the Parallel Processing Unit 20

3. Convolution-pair Octal quad byte packed signed|unsigned v_conv2_oqbps | u

Data is not loaded into the parallel arithmetic units 22, but is provided for use in the computation of a convolution. M = S for signed, 0 for unsigned. Note that the address of the data advances by 1 byte every pair of rows. Certain bytes are highlighted to emphasize the fact that different sets of bytes are presented to each pair of rows of the Parallel Processing Unit 20

4 Convolution-pair Octal quad byte interleaved signed|unsigned v_conv2_oqbis | u

Data is not loaded into the parallel arithmetic units 22, but is provided for use in the computation of a convolution. M = S for signed, 0 for unsigned Note that the address of the data advances by 2 bytes every pair of rows. Certain bytes are highlighted to emphasize the fact that different sets of bytes are presented to each pair of rows of the Parallel Processing Unit 20.

5. Motion Estimation Staggered Octal octal byte packed signed|unsigned v_PixDistStaggered_oobps | u and v_PixBestStaggered_oobps | u

Data is not loaded into the parallel arithmetic units 22, but is provided for use in the computation of motion estimation, which is a convolution-like operation. Note that the address of the data advances by 1 byte every row. Certain bytes are highlighted to emphasize the fact that different sets of bytes are presented to each row of the Parallel Processing Unit 20.

Note that two bytes are provided to each parallel arithmetic unit 22, unlike the usual format of providing only one for byte operands.

6. Motion Estimation Linear Octal octal byte packed signed|unsigned v_PixDistLinear_oobps | u and v_PixBestLinear_oobps | u

Data is not loaded into the parallel arithmetic units 22, but is provided for use in the computation of motion estimation. Note that the same data is fed to every row, unlike for the Staggered form of motion estimation. Certain bytes are highlighted to emphasize which bytes are presented to each row of the Parallel Processing Unit 20.

7. Motion Estimation Linear Octal octal byte interleaved signed|unsigned v_PixDist inear__oobis | u and v_PixBestLinear_oobis | u

Chapter 4: Instruction Pipeline

All instructions in the A436 chip 10 issue at the rate of one per CPU Clock. This is an improvement from the A236 chip, where Call and hardware interrupt took two instruction cycles to save a return address on the stack, and RET and RTI took two cycles to recover the return address from the stack.

DCache = The Data Cache 11, a 128b-wide sync. SRAM that, on the rising edge of CPU

Clock, receives an address and reads or writes the desired data.

ICache = The Instruction Cache 28, a 64b-wide sync. SRAM that, on the rising edge of

CPU Clock, receives an address and reads the desired data.

Length = 2, 4 or 8 bytes, of the instruction being decoded at ICache b31..0. The address of this instruction is given by the PC in the previous cycle.

PC = The Program Counter 26B (see Figure 7) gives the address of the previous instruction. SP = Stack Pointer. AP = Address Pointer (operation is similar to SP). A. Detailed description of operation of instruction pipeline including memory access and transfer of control

Fetch Stage (Stage 1 / SI)

Near the beginning of the cycle (as soon as Length is available from S2 from the previous instruction): Instruction Address = (PC + Length) — » ICache Controller unless there is a transfer of control, in which case Instruction Address = Transfer Address. Length can be 2, 4 or 8 bytes.

ICache tag logic is then checked asynchronously to verify that the instruction is in the ICache. If not, a miss occurs and the machine stalls until it is available. A 64b "extended" instruction can be on any 4-byte boundary and span two cache pages, so zero, one or two cache misses can occur on a single instruction fetch.

At the end of the cycle, PC is updated with the same value that was given to the ICache Controller earlier in this cycle. Reading the PC as a scalar register gives the value "Scalar PC", which is delayed one cycle from "Decode PC," which is always "PC + Length, regardless of a transfer of control. Scalar PC is automatically pushed onto the stack by Call, but must be explicitly saved by hardware-interrupt service routines with a Push pc , smem32, so that execution of the original program can continue upon return.

ICache b63..0 are loaded at the end of the cycle with the 64 bits read from the ICache at Instruction Address, which can be any 4-byte address (8-byte alignment not required).

Decode Stage (Stage 2 / S2)

ICache b31..0 = Current Instruction. Next Instruction b31..0 are at ICache b63..32 unless a 64b instruction is at ICache b63..0. The Current and Next Instructions are decoded to see if it is ok to allow a hardware interrupt to preempt the Next Instruction by sending the interrupt vector (jump address) to the ICache Controller instead of PC + Length. The Current Instruction is decoded to see if its Length is 2 or 8, otherwise it is 4. Length is immediately used by S 1 to address the next instruction and form Decode PC, and update PC at the end of the cycle. Rapid decoding is required since the value is needed to address the next instruction. Also, decoding of the parallel data type begins for control of the Crossbar 27, Data Cache 11 and parallel units 22.

At the end of the cycle: Current Instruction — » Scalar Instruction Register. In addition, a scalar memory read is issued if the Current Instruction requires the operand in Stage 3. With the exception of POP and RET, the read requires two instructions (one for address then one for data) and cannot be separated from the preceding scalar instruction, which provided the address, thus precluding a hardware interrupt at this time.

POP and RET send the Stack Pointer to the Data Cache Controller for a memory read. The Stack Pointer is updated without use of Scalar ALU because it is an up/down counter.

Scalar Execution Stage (Stage 3 /S3)

Using the contents of the Scalar Instruction Register, scalar registers are accessed, scalar ALU operation performed and, at the end of the cycle, the ALU result returned to a scalar register.

Early in the cycle, except as below, the non-updated scalar register B is sent to DCache Controller. It is used as the address for a vector memory read at the end of this cycle, or held and used next cycle for a vector memory write or a scalar memory write.

Scalar data sent to memory during this cycle is written at the end of this cycle using the address provided by the previous scalar instruction and held in the DCache Controller, except that Push and Call send the Stack Pointer (sp, which is updated without the ALU because it is an up/down counter) and data in this cycle. The use of ap is similar to sp.

The software tools avoid programmed conflicts in accessing memory. The tools insert a NOP if needed so a previously issued memory access can complete before a new one is issued. A second stage of decoding of the parallel data type controls the Crossbar 27, Data Cache 11 and parallel units 22 .

At the end of the cycle, vector portion of Scalar Instruction Register — > Vector Instruction Register.

Programmed transfer of control occurs here, giving a 1 -cycle delayed branch because the Next Instruction has already been fetched and is at ICache b31..0.

1st Vector Execution Stage (Stage 4 / S4)

Using the contents of the Vector Instruction Register (part of which is in each parallel arithmetic unit), vector registers are accessed, vector ALU operations performed and results returned to vector registers. The first stage of multiplication and motion estimation pipelines are executed.

Parallel memory data that was addressed in the previous cycle is available for reading. Parallel data sent to memory this cycle is written at the end of this cycle using the address computed the previous cycle and held in the DCache Controller for use this cycle.

2nd Vector Execution Stage (S5)

Any remaining control bits that were decoded from the vector portion of the instruction -» second (final) stages of multiplication and motion estimation pipelines.

Referring now to the Figures, Figure 11 is a chart showing an example of a hardware interrupt; Figure 12 is a chart showing an example of a jump; Figure 13 is a chart showing an example of a call; and Figure 14 is a chart showing an example of a return. Chapter 5: Programming Examples

A. Matrix- Vector Multiply

The A436 chip 10 can compute matrix-vector multiplies very quickly. Eight 4-point sums- of-products are computed by a single 32b instruction, including the fetching of packed or interleaved (alternating) parallel operands from memory regardless of their address (or alignment), and the conversion from 8b signed or unsigned format to 16b signed format.

Two versions of the matrix-vector multiply instruction are provided, v_mvm_x and v_mvmadd_x. The first instruction starts a sum-of-products, where there is no accumulated sum-of-products. The second instruction is for the second and subsequent sums-of-products, where there is an accumulated sum-of-products.

The matrices used in the example are:

[Pixel Matrix "Pxy"] X [Coefficient Matrix "Cxy"] = [Frequency Matrix "Fxy"]

The Pixel Matrix is stored in memory. When images are captured by one of the A436's Parallel DMA Ports, all rows of pixels are allocated the same size of line buffer by the Parallel DMA Port that loads the images into memory. The size of this line buffer is chosen to optimize the performance of the Data Cache 11.

The eight rows of parallel arithmetic units 22 in the Parallel Processing Unit 20 are shown vertically in the coefficient matrix and frequency matrix. The four columns of parallel arithmetic units 22 in the Parallel Processing Unit 20 are shown horizontally in the frequency matrix; each column is shown twice to reflect the fact that there is a total of eight accumulators in each row of the Parallel Processing Unit 20. ACx refers to the fact that accumulator "x" (AC7..AC0) in all of the eight (enabled) parallel arithmetic units 22 in an entire column of the Parallel Processing Unit 20 is loaded.

The following abbreviations are used: ACx = accumulator "x" (in each row of the Parallel Processing Unit), x = 7..0

DST = destination

FRx = Fixed Register "x" in each parallel arithmetic unit in the Parallel Processing Unit, x = 3..0 where FR3 give vm (memory access)

@SRx = memory location addressed by the contents of Scalar Register "x", x = 23d..0

SR8 - Scalar Register 8 is used for sake of example to hold the "stride", the offset from one group of four pixels to the next: 4 for packed bytes, 8 for interleaved bytes or packed words, or 16 for interleaved words. SRC = source, can be memory or a register in the Parallel Processing Unit. WRx = Windowed Register "x" in each parallel arithmetic unit in the Parallel Processing Unit, x = 3 ld..O

The following notation is used:

Note: The first four rows of results are shown using even-numbered accumulators so that the set of four accumulators "0" (ACO, AC2, AC4 and AC6) can be referred to by accumOL. The second four rows of results are shown using odd-numbered accumulators so that the set of four accumulators "1" (ACl, AC3, AC5 and AC7) can be referred to by accumlL. This is useful for bit-realignment of the results when bytes are multiplied by words.

Instruction 1 : v_mvm_x WRO , ACO s_add SR8 , SRO reads the first four pixels in the first row and multiplies them times the first four rows of coefficients, computing the first row of partial sums-of-products (S's) storing them in one of the accumulators, e.g., ACO, in each row of the Parallel Processing Unit 20:

I Stored in memory Row in Parallel Processing Unit Row in Parallel Processing Tlnif

0 1 2 3 4 5 6 7 0 1 2 3 4 5 6 7

Instruction 2: v_mvmadd_x WR1 , ACO s_add SR8 , SRO reads the second four pixels in the first row and multiplies them times the last four rows of coefficients, adding the results to the earlier partial sums-of-products, completing the first row of sums-of- products (F's) and storing them in ACO in each row of the Parallel Processing Unit 20:

Instruction 3 : v_mvm_x WRO , AC2 s_add SR8 , SRI reads the first four pixels in the second row and multiplies them times the first four rows of coefficients, computing the second row of partial sums-of-products (S's), storing them in another one of the accumulators, e.g., ACl in each row of the Parallel Processing Unit:

ET Stored in memory

Instruction 4: v_mvmadd_χ WR1 , AC2 s_add SR8 , SRI reads the second four pixels in the second row and multiplies them times the last four rows of coefficients, adding the results to the earlier partial sums-of-products, completing the second row of sums-of- products (F's) storing them in AC2 in each row of the Parallel Processing Unit 20:

Stored In memory

This process continues for an additional six pairs of v_mvm_x and v_mvmadd_x instructions to complete the first dimension of an 8 x 8 DCT:

v_mvm_x WR0,AC4 s_add SR8,SR2 ;row 3 (left 4) of pixels v_mvmadd_x WR1,AC4 s_add SR8,SR2 ; row 3 (right 4) of pixels v_mvm_x WR0,AC6 s add SR8,SR3 ;row 4 (left 4) of pixels v_mvmadd_x WR1,AC6 s_add SR8,SR3 ; row 4 (right 4) of pixels v_mvm_x WRO, Cl s_add SR8,SR4 ; row 5 (left 4) of pixels v_mvmadd_x WR1,AC1 s_add SR8,SR4 ; row 5 (right 4) of pixels v_mvm_x WRO , AC3 s_add SR8 , SR5 ; row 6 (left 4) of pixels v_mvmadd_x WR1,AC3 s_add SR8,SR5 ; row 6 (right 4) of pixels v_mvm_x WR0,AC5 s add SR8,SR6 ; row 7 (left 4) of pixels v_mvmadd_x WR1,AC5 s_add SR8,SR6 ; row 7 (right 4) of pixels v_mvm_x WR0,AC7 s add SR8,SR7 ; row 8 (left 4) of pixels v mvmadd x WR1,AC7 s add SR8,SR7 ; row 8 (right 4) of pixels B. 2-D Discrete Cosine Transform (DCT)

The A436 chip 10 can compute 2-D DCT's very quickly. Several optimized instructions are provided for this purpose. Input data in a variety of formats can be handled without requiring any additional computation time. Many (1,024 16b) windowed registers are available in the Parallel Processing Unit 20 to store the coefficients. Numerous (twenty 32b) registers are available in the scalar arithmetic unit 26 to store pointers and strides, or offsets.

The following is assumed: a) All registers have been initialized with coefficients, address pointers and strides. One copy of the coefficients is needed for the first dimension, requiring 2 of the 32 windowed registers 22 A. Eight copies of the coefficients are needed for the second transformation to avoid doing a matrix transpose, requiring 16 of the 32 windowed registers. If a matrix transpose were done, then no additional sets of coefficients would be required. b) An 8 x 8 matrix of pixels is read from an input image buffer. Each row of 8 pixels is read from a different series of locations in memory; typically, a Parallel DMA Port is programmed to allocate a particular number of bytes to each scan line of an image. c) A scalar register is allocated for each scan line as a pointer to the pixels to be read. d) One additional scalar register is allocated to store the horizontal "stride", which is 4 for packed bytes (pixels are adjacent to one another in memory) or 8 for interleaved bytes (pixels alternate with values that are not used at the moment). e) The first matrix transformation is computed, each sum-of-products is computed to 32 bits and stored in a 32b accumulator. f) The 64 results are moved from the 32b accumulators to two 16b "fixed" (not windowed) or temporary registers, F0 and FI. A bit-realignment is done as the data is moved so that the 16 "useful" msbs of each 32b result are kept for use in the second matrix transformation. g) The second matrix transformation is then computed, without needing to do an explicit matrix transpose or memory access as the calculations can be performed within the vector registers. Each sum-of-products is computed to 32 bits and stored in a 32b accumulator. The 16 msbs of the result are the 16 msbs of the accumulators, so no bit-realignment of the result is required.

The following instructions are used:

1. First matrix transformation, reading 8b operands from memory: "v_mvm_oqxyz WRi, ACj s_add SRm, SRn" then "v_mvmadd_oxyz

WRi , ACj s_add SRm, SRn" per row x 8 rows = 16 instructions, each of 8 results goes into a different accumulator, AC7..AC0, producing 64, 32b results, where " oqxyz" specifies the desired parallel data type: "o" specifies octal,

"q" specifies quad, "x" = "b" for byte or "w" for word (byte is assumed), "y" = "p" for packed or "i" for interleaved (as found in raw YUV or sequential RGB video data), and "z" = "s" for signed or "u" for unsigned (words are always signed).

2. Bit-realignment - if one of the inputs is a byte, not a word, then multiplying a 8b byte times a 16b word results in a shift of the binary point, which must be corrected; only two instructions are needed (one for all AC(even) and the other for all AC(odd)) to adjust ALL 64 results and move the results to temporary "fixed" (not windowed) registers, F0 and FI : "v_ror_#shif tright accumOL, F0" and "v_ror_#shiftright accumlL, FI".

3. Second matrix transformation, reading the 16b operands being transformed from "fixed" (not windowed) registers F0 and FI : "v_mvm_oqwps

F0 , WRi , ACj s_add SRm, SRn" then "v_mvmadd_oqwps

FI , WRi , ACj s_add SRm, SRn" per row x 8 rows = 16 instructions, each result goes into a different accumulator, producing 64, 32b results.

It should be noted that the total is only 34 instructions (340 ns at 100 MHz CPU clock) to compute an 8 x 8, 2-D DCT to 16b precision, including reading adjacent or alternating operands from memory, sign- or zero-extending byte operands to 16 bits, and an intermediate matrix transpose and bit re-alignment. Typically, the next step would be to quantize the results, which can be done while the results are still in registers. All 32 multipliers can be used independently to do this, rather than forming sums-of-products as was done above.

The first dimension of the DCT is shown in section A above. The second dimension the DCT is performed as follows using the registers of the parallel arithmetic units 22, rather than writing a transpose to memory. The matrices used in the example are:

[Transposed Frequency Matrix "Txy"] X [Coefficient Matrix "Dxy"] = [Final Frequency Matrix "Gxy"]

The following abbreviations are used:

ACx = accumulator "x", x = 7..0, in each row of the Parallel Processing Unit 20

DST = destination, an accumulator in each row of parallel arithmetic units 22 in the Parallel Processing Unit 20

FRx = Fixed Register "x", x = 2..0, in each parallel arithmetic unit 22 in the Parallel Processing Unit 20

SRC = source, a register in each parallel arithmetic unit 22 in the Parallel Processing Unit 20

WRx = Windowed Register "x", x = 31d..0, in each parallel arithmetic unit 22 in the Parallel Processing Unit 20

The following notation is used:

Instruction 1 : v_mvm_oqwps FRO , WRO , ACO reads the first four transposed F's (T's) in all rows and multiplies them times the first four rows of coefficients, which are the same in all rows, computing the first row of partial sums-of-products (S's) storing them in one of the accumulators, e.g., ACO, in each row of the Parallel Processing Unit 20:

Instruction 2: v_mvmadd_oqwps FR1 , WRI , ACO reads the second four transposed F's

(T's) in all rows and multiplies them times the second four rows of coefficients, which are the same in all rows, computing the first row of sums-of-products (G's) storing them in one of the accumulators, e.g., ACO, in each row of the Parallel Processing Unit 20:

Instruction 3 : v_mvm_oqwps FRO , WR2 , AC2 reads the first four transposed F's (T's) in all rows and multiplies them times the first four rows of coefficients, which are the same in all rows, computing the first row of partial sums-of-products (S's) storing them in one of the accumulators, e.g., AC2, in each row of the Parallel Processing Unit 20:

Instruction 4: v_mvmadd_oqwps FRl , WR3 , AC2 reads the second four transposed F's

(T's) in all rows and multiplies them times the second four rows of coefficients, which are the same in all rows, computing the first row of sums-of-products (G's) storing them in one of the accumulators, e.g., AC2, in each row of the Parallel Processing Unit 20:

This process continues for an additional six pairs of v_mvm_oqwps and v_mvmadd_oqwps instructions to complete the second dimension of an 8 x 8 DCT.

Note that sixteen windowed registers (WRI 5.. WRO) are used to redundantly store the coefficients. The order of these coefficients could be changed to place the results in registers in such a way as to make it easier to read out the results for presentation to a Huffman encoder:

v_mvm_oqwps FRO ,WR4 ,AC4 ; row 3 (left half) of pixels v_mvmadd_oqwps FR1,WR5,AC4 ; row 3 (right half) of pixels v_mvm_oqwps FRO ,WR6 ,AC6 ; row 4 (left half) of pixels v_mvmadd_oqwps FR1,WR7,AC6 ; row 4 (right half) of pixels v_mvm_oqwps FR0,WR8,AC1 ; row 5 (left half) of pixels v_mvmadd_oqwps WR9 , ACl ; row 5 (right half) of pixels v_mvm_oqwps FR1,WR10,AC3 ; row 6 (left half) of pixels v_mvmadd_oqwps WR11,AC3 ; row 6 (right half) of pixels v mvm oqwps FR0,WR12,AC5 ;row 7 (left half) of pixels v mvmadd_oqwps FR1,WR13,AC5 ; row 7 (right half) of pixels v_mvm_oqwps FR0,WR14,AC7 ; row 8 (left half) of pixels v mvmadd oqwps FR1,WR15,AC7 ; row 8 (right half) of pixels

C. Matrix Transpose

Matrix transpose is a common imaging operation. It is used in 2-D DCT's, 2-D convolutions, wavelets and elsewhere.

A variety of instructions is provided to move data between memory and the parallel arithmetic units 22. Data can be moved into specific rows or columns of parallel arithmetic units 22. As many as 16 bytes of data can be moved in a single instruction, matching the width of the non-aligned Data Cache 11.

Loads and stores can be performed on a variety of parallel data types, which allow bytes and words, and packed and interleaved (alternating) operands to be handled.

Data in memory can be loaded into registers in rows of the parallel arithmetic unit 22, then the data can be stored back into memory from columns of the parallel arithmetic unit 22. This forms a matrix transpose.

In addition, common imaging functions such as 2-D DCT's, which have an intermediate matrix transpose, can be implemented without any intermediate movement of data to memory to compute the transform. This can be done, as explained in the previous examples, by performing the first dimension of the transform by reading operands from memory, which broadcasts them in space, i.e., across multiple parallel arithmetic units 22, to the parallel arithmetic units that need them. The results of the first transform are then bit-aligned if necessary and moved to "fixed" (non-windowed) temporary registers, FR2..FR0. The second dimension of the DCT is then done by broadcasting those intermediate operands in time, i.e., in successive instructions, to the parallel arithmetic units that need them. D. Convolution

Convolution is a common imaging function. Multiple convolutions are implemented simultaneously. As many as eight convolutions can be performed simultaneously on byte data, or as many as four pairs of convolutions can be performed simultaneously on word data. Any size convolution can be performed. The most efficient sizes are a multiple of four; otherwise, unused coefficients can be set to zero. Two-dimensional convolutions can be performed by reading data from multiple rows of an image.

The instructions involved are: v_conv_x, v_convadd_x, v_conv2_x and v_c onv2 add_x.

Data from memory is fed to the parallel arithmetic units 22 in a staggered fashion to implement convolution. The fully non-aligned Data Cache 11 allows up to 16 bytes of operands to be fetched from memory at full CPU speed regardless of the address of the data.

E. Wavelet Processing

The computation of wavelets typically requires pairs of convolutions, one using 7 terms and the other using 9 terms. Multiple pairs of convolutions are implemented simultaneously. One performs a high-pass filter and the other a low-pass filter. As many as four pairs of convolutions can be performed simultaneously on word data. Any size convolution can be performed. The most efficient sizes are a multiple of four; otherwise, unused coefficients can be set to zero. Two-dimensional convolutions can be performed by reading data from multiple rows of an image.

The instructions involved are: v_conv2_x and v_conv2add_x.

Data from memory is automatically fed to the parallel arithmetic units 22 in a staggered fashion to implement convolution. The fully non-aligned Data Cache 11 allows up to 16 bytes of operands to be fetched from memory at full CPU speed regardless of the address of the data. Decimation of the filtered results can be performed using a combination of "packed" and "interleaved" parallel data types.

F. Motion Estimation / Pattern Recognition

Motion estimation and pattern recognition are common imaging functions. Motion estimation is a computationally intensive function in most forms of video compression to reduce temporal redundancy from the data.

Two classes of motion estimation instructions are provided, Staggered and Linear. The Staggered instructions check eight overlapping (staggered) sets of eight pixels each stored in memory against eight sets of pixels stored in registers. The Linear instructions check the same eight pixels stored in memory against eight sets of pixels stored in registers. The Staggered instructions enable one to simultaneously look for the same set of pixels in eight overlapping locations in memory, while the Linear instructions enable one to simultaneously look for eight sets of pixels in the same locations in memory.

A pair of instructions, v_PixDistY_x and v_PixBestY_x, performs this task very efficiently (Y = Staggered or Linear). Since multiple motion estimation calculations are performed in parallel, an entire 16 x 16 block can be handled in an average of only four CPU cycles (total of 40 ns @ 100MHz CPU clock).

The A436 chip 10 can implement any size of window from 1 x 8 pixels to 16 x 16 pixels in multiples of 1 x 8 pixels. Eight of these windows can be implemented simultaneously on overlapping sets of data from memory. The fully non-aligned Data Cache 11 allows the many bytes of operands to be fetched from memory at full CPU speed regardless of the address of the data.

The A436 chip 10 can provide the equivalent of more than 50,000 MIPS when using a 100 MHz CPU clock. Even so, the use of a hierarchical technique is recommended to reduce the number of calculations required. These hierarchical techniques can be implemented easily using a combination of the A436 chip 10 packed and interleaved parallel data types. G. Handling YUV and Sequential RGB Video Data

These common types of image data present alternating patterns in memory, where YUV 4:2:2 is commonly used with composite video data and sequential RGB is commonly obtained from color digital image sensors (details may vary from sensor to sensor, video encoder to encoder, and video decoder to decoder) as shown in Figure 10.

The A436 chip 10 "interleaved" parallel data types enable data to be directly read from memory or written to memory in these formats. For example, multiple red pixels can be read in a single instruction, ignoring the green pixels.

In addition, data can be read in an interleaved format and written in a packed format, or vice-versa. In addition, signed/unsigned format conversions and computations- can be performed upon the data at the same time that it is read in these formats.

H. Chroma Keying

The A436 chip 10 can perform 32 chroma keying operations simultaneously.

In chroma keying, one tests the color of a pixel and passes one value or another to the output based upon the color of the pixel tested. When YUV 4:2:2 video data is used, the color information (U and V) is spread across two pixels, which typically occupy four bytes of memory, as shown above.

If each byte is handled by a different parallel arithmetic unit 22, it is necessary to combine the results from multiple calculations to determine whether or not the color condition is satisfied. It is also necessary to communicate that result to multiple parallel arithmetic units 22 so they can work in unison to pass one pixel or another to the output.

Once a test is performed by the parallel arithmetic units 22, the scalar arithmetic unit 26 can read the vector zero register and broadcast it to the parallel arithmetic units 22 that need it. Since there are 32 parallel arithmetic units 22, the lower 16 bits of the vector zero register can be broadcast to the scalar broadcast registers 22F in the first 16 parallel arithmetic units, and the upper 16 bits of the vector zero register can be broadcast to the scalar broadcast registers in the final 16 parallel arithmetic units 22.

Each parallel arithmetic unit 22 then selects the bit or bits that it needs from its scalar broadcast register 22F, reflecting the state of its neighbors, and performs the keying.

The v_test_vcc instruction can be used to select which pixel is passed to the output by each parallel arithmetic unit 22.

I. Erosion and Dilation

In erosion and dilation, a 2-D array of binary pixels is tested. The center pixel may or may not be included in the test. The value of a new center pixel is determined based upon the pattern found in its neighbors.

The A436 chip 10 can compute 32 erosion and dilation functions in parallel. The parallel table lookup capability in each parallel arithmetic unit 22 is used with the local shift- specification capability to give a non-linear transformation between a 3 x 3 window of binary pixels and the resulting pixel.

Each parallel arithmetic unit 22 has the 32, 16b windowed registers 22 A, for a total of 512 bits or 2^Λ9. The vector index capability enables each parallel arithmetic unit 22 to compute a 5b address and use it to index into its 32 windowed registers 22 A. Thus, two 8b patterns or one 9b pattern can be tested by each parallel arithmetic unit 22.

This indexing capability is combined with the ability of each parallel arithmetic unit 22 to set the shift of its own barrel shifter 22E (see Figure 9). Thus 2-D addressing into a large array of bits can be done by each parallel arithmetic unit 22 by selecting a word then selecting a bit in the selected word.

First, the tables are initialized. Then, in each parallel arithmetic unit 22, the pixels of interest are combined into a binary number. This can be done by each parallel arithmetic unit 22 via a series of multiply-adds using binary weights.

A maximum of 16 bytes or pixels can be fetched from memory and used as operands by the multipliers 22D at one time. To achieve maximum performance, two sets of overlapping rows can be processed at the same time. The first four rows of the Parallel Processing Unit 20 would use one set of weights for first three rows of the image with the fourth row using weights of zero. The second four rows of the Parallel Processing Unit 20 would use zeroes for the weights for the first row of the image while a second set of weights would be used for the next three rows of the image.

Within a window for a valid range of pixels, the first pixel is given a weight of 1, the next is given a weight of 2, and so on. This value is used by each parallel arithmetic unit 22 to index into its windowed registers 22 A. Since there are 32 windowed registers 22 A in each parallel arithmetic unit 22, only the 4 lsbs of the value are used if 8 pixels are tested, or the 5 lsbs are used if 9 pixels are tested.

Then, the index is right-shifted by 5 bits and the result loaded into the vector index register. A vector rotate is then performed using the value in each parallel arithmetic unit to specify the amount of the shift. Bit 0 of the result is the desired center pixel.

J. Use of High Frame Rates to Improve Image Quality

The quality of images can be improved by acquiring multiple images in quick succession and combining multiple images into one. This technique can be applied to both "still" and continuous, or video, imaging when one or more image sensors are connected to an A436 chip 10, which controls them and receives data from them.

The high image acquisition and processing rate of the A436 chip 10 allows multiple images to be captured in very quick succession. The rate is limited only by the speed with which all or part of each image can be read from the image sensor or sensors that provides them. Parameters that control the acquisition of each image can be varied from one image to the next by the A436 chip 10 to provide differences in image quality from one image to the next.

For example, it is difficult to obtain high quality images when there is a wide variation from the darkest to the brightest portions of an image. This condition occurs when the subject is in the shade on a bright sunny day, at night when there are only point sources of light that do not uniformly illuminate the scene, or indoors when the subject is partially illuminated by sunlight through a window and partially by interior lighting. Such wide variations in brightness can span several to many orders of magnitude, making it impossible to accurately capture images with any film or image sensor.

This illumination problem can be solved by changing the integration period of the image sensor from one frame to the next, rapidly acquiring a series of images then combining the images into one. Each resulting pixel can be computed using a scale factor that depends upon the integration period of a given image, and by compressing the dynamic range of each pixel using a logarithmic function.

In order to obtain a true measure of the brightness of each pixel and thereby minimize the saturation of the brightness of a pixel due to its high brightness, the brightness of a given pixel from one exposure setting to the next can be compared to see if the value of the pixel scales linearly with the exposure period. If not, the pixel is either too bright, in which case the value is a maximum, or too dim, in which case the value is minimum. Since lenses produce flare, which is particularly objectionable and results in a bright region falsely illuminating a nearby region, the exposure at which a given pixel obtains it maximum value can be used to indicate the degree to which the apparent brightness of itself and its neighbors should be reduced.

Especially when image sensors have particularly high resolution, in which case it will take a relatively long period of time to read out the image, limiting the frame rate, the integration period for the sensor can be set to a nominal value, an image obtained, and the average brightness and brightness distribution can be computed. The sensor can then be commanded to output only those portions of the image where the brightness is particularly high or low. The integration period of the sensor can be modified and a series of partial images obtained, each with a different integration period. Those partial images can then be combined with the original full image, and the dynamic range of each pixel compressed, to obtain a better image than could be obtained by any single complete image.

Chapter 6: Software Development Tools

The following software development tools support the Ax36 family of video DSP chips 10:

1) ANSI-standard, parallel-enhanced macro C compiler, which produces an assembly-language output that is assembled into the object code, this assembly language output enables one to assess the code produced by the compiler and modify it if desired

2) macro assembler, assembles code written in A436 assembly language or produced by the C compiler

3) linker

4) loader

5) simulator (cycle-accurate)

6) debugger

7) device driver for Ax36 evaluation boards

8) download test program

9) effects controller test program

Chapter 7: Development of Reusable Code

The architecture of the A436 chip 10 supports the development of reusable application code and operating systems for the A436 by third party code developers. The code developers can provide relocatable object code modules, rather than source code, to systems companies who can combine this code with their own code and code from others to produce systems. This process is facilitated by the A436 chip 10 in at least three ways:

1) Jump and call addresses can be absolute or relative to the program counter. Absolute program addresses can be used for systems calls, and relative program addresses can be used within a code module. The object code module can thus be placed anywhere in memory without having to be relinked, and internal addresses can be kept private. Modules should be located on 64-byte boundaries.

2) An address pointer register ap is provided that assists in the allocation of memory. It can be used as a second stack pointer sp, but one which can be freely manipulated by a given code segment.

3) The A436 Linker allows multiple independently compiled or assembled object code modules to be combined into a single program.

A uniform method by which code modules from multiple sources can be combined into a single program is being defined.

Chapter 8: Built-in Debug Aides

To help detect erroneous program flow, illegal instructions are detected when:

1) a jump or call is made into the middle of a 32b instruction that is not in dual scalar format

2) an attempt is made to execute an invalid opcode

When this occurs:

1) the address of the first illegal opcode is immediately stored in an extended register

2) the corresponding instruction word is immediately stored in an extended register 3) the detection is indicated by setting a bit in an extended register (not cleared by reset) so that a second interrupt will be blocked; this bit can be cleared by software

4) a non-maskable interrupt request is made if the priority of the interrupt-on- invalid instruction is enabled

Illegal opcodes fall into the following groups:

1) unused opcodes in the cache and I/O group

2) unused opcodes in the v_mvm group

3) unused opcodes in the v_mvmadd group

4) unused opcodes in the Load and Store group

5) invalid scalar register B address is selected for scalar ALU 26D operations

6) invalid vector register B address is selected for vector ALU 22C operations

In addition, the Sync pin is asserted on a cycle-by-cycle basis when any of the following conditions occur, an interrupt request is made on the Debug interrupt channel, and any further interrupt requests on the Debug interrupt channel are blocked until re-enabled by software:

1. masked program address matches the target value, as indicated by PC Target Extended Register, PC and PC Mask Extended Register, where match = all bits "n", n = 25..1, true for [(PC(n) AND Mask(n)) XNOR Target(n)].

2. masked instruction matches the target value, as indicated by Instruction Target Extended Register, Instruction Cache b31..0 and Instruction Mask Extended Register, where match = all bits "n", n = 31..0, true for [(ICache(n) AND Mask(n)) XNOR Target(n)].

3. masked memory address matches the target value, as indicated by Memory Address Target Extended Register, Data Cache Memory Address and Memory Address Mask Extended Register, where match = all bits "n", n = 25..0, true for [(Data Cache memory address(n) AND Mask(n)) XNOR Target(n)]. 4. masked memory data matches the target value, as indicated by Memory Data

Target Extended Register, Cache b31..0 and Memory Data Mask Extended Register, where match = all bits "n", n = 31..0, true for [(Data Cache(n) AND Mask(n)) XNOR Target(n)].

Applications:

The ensuing description provides a number of applications using the power of the A436 chip 10. While disclosed in the context of the A436 chip, it should be realized that for at least some of these applications the inventor's earlier A236 and A336 chips may be used as well. It should be further realized that the following applications are exemplary, and are not intended to be read in a limiting sense upon the practice of the teachings of this invention.

In general, this portion of the specification describes a family of CMOS monochrome optical image sensors for use in a family of small, very low cost, low power, optical fingerprint sensors. An overview of a family of fingerprint sensors using these image sensors is also provided.

These fingerprint sensors result from an optimization of the total system design, including optical design, processor chip design, sensor chip design, algorithms, human interface, manufacturing and test, of the fingerprint sensors. They use a CMOS image sensor that works together with a miniature optical and illumination system and an image processor chip (embodied in the A436 Chip 10) that is optimized for battery-powered fingerprint and video/image compression applications.

Unlike other types of optical fingerprint sensors, this family of fingerprint sensors is specifically designed to be rugged, very low cost, light in weight, and to have low power dissipation for operation from batteries. The sensors are intended for being embedded into small devices such as door locks and power tools and operate autonomously without the need for any external computer power. In addition, these sensors will have very fast operation, be available in several configurations and form factors to provide the optimum human interface in a wide range of applications, and be able to operate out of doors. To provide maximum security, they can verify the use of live fingerprints and have the ability to work with adults and children.

Power management is very important in the design of these sensors because they are intended for use in battery-powered systems. Pixel-monitoring capability is provided in the image sensors so that software can run on an A436 image processor to adaptively set parameters in the sensor. This continuously compensates for changes in battery performance, optimizes image quality and minimizes the energy used to operate than external illumination system, which is controlled by the image sensor. In addition, a sleep mode is provided to give virtually zero quiescent power dissipation for long battery life.

Three families of fingerprint sensors are provided to optimize the cost, human interface and the amount of space required to build the sensor into a system. The "moving finger sensor" enables a series of partially overlapping images to be acquired and combined into a composite fingerprint image as a finger is rapidly moved over the sensor. A long, very narrow image sensor is used with a 1 : 1 optical system, providing a shallow implementation for use where cost and space are at a premium. The "static partial finger sensor" also has a shallow implementation. It captures a larger but still narrow slice (nominally 0.6" long x 0.1" wide) of a static fingerprint image with a 1 : 1 optical system. It is intended for use in portable systems such as power tools where little space is available and only moderate security is required. The "static (whole) finger sensor" acquires an entire fingerprint image (nominally 0.6" square) at one time without any motion of the finger. A small rectangular image sensor is used with approximately a 5: 1 optical system. It is intended for use in high security systems such as door locks.

By exploiting the image processing power of the A436 Chip 10 and algorithms developed for it, the entire system design of the fingerprint sensor can be optimized to reduce size and cost, and to improve reliability. For example, rather than tightly controlling the individual components and assembly of the optics and illumination system, which would be expensive, each fingerprint sensor can be tested at the time of manufacture. It can store measured illumination values in a non-volatile memory, which can be used to compensate live fingerprint images for the particular illumination pattern obtained. In addition, rather than building a fragile and expensive glass optical system with many components, a novel one-piece molded plastic compound lens-and-prism has been designed. This one-piece design provides low cost, rugged operation, self-alignment and self-compensation for thermal expansion/contraction. And, to provide maximum security, these sensors support a method of detecting the use of a "live" finger so that a rubber or plastic replica of a fingerprint cannot be used to operate the system. Furthermore, the family of sensors can be built with sufficient spatial resolution of the fingerprint images so that children's' small fingerprints can be resolved adequately.

Finally, to facilitate the building of fingerprint sensors into doorknobs and other curved objects, I have also devised a way to provide a curved surface (see Fig. 16B) rather than the normal flat surface (Fig. 16 A), that one touches with his/her finger.

Note: To avoid confusion, "fingerprint sensor" refers to the entire fingerprint capture unit while "image sensor" refers to the CMOS image sensor within the fingerprint sensor.

Fingerprint Sensor Architecture

This section gives a top-level overview of the optical design of the entire fingerprint sensor. The first part of the discussion focuses on a sensor with a flat surface that one's finger touches, while the second part of the discussion focuses on a sensor with a curved surface that one's finger touches. The third part of the discussion focuses on a method for verifying that a live finger, rather than a rubber or plastic replica, is being used with the sensor.

Basic fingerprint sensor with flat sensing surface

For ruggedness and low cost, a one-piece, scratch-resistant, molded red plastic, compound optical element will be used. There are three parts to this optical element, an input lens, a prism and an output lens. The input lens is an aspheric lens on one leg of the prism that collects light from LED's and images it onto its flat, top surface where a finger is placed. Light from the LED's is reflected from the top surface according to the critical angle out the second leg of the prism into a second aspheric lens, the output lens. The output lens images the reflected light onto the image sensor. A self-aligning and temperature- compensating design is used to handle the much higher coefficient of thermal expansion of plastic compared to glass, from which typical optical fingerprint sensors are made.

When no finger is present, the target brightness variation across the image sensor is less than 2: 1 from the brightest spot to the darkest spot. The presence of a finger on the prism reduces the amount of light that falls upon the image sensor by absorbing light wherever a ridge of a fingerprint touches the prism.

The light path in the fingerprint sensor is:

1) Illumination system: Multiple strobed surface-emitting, high brightness, high efficiency red LED's in a one- or two-dimensional array (depends upon type of image sensor) provide a spatially non-uniform but relatively narrowband light source that feeds into a diffuser.

2) Light from the diffuser passes through the input lens of the compound optical element, reflects off the top surface of the prism according to the critical angle and passes through the output lens to the image sensor.

3) A CMOS image sensor acquires the fingerprint image.

There are two dot pitches for the fingerprint sensor (pitch is referenced to the finger):

1) Standard Resolution - 500 dpi for adult use

2) Increased Resolution - up to 1,000 dpi or more for use by both children and adults

Three types of fingerprint sensors with two human interfaces are supported from two basic configurations of the fingerprint sensor:

1) Static (whole) finger sensor - The entire image of a fingerprint is captured at one time when the finger is placed upon the flat, top surface of the prism. For minimum size and cost of the image sensor, a reducing optical system can be used in both X and Y dimensions of the image sensor. A prism directs light from the LED's through a first aspherical lens to the finger and from there through a second aspherical lens to the image sensor. Along the length of the prism the length of the second lens gives an optical reduction of 5: 1 to produce a pixel that is 10 um long. Across the width of the prism the triangular shape of the prism produces an optical reduction of 1.4: 1. The width of the second lens gives an additional optical reduction of 4: 1 to produce a net optical reduction of 5.6, producing a pixel that is 8.9 um wide. The number of rows and columns in the sensor is approx. 300 to capture a 0.6" x 0.6" image of the fingerprint at 500 dpi. The nominal size of the photodiode array is thus 300 x 10 um long x 300 x 8.9 um wide = 3.0 mm long x 2.7 mm wide = 8.1 sq. mm.

2) Moving finger sensor - A finger is swept over the fingerprint sensor to capture an entire fingerprint image. For minimum depth of the optical system, a 1 : 1 optical system is used along the length of the prism and image sensor. The physical length of the photodiode array in the image sensor is thus the same as the width of a finger, 600 mils. Thus the pixel pitch across the width of the finger, i.e., along the length of the photodiode array, is 2 mils to give 500 dpi sampling. The number of rows in the image sensor is nominally ten. An optical reduction of 1.4: 1 across the width of the prism produces a row pitch of 2/1.4 = 1.4 mils. The number of columns in the image sensor is approx. 300 to capture a 0.6" wide fingerprint. The nominal size of the photodiode array is thus 15 mm long x 0.35 mm wide = 5.25 sq. mm.

3) Static partial finger sensor - Some applications, such as power tools, do not require as high an accuracy as security applications, and have limited space. In this case an image of a single band through the center or "core" of the fingerprint may suffice. These applications can use a long narrow sensor and 1 : 1 optical system like the one for the moving finger sensor, but with more rows in the image sensor to capture a somewhat larger image so that the finger can be statically placed on it.

Fingerprint sensor with curved sensing surface Background: It is very important to design everyday objects such as door locks in such a way that they can be used easily. The same is true of guns, which are everyday tools to some people such as peace officers. When electronics are added to such everyday devices, the new device should offer increased utility and convenience compared to the one it replaces. One should not have to stop and think about how to use something that one has used previously and is familiar with. The "human factors" of such everyday objects are thus very important.

For example, a person who feels threatened will wish to enter the safety of their home quickly. When the technology described herein is built into a doorlock on a door to the home, one could enter one's home quickly and simply by grasping the doorknob with one hand, turning the doorknob, pushing open the door that was locked only a moment ago, and entering the home. The door would then immediately become locked again to keep out the perceived threat. Using the technology described herein one would not need a physical key or a combination to enter the home..

To enter the locked home, one would simply place one of more fingers upon a fingerprint sensor within a doorknob. This doorknob, when used in locales that are bitterly cold in winter, would have a low thermal mass so that one's hand does not stick to it. A fingerprint sensor with a curved surface and low thermal mass, as described herein, is built into the side of a doorknob. The fingerprint images from this sensor are captured and processed by a very fast but low power and low cost image processing chip (A436 chip 10) that is adapted for rapid fingerprint acquisition and processing. As a result, the owner is enabled to enter through the locked door in a fraction of a second, even in the rain, in the cold dryness of winter and in the heat and humidity of summer.

Implementation

Traditionally, optical fingerprint sensors use a right-angle prism that is about one inch long. The largest surface is typically about the size of a fingertip. For purposes of explanation, imagine that the prism is inverted, resting on the edge that joins the two sides that are at right angles to one another. A broad beam of light is shown on the entirety of one of the two right-angle sides, the entrance side, nominally at right angles to that side to maximize the amount of light that enters the prism and to control the angle, the angle of incidence, at which it strikes the top or reflecting surface of the prism. (All angles are measured from the normal to the surface, measured inside the prism.) This beam illuminates the top flat surface (the hypotenuse of the triangle) of the prism from within the prism.

A finger may be placed upon the flat top surface of the prism, absorbing light where the ridges of the finger touch the prism. Using the principle of total internal reflection, light that is not absorbed by a finger or other object in contact with the prism is reflected from the flat surface of the prism and exits the other right-angle side, the exit side, of the prism. External to the prism, this reflected light is then imaged by one or more lenses upon an image sensor having a two-dimensional array of light sensitive devices, producing a signal that is manipulated electronically by a digital signal processor or like device to determine whose finger is upon the prism.

Often, the top surface of the prism contains a thin transparent electrode that is split into two parts. The objective is to have a finger complete an electrical circuit when the finger touches these two electrodes, signaling the presence of a finger so that processing can be performed. Unfortunately, this electrode is sensitive to static discharge, humidity, dirt, moisture and the conductivity of a finger touching it.

It is difficult to build such a prism into a curved surface for the simple reason that curves and flats do not fit together well. One is simply left with a flatted curve.

This problem can be alleviated by building an optical device whose reflecting surface is curved rather than flat where the finger touches it. It may be desirable to match the curve of the exposed surface of the optical device to the curve of the side of the doorknob, in which case the surface is convex, i.e., the curve sticks outward from the device. Since one's fingers typically grasp a doorknob at an angle to the axis of rotation of the doorknob, it may be convenient to position the long axis of the optical device at an angle to the axis of rotation of the doorknob, rather than being parallel to it. When a right angle prism is used, one attempts to illuminate the prism so that all of the light rays fall upon the reflecting surface at the same angle, in which case they will all exit at that same angle. So long as the angle is greater than or equal to the "critical angle", which is a well-known optical parameter and nothing but air is in contact with the reflecting surface, then all of the light will be reflected from the surface. The value of the critical angle is determined by the optical properties of the material from which the prism is made.

However, when a curved reflecting surface is used, one can make use of the fact that total internal reflection occurs whenever the angle of incidence is greater than or equal to the critical angle. Thus, rather than illuminating a right angle prism with light rays that are all nominally parallel to one another, we can illuminate the curved reflecting surface with a beam of light where the rays are not parallel to one another but have the property that each ray of light nominally strikes the reflecting surface at an angle that causes total internal reflection for that particular ray.

As a result, the side, the entrance side, of the curved optical device that has light entering it has light entering it from a variety of angles. Since the angle of incidence must be zero degrees for a beam of light to enter a surface without being reflecting from it, the side of the optical device where light enters it must also be curved for maximum optical efficiency. This factor is very useful because through proper selection of the curvature, a line of illumination, such as a one-dimensional array of light emitting diodes (LED's), rather than a more expensive two-dimensional array of LED's, can be used.

The use of a line of LED's assumes that the surface where the light from the LED's enters the optical device is curved in one dimension, in a plane that is at right angles to the long axis of the optical device. If it were desired to use only a single LED for illumination, then a surface that is curved in two dimensions could be used. In either case, the light source is placed at the focal point of the entrance side of the curved optical device.

Likewise, the surface, the exit surface, where reflected light exits the curved optical device must also be curved to minimize the amount of light that is reflected within the curved optical device from that surface. Unlike the entrance side of the curved optical device where it is often desirable to have a one-dimensional light source for illumination, a two-dimensional image must be formed on the exit side of the curved optical device. This image is the image of a fingerprint upon the reflecting surface of the curved optical device. Therefore a two-dimensional image sensor is placed in the exit field of the curved optical device at a position where a clear fingerprint image can be formed upon it.

In a conventional optical fingerprint sensor, a milky or other diffusing material is used in an attempt to make a more spatially uniform source of illumination than would be provided by a sparse array of LED's alone. Unfortunately, this diffusing layer creates light rays that enter the prism with a wide range of angles, providing a haze to the fingerprint image.

However, the use of a diffusing layer and/or uniform illumination source are not necessary when: (1) a sufficiently powerful signal processor acquires the image from the curved optical device, (2) the illumination pattern when no finger is present is measured and stored in a non-volatile fashion with the device when it is manufactured, and (3) that illumination pattern is subsequently used to compensate each image captured and processed.

Reference may be had to Figures 16A and 16B, wherein in Figure 16A is shown a body of a compound optical element with a flat fingertip 1 contact area 2, while Figure 16B shows the the body of the compound optical element to have a curved fingertip contact area 2'. Other illustrated components in these two Figures are the curved entrance surface 3 of the optical element, an illuminator 4, such as LEDs, a curved exit surface 5 of the compound optical element, the image sensor 6, an entering light ray 7, and an exiting light ray 8.

Method for verifying use of a live finger

Some applications of fingerprints to authenticate identity, such as to obtain physical access to homes, buildings, offices and hotel rooms, require a higher level of security than others, such as to gain access to one's own computer. This is because physical access implies that physical harm may come to the occupant if an unwanted intruder gains access, whereas unwanted access to a computer generally implies a lesser risk ~ the possible theft or destruction of intellectual property or financial resources.

Therefore, for the highest level of security, it is desirable to verify that a fingerprint is obtained from a live finger rather than from a replica.

Rather than capturing a single fingerprint image, this verification can be done in an optical fingerprint sensor by rapidly capturing a series of fingerprint images, either as the finger is placed upon the sensor or the finger is removed from the sensor, or both. An algorithm can be run in the processor receiving images from the sensor to detect changes in the appearance of the fingerprint as the pressure of the finger upon the sensor varies.

For example, a finger that is not in contact with a sensor may appear pink from the blood within the finger near its surface. As the finger is pressed against something, such as a fingerprint sensor of the type described herein, the blood is squeezed out of the tiny blood vessels in the finger and the finger become lighter in color. Inversely, the finger turns from light to darker as the finger is removed from the sensor.

The measurement of the color of the finger is important because the color reflects the flow of blood near the surface of the finger. In comparison, simply measuring the size of the fingerprint as the finger is pressed or removed from the sensor is not adequate because a soft rubber cast of a finger could mimic it.

Additional colors, such as blue, green and infrared, of LEDs can be interspersed among the red LEDs 4 should it be necessary to make a more complete assessment of the color of the fingerprint as the finger is pressed upon or removed from the fingerprint sensor. The fingerprint image sensor would have additional outputs to control these additional LED's. Information about the colors of the fingerprint can be stored as a part of the information describing each individual's fingerprint.

Fingerprint Image Sensor 6 Pins

One suitable embodiment for the pins on the fingerprint image sensor 6 are:

Digital Power, Vdd, and Analog Power, VddA, are nominally 3.3 v but may be 5 v if the development time and cost of the initial sensor are reduced significantly.

All signals except Reset are referenced to the rising edge of Clock. Reset is an asynchronous input. The pixel timing for LineSync, FrameSync and LightStrobe# are referenced to the data that is output on the data bus, Data[7..0].

LineSync is pulsed once per active line. It is asserted during the NumberOfActivePixels active pixels in each active row.

FrameSync is pulsed once per frame. It is asserted starting with the first active pixel in the first active line and disasserted immediately following the last active pixel in the last active line. LightStrobe is asserted according to parameters loaded into the control registers. It controls an external illumination system consisting of a series-parallel array of LED's. Each series combination of LED's has a single series resistor for load balancing. Rather than wasting power with a series resistor for each LED, as many LED's should be used in series as possible to minimize the amount of energy used to operate them. The LED's operate from "unregulated" battery power in which case the amount of current varies with the type of batteries used and the state of charge of the batteries. There are registers in the image sensor that indicate the number of saturated pixels in a frame so that an external processor can optimize the amount of time that LightStrobe is active.

The nominal speed for Clock is 5 to 10 MHz. Each clock pulse produces one pixel at the output of the sensor when data is being read from the sensor. At 5 MHz, a frame period is only about 20 mS.

A block-oriented protocol is used to write data into all of the registers in sequence starting with the first register or to read data from all of the registers in sequence starting with the first register, under control of Read/Write# and Mode[1..0]. The exact signaling of this protocol is t.b.d. A parallel bus has been chosen over a serial bus to avoid problems inherent in a serial bus.

All of the pins may be placed at one end of the image sensor die.

Fingerprint Image Sensor 6 Architecture

The fingerprint image sensor 6 may contain a rectangular array of rectangular photodetectors. All of the photodetectors integrate the optical input at the same time under control of a programmable, internally generated signal, Shutter. Each photodetector is connected to its own local analog storage element. Each of the photodetectors transfers its contents to its respective analog storage element at the same time under control of an internally generated signal, Transfer. This enables a snapshot to be taken of the entire array at one time and minimizes the effects of any external stray illumination. Anti- blooming capability is preferably provided. To minimize the complexity of the control logic in the image sensor, simple row- and column-select shift registers are used rather than large row- and column-decoder networks. Row-select logic within each photodetector chooses a row for output. Column- select analog multiplexers feed the output of the selected column, and thus of a single selected analog storage element, to a high speed, 10-bit A/D converter (ADC) with over- limit detection and saturation capability.

The ADC output is fed to a digital multiplier whose output is registered and output as Data[7..0]. When the illumination is low and the proper value is loaded into the gain control register, the effect is that the 8 lsb's of the ADC become the output at Data[7..0]. The gain control register is used as one input to the multiplier to scale the output of the ADC to provide full-scale operation of Data[7..0] for the brightest pixel. The optimum value to be stored in the gain control register is determined by an external processor. Regardless of the actual scaled value, the value output at Data[7..0] is saturated at full scale should an over-range value at the output of the ADC be detected.

When the optimum amount of illumination is used, the ADC should read full scale without saturation for the brightest pixel in a frame. Two registers are provided to count the number of times per frame that an over-range value is detected by the ADC. These registers enable an external processor to adjust the amount of illumination by varying the amount of time that the LED's are on and to set the proper value of the gain control register. One register counts the number of over-range pixels in a frame. This value is transferred to a second register at the end of the frame. This two-stage mechanism enables the over-range count to be read at any time during the next frame. In addition, if the "output over-range count" bit is set, the over-range count is output as two bytes immediately following the last active pixel in the last active row and the LineSync and FrameSync pulses are extended to encompass those two bytes.

Scanning of the image is programmable. The direction that the rows are read out is programmable under control of the "column scan order" bit. The pixels are read out in sequence across each row, from left to right or vice-versa, and from one row to the next to form a frame. The sequence that the rows are read out is programmable under control of the "row scan order" bit. The rows are read out in sequence from first to last or from last to first.

The Data[7..0], Read/Write and Mode[1..0] signals are used to program control registers that specify:

1) choice of scanning from left to right or right to left, and top to bottom or bottom to top

2) timing of Shutter and LightStrobe signals

3) gain

These signals are also used to access registers that enable an external processor to determine the number of over-range pixels to optimize the amount of time that LightStrobe is active.

The physical parameters that define the array of photodetectors are:

1) NumberOfRows (total number of rows, nominally 10 in Moving Finger Sensor, 50 in Static Partial Finger Sensor and 300 in Static (Whole) Finger Sensor)

2) NumberOfColumns (nominally 320, not all of them useful)

3) RowPitch (nominally approx. 2/1.4 mils in Moving Finger Sensor and approx. 8.9 um in Static Finger Sensor)

4) ColumnPitch (nominally 2 mils in Moving Finger Sensor and approx. 10 um in Static Finger Sensor)

5) NumberOfLeadingPixels (these reference pixels are black, nominally 8)

6) NumberOfActivePixels (nominally 304)

7) NumberOfTrailingPixels (these reference pixels are black, nominally 8)

8) PixelWidth (as close as possible to ColumnPitch to maximize optical sensitivity)

9) PixelHeight (as close as possible to RowPitch to maximize optical sensitivity)

Each row of pixels may be as follows:

The total number of pixels in a row is:

NumberOfLeadingPixels + NumberOfActivePixels + NumberOfTrailingPixels = 320. It should be noted that this total is by design an integer multiple, five, of 64. The use of correlated double sampling or a similar technique to minimize noise may be performed.

Fingerprint Image Sensor 6 Control Registers

These registers can be read and written from the data bus, Data[7..0]. A spare register can be provided to give an even number of registers.

The control registers are (lsbs at low address, msbs at high address):

Note 1 : The bits are used to decode the row address. Turn on is at the beginning of the selected row. Turn off is at the end of the selected row.

Fingerprint Image Sensor 6 packaging preferably uses a chip-scale packaging technique with a clear glass cover.

Fingerprint Image Sensor 6 Testing

Full laser-scan testing of the image sensor 6 can be done at the time of manufacture to ensure that all pixels are good and that their sensitivity is uniform. The finished fingerprint sensor is preferably tested without any finger contact so that the uniformity of the illumination system can be measured by the sensor, and the resulting illumination pattern stored in the non-volatile memory within the sensor. Fingerprint Coding

The problem of authenticating one's identity is a major problem. The use of biometrics, such as fingerprints, may be divided into two classes of applications, referred to herein as "public" and "private". By "public" is meant the authentication of one's identity in applications such as a point of sale, pay telephone, hotel, car rental and airline ticketing. These applications are likely to have wide area or Internet communications capability. By "private" is meant the authentication of one's identity in more personal applications such as gaining access to one's home or computer, or using a fingerprint-enabled gun. Most of these applications would lack such a wide-ranging communications medium.

A critical problem is how to authenticate one's identity in public applications. It is not convenient to have to obtain and carry a smart card or other identity card that bears your fingerprints or information about them, e.g., minutiae lists, and it has not yet been possible to store information about a fingerprint in the limited storage capacity of the magnetic stripe on a credit card.

It is also not yet safe to store critical identity information on the Internet. If any sort of detailed information about one's fingerprints were stored on the Internet, then that information would be a proxy for the owner and with that information anyone could pretend to be another person, resulting in identity theft.

It is thus desirable to provide a truly global system that can be used to authenticate the identity of anyone who has made their identity known to the system. The Internet provides a means for providing such a global system. By having fingerprint capture, processing and communication devices as disclosed herein located throughout the world, wherever it is desirable to authenticate one's identity, it will no longer be necessary to remember PIN codes or passwords, and it will not be necessary to carry a smart card or other form of identity card. There will be nothing to remember or forget to be able to authenticate one's identity quickly, cheaply and reliably.

Basic Improved System for Authenticating Identity It is common in fingerprint systems to extract a series of identifying marks or minutiae points from a fingerprint and to form a list of the marks. This "minutiae list" typically includes information about each type of mark, such as whether it is a ridge ending or a point where several ridges come together, and its location in an X-Y coordinate system that encompasses the entire fingerprint. The total amount of data to represent a single fingerprint in this representation is typically several hundred bytes.

This representation is very inefficient from an information theory point of view. This is because there is a large but limited number of humans and thus human fingers on Earth. This number is much, much less than can be represented by the hundreds of bytes that comprise a typical minutiae list. For purposes of estimation, if there are five billion people on Earth then there are at most fifty billion fingers to deal with. Only 36 binary bits, i.e., 4-1/2 bytes, can ideally represent fifty billion fingers. We thus have an inefficiency of nearly 100 to 1 in the current schemes to represent a minutiae list.

One of the reasons that a minutiae list is so large is that the X and Y coordinates of each minutiae point are stored. Since fingerprints are typically sampled at 500 dots per inch (dpi), and the extent of a fingerprint is about one inch, one needs 9 bits for each of the X and Y coordinates for each point. This representation is problematic because there is no fixed point of reference for the coordinate system and because the size of fingers changes with growth, especially during childhood, and with humidity and perspiration, among other factors.

This standard X-Y coordinate system does not reflect the fact that a fingerprint is a topology that is fixed at birth, save for damage to the fingerprint. In fact, since the flow of a fingerprint matches the shape of a finger - roughly parallel to the sides of the finger, rounded at the tip of the finger and horizontal at its base, i.e., parallel to the joint between the finger tip digit and the middle digit of the finger, a better method of representing a fingerprint is to use a modified polar coordinate system.

This modified polar coordinate system is centered on the core of the fingerprint. The unit of the system is not an arbitrary physical dimension, such as 1/500 inch corresponding to 500 dpi, but is a count of the number of ridges that a minutiae point is away from the core of a particular fingerprint. A list of minutiae points is formed in terms of this modified polar coordinate system and includes a statement of the average ridge spacing in the fingerprint to set the scale.

A polar coordinate system needs a line of reference from which to measure angles. This line of reference is obtained by recording an image of not only the fingerprint but also the crease or joint between the outer and middle digits. Extra care is required to do this because the crease is recessed from the surface of the finger where the fingerprint ridgelines are recorded.

When a finger has more than one crease line, the crease line nearest the end of the finger is chosen. If that crease line, or a single crease line is curved, a best-fit straight line is placed through that line and used as a reference.

To help position a finger in such a way as to capture the crease line, it may be convenient to have a fingerprint sensor with an edge that the finger can be bent over, positioning the crease along that edge. An example of this is a credit card reader found in many self- service gasoline pumps. A credit card is pushed vertically into a slot in the reader then pulled out of it. On either side of the slot for the credit card is a space for fingers to hold onto the credit card. These two spaces present "pockets" that a finger can be bent over slightly by placing the crease of a finger on the edge of the slot. A fingerprint sensor may be positioned in one of these two pockets, at the outer edge of the pocket so that a finger could easily be placed upon it and the edge of the pocket.

Once the crease line is obtained or used, a line is extended at right angles to it and positioned laterally across the fingerprint so that it extends vertically through the center of the core of the fingerprint. This line thus splits the fingerprint into two portions, a left portion and a right portion. A second line is extended at right angles to this first line, being positioned along the length of the first line so that is passes horizontally though the center of the core of the fingerprint.

The minutiae points in the fingerprint are now recorded in terms of this modified polar coordinate system. Ridges are counted from the center of the coordinate system, which is located in the middle of the core of the fingerprint. To avoid confusion, the list created may be referred to as a topology list.

A very small number of bits can now be used to record information that encodes the minutiae points into a topology list. As will be shown, it is particularly convenient if no more than 80 bits or 10 bytes are used to record the entire minutiae list.

The method for encoding the minutiae list into an 80-bit (or any other number) topology list is:

1) Go to the center of the coordinate system.

2) Set the 80 bits of the topology list to all 0's.

3) Move to the left along the horizontal axis (orthogonal to the length of the finger) of this coordinate system.

4) Stop at the first ridge line that crosses the horizontal axis.

5) Traverse the ridgeline in a clockwise fashion. If the ridge line is continuous and does not intersect any other ridge lines by the time it returns to the right hand extent of the horizontal axis, record a value, 00 (binary) for the current two bits of the topology list. If the line terminates, record another value, 01 (binary), instead. If the line splits, record another value, 10 (binary), instead. If the line is joined by another line, record another value, 11 (binary), instead. If all 80 bits of the topology list have been filled, then stop, otherwise move on to the next two bits of the topology list and continue.

6) Return to the intersection of the current ridgeline with the left extent of the horizontal axis. Traverse the ridgeline in the counterclockwise direction, encoding the ridgeline in the same way as was done when a clockwise traversal was done. If all 80 bits of the topology list have been filled, then stop, otherwise move on to the next two bits of the topology list and continue.

7) Return to the intersection of the current ridgeline with the left extent of the horizontal axis. Move to the left along the horizontal axis and stop at the next ridgeline that crosses the horizontal axis. If there are no more ridgelines then stop, otherwise go to step 5.

8) Stop

This method records up to 40 minutiae points in only 80 bits by using only 2 bits per point. Eighty bits have been chosen simply because the number is so large that it would take a prohibitively long time to try to decrypt a message that is based upon it by brute force.

Once the topology list is complete, it is input to an algorithm that computes both a public key and private key from it. One suitable algorithm for computing public key / private key pairs in a small fraction of a second on a fast parallel processor such as the A436 Chip 10 is available from NTRU Cryptology, Inc., in Providence, RI.

Once the private key and the public key have been computed, a random message is created or otherwise obtained and encrypted using the private key. This encrypted message is then sent via the Internet or other means to someone or a computer that desires to authenticate the identity of the person whose fingerprint has been obtained and processed. Additional information about the person is also sent, such as the purported identity of the person and their credit card information.

The recipient of the information accesses a table of public keys. This table is readily available to everyone for reading via the Internet or other means and contains the public key for everyone with whom the recipient desires to communicate. In particular, the recipient obtains the public key for the person who sent the message.

The recipient then decrypts the message using the public key. The decrypted message will be intelligible only if the sender of the message was who he/she claimed to be as evidenced by the fact that the proper private key was used that matched the readily available public key. The recipient then sends a message back to the sender acknowledging that the sender's message has been decrypted successfully and that his/her identity is deemed to be authentic.

It may happen that the algorithm for producing the topology list does not produce the same list every time. This could happen as a result of a slight shift or rotation in the coordinate system that is used to produce the topology list. Therefore it is desirable that a small number of public keys be associated with each user and that all of them be tried in turn to try to decrypt a message if necessary.

Enhanced System using Feedback for Authenticating Identity

The method given above to produce a short topology list relies upon the ability to reliably devise the same coordinate system for a given fingerprint. In general, this approach is analogous to current algorithms that capture and process a fingerprint autonomously, i.e., without any external assistance or a-priori knowledge of the particular fingerprint.

However, the use of the Internet or other wide area communications medium provides a new vehicle for authenticating the identity of a customer in a "public" application such as a point of sale terminal, pay telephone, hotel, car rental or airline ticketing. In comparison, "private" applications such as gaining access to one's home or using a fingerprint-enabled gun would lack such a wide ranging communications medium and have to handle a fingerprint without any outside assistance.

It is thus my intention to use the Internet or the like to obtain "hints" about the best way to process a given fingerprint. These hints must be sufficiently small that they in no way risk the loss of one's identity by becoming widely known.

My objective is to derive a stable and repeatable coordinate system from a given fingerprint so that a short topology list can be created reliably from that fingerprint. Then, with that topology list my objective is to authenticate the identity of the individual. Part of the basic idea is that a triangle can be used to define a coordinate system. Two points of the triangle define the baseline of the triangle, which is the horizontal axis. A third point located away from the line running through the first two points defines the vertical axis, which is orthogonal to the horizontal axis and passes through the third point. The intersection of these two lines is the center of the coordinate system. In a fingerprint, I use a small number of pixels in a two-dimensional area or array to define a region. The number of pixels in this small area is large enough that the area can be located in a given fingerprint of a given individual, given that the relationship of the three areas to one another is known, but small enough that it does not take much data to represent it and does not compromise the individual's identity. One point of the triangle is situated uniquely within each small area, such as by being at its center.

My method for deriving a repeatable coordinate system for a given fingerprint and authenticating one's identity via the Internet is:

1) A "fingerprint device" is provided that can quickly capture and process fingerprints, compute private key/public key pairs rapidly, encrypt and decrypt messages, and communicate with distant computers via the Internet or other means.

2) A global or other wide area identity authentication system is provided that stores credit card and/or other information about many customers in a secure fashion, and has the ability to access a publicly available table of information that includes one or more public keys for each customer plus a limited amount of information that is representative of a very small fraction of each customer's fingerprints.

3) When one desires to enter or enroll information about him/herself into the global identity system, he/she identifies him/herself via a credit card or other means and places his/her finger on the fingerprint device. The fingerprint device captures the fingerprint and locates a number, typically three, of small regions in the fingerprint that can serve as reference points for a coordinate system for the fingerprint. These regions are preferably so small that they cannot be used to reconstruct the fingerprint and are likely to occur in the fingerprints of others. An iterative procedure is used to optimize the selection of these small regions that serve as reference points. Information representative of these small regions is conveyed to the publicly available table of information along with the newly created public key. This process is repeated for each fingerprint that the customer desires to use to authenticate his/her identity.

4) When one who has registered him/herself with the identity authentication system desires to prove his/her identity, he/she places one of his/her fingers on a fingerprint device as defined herein and declares his/her identity using a credit card or other common means. The claimed identity is sent to the identity authentication system, which returns the "hints" about the fingerprints, i.e., the information about the small regions in the various fingerprints that have been entered into the system. The fingerprint device then tries to locate the corresponding small regions in the current fingerprint. Since the live fingerprint is likely to be rotated and translated compared to the fingerprint from which the hints were derived, a substantial amount of computation may be required. In addition, it may be desirable to scan the fingerprint at higher than normal resolution so as to be able to have sufficient information to match the areas precisely.

5) If no match is found, the identity authentication fails. If a match is found, then the entirety of the live fingerprint image is rotated and translated so that the hints match well and a new topology list can be created.

6) The method described above for authenticating the user's identity is then performed.

The operation of this method can be sped up by caching, or storing locally, the information that is required to authenticate the identity of customers who are likely to be requesting the authentication of their identity in the near future.

Image Processor Chip Optimized for Battery-Powered Fingerprint and Wireless Image Compression Applications

Increased Parallelism The performance of the A436 Processor Chip 10 is increased several ways, one of the ways being the use of increased parallelism within it compared to the A236 Chip., as has been described in detail above.

For example, each motion estimation unit or coprocessors 24 in the A436 chip 10 has its own set of control registers, analogous to the single set provided in the A236. These motion estimation coprocessor 24 can be used for many operations, not just motion estimation. They can provide very high performance pattern matching which is useful in fingerprint applications to extract minutiae points, and they can provide alignment functions in scanners where it is necessary to line up the pixels from different color planes.

As an additional step, eight ALUs can be built into each processor of the A436, with the result being returned to a register within each bank of eight registers, to provide additional capability. One class of instruction would perform one operation a time in each parallel processor, drawing data from a single pair of registers. A new class of instructions with increased parallelism performs eight operations at a time in each parallel processor, with each operation operating on data within the registers in each bank of eight registers, or drawing a common operand from memory.

Increased Connectivity

An additional speedup would be provided to enhance the performance of the matrix transpose operation, which is critical to the implementation of two-dimensional discrete cosine transforms (2-D DCT's). This would is provided by implementing a new set of interconnections among the registers of the parallel processors 22 so that one parallel processor 22 can access data from another parallel processor 22 in a variety of configurations. A new set of "read only registers" would be provided within the parallel processors so that a series of reads would access the data from a series of registers in the parallel processors in such a fashion that a matrix transpose operation could be performed quickly.

Encryption/Decryption Capability To protect programs and critical data that are stored in a Flash EEPROM or other nonvolatile memory outside of the A436 Chip, where they potentially could be copied, an encryption feature may be provided in the A436 Chip 10 and systems using it.

The method for protecting sensitive programs and data is as follows:

A 32-bit "encryption" (actually decryption) register is built into the Scalar Processor or arithmetic unit 26 in the A436 Chip 10. Data from the Instruction Cache 28 is sent to this register whose contents can be written back into the Instruction Cache 28. A set of feedback terms may be permanently built into this register when the final metal layer is applied to the A436 Chip 10 during fabrication. This metal layer can be customized such that each customer can have A436 Chips 10 that are unique to him or her.

To ensure privacy, a program running on the A436 Chip 10 cannot determine the feedback terms built into this encryption register, nor can they be seen when the bare die of the A436 Chip is viewed.

The software tools for the A436 Chip 10 may be configured so that a customer can enter his/her encryption code into the tools, and the tools will encrypt programs using it. The encrypted program is then stored in the non- volatile memory outside of the A436 Chip 10. A program may then contain a short section of clear or unencrypted code, followed by the encrypted code.

When the A436 Chip 10 is reset, it reads the encrypted program from the non- volatile memory. The clear code would cause the A436 Chip 10 to read in the encrypted program, pass it through the encryption register and store it in embedded DRAM within the A436 Chip 10. A memory management capability would then be provided in the A436 Chip 10 to keep track of the locations in memory that have the decrypted program stored in them. Any attempt to pass the contents of these locations outside of the A436 Chip 10 would be blocked, and any attempt to access these locations except for an instruction fetch would be blocked. The embedded DRAM would be fully testable, however, by a program that does not pass any information through the encryption register before it is sent to the DRAM.

Sensitive data could also be passed through the encryption register and stored by the A436 Chip 10 in the external non-volatile memory.

Fingerprint Applications

Childproof Guns

Referring to Figure 15 several components that are added to a weapon (e.g., a handgun or gun 50) can be seen. The addition of these components is intended to restrict the use of the weapon 50 to only authorized person(s), e.g., to make the weapon "childproof.

1) The horizontal cylinder with a plunger is a "latching solenoid" 52 or other device that keeps the gun from firing until it receives a signal from the A436 chip 10 that it has recognized the owner's or authorized person's fingerprint(s). The plunger moves to enable/disable the gun 50 (normally disabled.) A part of the gun 50, the "hammer actuator", that moves horizontally when the trigger is pulled is not shown in the drawing. It would connect the trigger to the hammer of the gun. The solenoid should be vertical, not horizontal, so that its plunger can engage the moving piece to stop the gun from firing. The plunger would normally be out to stop the gun from firing.

2) The button 54 in the indented spot in the handle is a switch that turns on the fingerprint unit when the user wraps his/her fingers around the gun 50. The gun becomes disarmed the moment the user lets go of it, such as when it is dropped or knocked from the hand.

3) The horizontal image with the fingerprint represents the fingerprint of one's middle finger upon the fingerprint sensor 56. The sensor 56 is in line with the indentation 58 in the handle and the switch 54. The sensor is not necessarily shown to scale. The key point is that the sensor 56 is placed horizontally and is relatively long so the gun 50 can be used with people having fingers of varying length. The indentation 58 in the front of the handle, immediately below the trigger, helps the middle finger wrap around the handle and onto the sensor 56. 4) The square device, not drawn to scale, represents the A436 chip 10. The A436 chip 10 operates so quickly that by the time the one would grab the gun 50 and raise it into firing position (about 1/10 second), it would be ready to fire if you are authorized to use it.

5) The square to the right of the A436 chip 10 is the non-volatile memory chip 60 (nonvolatile means that it maintains its data even if power is turned off.) Chip 60 stores the fingerprint program and the fingerprints for authorized users.

6) The rectangle behind the sensor 56, A43636 chiplO and the non-volatile memory chip 60 represents a circuit board 62 that the circuitry is mounted on. It could be smaller so that its overall size is just a bit bigger than the three chips it holds.

7) One or more batteries 64 are preferably replaceable batteries that are mounted in the magazine, i.e., the removable component that holds the bullets (not shown).

Operation

The "hand" of the gun can be changed from right to left. This work would be done at the gun shop. Since most people are right-handed, guns would be shipped right-handed then made left-handed at the gun shop for those people who need them. This would be done by removing both handle grips from the gun and installing the fingerprint sensor in the right grip instead of the left grip.

An alternative location for the fingerprint image sensor is in the thumb rest of the gun.

Guns that must be able to be fired either left-handed or right-handed can have two fingerprint image sensors built into them. The fingerprint images from both sensors would go to a single A436 chip 10, which would select the appropriate image to process.

Normally, if an authorized user is using the gun 50 the gun will fire, but if an authorized user were NOT using it, the gun 50 would not fire.

The method for using the fingerprint-enabled gun 50 is as follows: the gun would operate automatically and instantly. The authorized user would simply pick up the gun, aim and fire. There would be nothing to remember or forget. No extra hand or finger motion would be required to use it. Once the gun 50 is enabled, the user could keep firing until the gun run out of bullets or the user lets go of the handle. The electronics operate so quickly that by the time the gun is raised into the firing position, it would be ready to fire if the user is an authorized user. There would not be any lock to remove, any combination to remember and enter, or any sort of special bracelet or ring that one would need to wear or carry to operate the gun. When finished using the gun, the user would simply put it down and it would instantly become disabled. If it were dropped or knocked from the hand it would become disabled instantly. The gun 50 is thus inherently "childproof ", assuming that a child is not an authorized user.

The same gun could be programmed so several people, such as a husband and wife or a peace officer and his/her partner(s) could use it. The guns could only have new fingerprints installed, or unwanted fingerprints removed, at an authorized gun dealer, or, for law enforcement use, at an approved facility such as a police barracks or weapons depot.

Summary specifications for Childproof Gun

Using the A436 Digital Signal Processor chip 10 technology for the image processor and a small fingerprint image sensor, new safer, "smart" or childproof guns can be built with these features:

1) With a high degree of certainty (objective is 99+%), the gun could be fired only by an authorized user

2) No trigger lock, radio transmitter, key or any other safety device is required

3) There is nothing to lose, forget or remove; operation is natural and intuitive

4) Biometrics - a fingerprint - are used to verify the identity of the user immediately before each use

5) Use of the gun is done normally with only one hand; fingerprint activation is fully automatic

6) Multiple (typically up to 10 but could be more) fingerprints can be stored in a single gun 7) A single gun can be programmed to be fired by one or more authorized users

8) User authorization system operates extremely quickly to enable firing the gun

9) With a high degree of certainty (objective is 99+%), a gun cannot be taken from an authorized user and fired by an unauthorized user

10) A gun can be fired multiple times once the user's identity is verified, but becomes disarmed as soon as the user lets go of it

11) The highly miniaturized user authorization system is built entirely into the gun:

12) The fingerprint sensor 56 is built into an ergonomically correct part of a gun to capture a live image of the user's fingerprint

13) The embedded A436 digital signal processor chip 10, operating as an image processor, provides all fingerprint processing and verification

14) Fingerprint data about authorized users is stored in the non-volatile memory 60 within the gun

15) User-identification capability can be built into guns at low cost

16) Long battery life can be provided during storage

17) Fingerprints would be stored in a gun when the gun is purchased and could be updated only by an authorized service facility or law enforcement agency to add or delete fingerprints, or to facilitate the sale of the gun by changing the fingerprints

Having thus described one important application of the A436 chip 10, a description of several other important applications will now be provided. These further applications are not intended to be an exhaustive list of possible applications, as those skilled in the art may derive a number of other applications when guided by the teachings herein.. Door locks

The key to the design is that a fingerprint sensor is built into the doorknob itself so that one simply grasps the doorknob and turns it to open a locked door. The door would become unlocked if the person holding the doorknob is authorized to enter.

Door locks with user's identity authenticated via the Internet

Frequent travelers are often inconvenienced by the need to wait in line at the registration desk of a hotel that they are checking into. Instead, they could make a hotel reservation via the Internet, using their fingerprint to authenticate their identity as disclosed herein, and use their fingerprint instead of a key to enter the room assigned to them in the hotel of their choice.

When the traveler arrives at the hotel, he/she could consult a display unit that informs the traveler of the room assignment if he/she has not obtained this information previously. The traveler could then go to the assigned room and enter it using his/her fingerprints. The door locks in the hotel would respond to fingerprints and be able to have information loaded into them from the Internet as well as from the hotel registration desk. The information would be the "hints" and public keys of anyone who is authorized to stay in a particular room in the hotel.

The same doorlock could allow several people to stay in the same room by unlocking the door in response to the fingerprints of any of the people who are registered to stay in the room. The room assignment could be conditioned on the date and time so that previous guests could not reenter the room. A doorlock could also be programmed to allow certain members of the hotel maintenance and cleaning staff to enter a room at particular times of day to service it.

ATM Machines with user's identity authenticated via the Internet

Automatic teller machines (ATM's) are very popular because they enable customers to obtain cash at any time and without having to go to a bank. However, it is essential to authenticate the identity of the customer so that funds are not stolen. Presently, large sums are lost to persons who have obtained other's passwords or PIN (personal identification number) codes, while others who are entitled to withdraw funds cannot do so because they cannot remember their password or PIN code.

Attempts have been made to use biometrics such as retina and iris scan devices to authenticate a customer's identity, but these devices are expensive, slow and difficult to use, and they frighten some people who do not want anything looking into their eyes. In addition, they are unreliable if someone has been crying, their eyes have been dilated, or they have changed their glasses or contact lenses.

The method disclosed herein for authenticating identity using fingerprints solves these problems.

A customer goes to an ATM and presents a credit card or other card that identifies the account that he/she wishes to use, and that presents the claimed identity of the customer. The customer would enter the information about the desired transaction and would present his/her fingerprint to a fingerprint unit of the type disclosed herein.

The fingerprint unit would instantly compute a public key/private key pair for the fingerprint presented, encrypt a message using the private key and send the message to a computer system that has access to the public keys for all customers. The computer system would attempt to decrypt the message using the one or more public keys for the individual attempting to make the transaction. If the computer system is able to successfully decrypt the message, then the identity of the customer has been authenticated and the transaction is allowed to proceed. Otherwise it is rejected. In either case, the public key/private key pair that were computed in response to the fingerprint would immediately be erased so they could not be stolen.

Such ATM's can also be used to register fingerprint information into the system when a customer has not used the system before. This would avoid having customers having to go into their banks to be able to use the system. If someone has stolen someone else's identity and registers themselves in their place, then an error message will be given when the rightful person attempts to register him/herself with the system.

Public Telephones with user's identity authenticated via the Internet

The use of public telephones to make long distance calls is commonplace. Not only are these calls expensive, but one must use a PIN code to make them and one must enter the access code for the long distance carrier one desires to use if a carrier different from the default carrier is desired. The entering of the PIN code and access code exposes this information to theft either by someone who is watching which buttons on the phone are being pressed, or to someone who is recording the tones being used to make the transaction.

This problem can be avoided by building a fingerprint capture, processing and communications device as disclosed herein into the phone, or attaching such a unit to it, and using an enhanced version of the identity authentication method disclosed herein.

A customer enters the information about the number he/she is calling as well as the access code to select a carrier. The customer would be prompted to place his/her finger on the fingerprint sensor of the fingerprint device. An encrypted message would be produced using the private key that had just been computed. The message would contain information so that this same encrypted message could not be recorded and used again. This message would travel to a server that can access the public key for the customer. The message would be decrypted using the public key and the call would be allowed to proceed if the customer's identity is authenticated.

Alternatively, a "fingerprint keyfob" or other easily carried device could be built with an audible output that could easily be coupled into a telephone transmitter (the part of the phone that one speaks into). This keyfob would contain a fingerprint capture, processing and communications system as disclosed herein but could be used with existing telephones by nature of its audible output. This keyfob could also automate the process of accessing a long distance carrier by having the necessary information programmed into it and output as a part of the communication of the message to the server. This would save the customer the need to remember or enter the information and would promote the use of a particular long distance carrier.

The capability of the fingerprint keyfob could be built into other devices such as cell phones, portable computers and portable digital assistants (PDA's).

Point of sale terminals with user's identity authenticated via the Internet

Using the fingerprint method disclosed herein, a customer could present his/her finger to a fingerprint unit when he/she desires to make a purchase. The fingerprint unit could be integrated within a point of sale terminal or could be standalone. If the person is whom he/she claims to be, the customer's identity would be instantly authenticated and the transaction could proceed, or the identity would be rejected and the transaction blocked.

Electronic Airline Tickets with user's identity authenticated via the Internet

It is common to purchase airline tickets on the Internet. However, one must go to an airline check-in stand or gate to obtain a boarding pass since the Federal Airline Agency (FAA) requires that the identity of the traveler be authenticated prior to boarding. Long waits are often encountered.

The fingerprint method disclosed herein can be used to authenticate the identity of the traveler, greatly reducing the amount of time that one spends waiting.

One or more small computerized units, or kiosks, each containing a fingerprint unit, could be situated near every airline gate. A traveler would identify him herself to the kiosk, such as via a credit card, and present his/her fingerprint to the fingerprint unit. His/her identity would then be authenticated using the method disclosed herein and a boarding pass would be issued.

Since the roster of passengers for each flight is known, the fingerprint information for those passengers could be stored, or cached, local to the gate, speeding up the processing. As an additional security feature since one is not being interviewed by a human airline agent, the kiosk could ask the traveler various security questions and could take a photograph of the traveler. This photograph could be stored as a part of the record of each flight.

Car rental with user's identity authenticated via the Internet

A traveler could conveniently rent and use a rental car in a manner analogous to using a hotel room as described above, with the added provision that the traveler's fingerprints could unlock both the doors and the steering mechanism of the car. The rental car would be able to receive information about persons renting the car directly from the Internet or from other means such as portable data units used by rental car personnel.

A "fingerprint keyfob" that has fingerprint and wireless communications capability as disclosed herein built into it would enable one to open a car's door without touching it, using one's fingerprint as the "key". This fingerprint keyfob could be used for one's own car as well as rental cars.

Electronic Mail with user's identity authenticated via the Internet

The sending of mail over the Internet is one of the most common uses of the Internet, but there is no security provided unless some form of encryption of the message is used. One could make an agreement with each recipient of one's e-mail to use a particular password to decrypt a message, but that is a tedious process and has to be done using a communications method separate from the Internet to ensure the privacy of the communication of that password. It would be more convenient and secure if there were a universal way of selecting a security code.

When e-mail is sent, the claimed identity of the sender accompanies the message. The recipient could use this identity to access a readily available table of public keys. If the sender has placed his/her public key in this table and used his/her private key to encrypt the message sent, then messages can be sent securely by any two parties even though they have not communicated before. The sender, to ensure that his/her private key is available to him/her at all times, such as on travel when he/she does not have access to his/her standard computing devices, could instantly recreate his/her private key using the fingerprint method described herein.

Use of Image Processor Chip with Digital Image Sensors

Geometric and Chromatic Image Correction

Digital image sensors typically have a matrix of red, green and blue color filters over their array of pixels. A common organization of these dots is the Bayer 2G pattern that is formed from a 2 x 2 grid of dots, with green dots on one diagonal, one of the remaining dots being red and the final dot being blue.

Often, the pixels are read out from such a sensor using a line-scan processor that has a small amount of memory and combines a small two-dimensional group of pixels using various weighting factors to produce various forms of video signals such as NTSC. Such a NTSC signal is then passed to and processed by other circuitry. The result is often converted back to a red-green-blue format to drive a video display device such as a liquid crystal display.

A problem with this organization is that image quality is lost when multiple pixels are combined into one, and more image quality is lost when the image is converted back to red-green-blue. It is more efficient to do all of the processing in the format of the pixels that come directly from the sensor rather than making intermediate conversions.

One example of this is the problems involved with inexpensive lenses. One of the reasons for the popularity of digital image sensors, particularly ones made with low cost, CMOS processes, is the low cost of the sensor. It is therefore desirable to use a low cost lens system with such a low cost sensor.

It is also desirable to have a lens that is large compared to the area of the pixel array in the sensor. This enables additional light to be gathered and focused on the sensor so the sensor can produce an acceptable image when the amount of available light is low. However, low cost lenses typically produce chromatic and geometric distortions. Note that geometric distortion does NOT change the color of the light, it merely affects the geometry of the image for a particular range of colors.

This problem can be alleviated by having a processor, preferably the A436 chip 10, obtain pixels from a digital image sensor in the format produced by the sensor and store them into memory that is readily accessible by the processor. Rather than just storing one or a few scan lines of pixels into memory, it is desirable to store large sections or an entire frame of pixels into memory.

The processor can then simultaneously correct both geometric and chromatic distortions in the image. The method for doing this involves several steps.

First, when the system containing the lens, sensor and processor is manufactured, a series of tables are produced and stored in a non-volatile memory accessible by the processor. One table is provided for each of the colors of filters. These tables specify the geometric mapping of pixel input position to pixel output position for the lens used. This mapping can be specified in a piece-wise linear fashion to minimize the size of the tables, or a series of equations can be used.

Then, when the system is in operation, an image is acquired directly from the sensor and stored in memory. A series of output images is formed, one for each color. Then, for each color, pixels of one color are read from the input image and moved to an output image of that same color. The appropriate table is used to properly select one or more input pixels for processing and to position the resulting pixels in the output image.

Finally, once all of the individual colors have been processed, a composite full color image is formed from them.

Alternatively, a low cost but highly light-sensitive monochromatic image sensor can be built less expensively by using a color image sensor, low cost lens and processing as described above to correct chromatic and geometric distortions, than by using a monochromatic image sensor and a high quality lens. Transportation Application

It is desirable in many transportation applications to eliminate blindspots around vehicles. This is true for trucks, buses and automobiles, as well as motorized construction and farming equipment.

When the vehicle is made in a single unit, such as an automobile, wires can be installed in new vehicles to connect one or more image sensors to a processor and display unit. However, the nation has large numbers of tractor-trailers where the vehicle is in two or even three units and the standard set of cables that interconnect them do not have any spare wires in them to carry the video signals.

Video data from the trailer can be conveyed to the tractor via a wireless link, such as radio frequency or infrared, or by placing a high frequency signal on top of one or more of the existing wires that connect the trailer to the tractor. In either case, the amount of data that can be conveyed is limited and it is necessary to use video compression techniques to send sufficiently high quality images.

Each trailer may have one or more, typically three, video cameras affixed to it. These cameras protrude from the tractor and are easily broken thus low cost is important. In addition, it is desirable to maintain the highest possible signal quality from each camera to a processor located at the front of the trailer. Not only does one want to minimize the number of data transmission errors in the connection between the camera and the processor but one also wants to obtain the highest image quality possible. This requires that: (1) image data be sent from the camera to the processor, (2) the processor be able to send commands to the camera to control its operation, such as integration time, and (3) data be obtained from the camera in the format that maintains the highest image quality. When a digital image sensor is used, the data should be obtained from the sensor in the exact pixel-by-pixel format produced by the sensor.

To minimize the number of wires required to connect a camera to the processor, a low voltage differential signaling technique should be used. The processor should be configured so that it can simultaneously receive video information from all of the cameras and store this data into memory it can access easily. The processor would have control logic that manages all of the buffers and enables software running on the processor to determine what location in each buffer is being loaded with new data at any point in time. Depending upon the lighting on each camera, different cameras may operate at different frame rates, and all cameras operate independently of one another. In addition, different cameras may produce different size images.

To avoid having to lock the timing of the cameras together, the processor would use separate sections of the same memory to buffer the data for each of the cameras. Each buffer would operate in a circular fashion and be able to store at least two complete video frames. The processor can be instructed by a message from the tractor to operate upon the camera data and to form a composite image from all of the cameras and store this composite image in yet another portion of memory in a circular fashion. The processor would make use of the control logic for the buffers to obtain the most recently received full frame of data.

The processor could then compress the resulting image and pass it on to the tractor.

This invention has been described in the context of various preferred embodiments and applications thereof. However, the teachings of this invention are not intended to be limited to only these presently preferred embodiments and applications, as those skilled in the art may derive various modifications to same, when guided by these teachings. Thus, the teachings of this invention are intended to be given a broad and reading.

Claims

ClaimsWhat is claimed is:

1. A digital signal processor device, comprising:

a data cache;

a scalar arithmetic unit coupled to said data cache;

a parallel processing unit coupled to said data cache, said parallel processing unit comprising an n row by m column array of parallel arithmetic units, and

n pattern matcher coprocessors individual ones of which are coupled to one of said n rows of parallel arithmetic units.

2. A digital signal processor device as in claim 1, wherein said device is programmed using one-dimensional and two-dimensional parallel data types.

3. A digital signal processor device as in claim 1, wherein a program for said device is made using a method that defines an amount of parallelism, and using a compiler to produce code having the same amount, or a lesser amount, of parallelism than used to define the program.

4. A digital signal processor device as in claim 1, wherein said device comprises circuitry for directing different groups of operands to different ones of said parallel arithmetic units, depending on the instruction being executed.

5. A digital signal processor device as in claim 1, wherein said device comprises circuitry for accessing multiple operands from said cache memory, where all of the operands can be accessed in a single clock cycle, regardless of the address of the first of the multiple operands and regardless of the placement of the set of parallel operands within one or more cache pages.

6. A digital signal processor device as in claim 1, wherein said device uses an instruction word that can contain either a set of bits that defines a mode of operation, the control of said scalar arithmetic unit and said parallel processing unit, and is executed as an entity, or a set of bits that defines a mode of operation and two sets of operations for said scalar processing unit, or a set of bits that defines a mode of operation and two sets of operations of said parallel processing unit.

7. A digital signal processor device as in claim 1, wherein said device comprises circuitry for accessing an instruction from said cache memory where all of the bits of the instruction can be accessed in a single clock cycle, regardless of the address of the first byte of the instruction and regardless of the placement of the entire set of bits within one or more cache pages.

8. A digital signal processor device as in claim 1, wherein said device executes instructions that feed one set of operands to a first group of parallel arithmetic units, and a second set of operands that contain a portion, but not all of, the first set of operands to a second group of parallel arithmetic units.

9. A digital signal processor device as in claim 1, wherein said device executes instructions that feed one set of operands to a first two groups of parallel arithmetic units, and a second set of operands that contain a portion, but not all of, the first set of operands to a third and fourth group of parallel arithmetic units.

10. A digital signal processor device as in claim 1, wherein said device comprises circuitry for writing selected operands to memory from said parallel arithmetic units, where the connection of the parallel arithmetic units to memory can vary from one instruction to the next.

11. A digital signal processing device as in claim 1, wherein data that is loaded into said data cache is output from an image sensor.

12. A digital signal processor device as in claim 1, wherein data that is loaded into said data cache represents all or a portion of an image of a fingerprint.

13. A digital signal processor device as in claim 1, wherein individual ones of said n x m array of parallel arithmetic units each comprise a register coupled to said scalar arithmetic unit, whereby said scalar arithmetic unit is enabled to broadcast a value to individual ones of said parallel arithmetic units.

14. A digital signal processor device as in claim 1, wherein said device comprises circuitry for directing different groups of operands to different ones of said parallel arithmetic units, depending on the instruction being executed, such that groups of adjacent bytes are simultaneously processed in a staggered fashion.

15. A digital signal processor device as in claim 1, wherein data that is loaded into said data cache represents all or a portion of an image of a fingerprint, and wherein said digital signal processor device processes said data and compares a result to stored data representing at least one pre-stored fingerprint image.

16. A digital signal processor device as in claim 15, and further comprising at least one mechanism coupled to said digital signal processor device for being controlled in accordance with an outcome of said comparison.

17. A digital signal processor device as in claim 16, wherein said mechanism comprises a locking mechanism for a doorknob or a weapon, wherein said doorknob or weapon comprises a fingerprint image sensor for generating said image of the fingerprint.

18. A digital signal processor device as in claim 15, wherein said stored data is received from a data communications network.

19. A digital signal processor device as in claim 1, wherein n=8, m=4, and wherein said data cache has a width of 128 bits.

20. A digital signal processor device as in claim 1, wherein parallel operands comprise up to 16 adjacent bytes that can be placed on any memory byte address in said data cache, and which can be accessed, sign-extended or zero-extended to 16 bits, and processed in a single CPU cycle, wherein two-dimensional, imaging-oriented parallel data types are processed, and wherein instructions can be executed for parallel implementation of matrix-vector multiply, convolution, matrix transpose, and histograms/table- lookups/erosion/dilation operations.