WO2001099430A2 - Audio/video coding and transmission method and system - Google Patents

Audio/video coding and transmission method and system

Info

Publication number
WO2001099430A2
WO2001099430A2 (PCT/CA2001/000893)
Authority
WO
WIPO (PCT)
Prior art keywords
data
image
packet
audio
multimedia
Prior art date
Application number
PCT/CA2001/000893
Other languages
French (fr)
Other versions
WO2001099430A3 (en)
Inventor
Kimihiko E. Sato
Kelly Lee Myers
Original Assignee
Kyxpyx Technologies Inc.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Kyxpyx Technologies Inc. filed Critical Kyxpyx Technologies Inc.
Priority to AU67228/01A priority Critical patent/AU6722801A/en
Publication of WO2001099430A2 publication Critical patent/WO2001099430A2/en
Publication of WO2001099430A3 publication Critical patent/WO2001099430A3/en

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N 21/60 Network structure or processes for video distribution between server and client or between remote clients; Control signalling between clients, server and network components; Transmission of management data between server and client, e.g. sending from server to client commands for recording incoming content stream; Communication details between server and client
    • H04N 21/65 Transmission of management data between client and server
    • H04N 21/658 Transmission by the client directed to the server
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 19/00 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N 19/30 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using hierarchical techniques, e.g. scalability
    • H04N 19/33 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using hierarchical techniques, e.g. scalability in the spatial domain
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 19/00 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N 19/50 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using predictive coding
    • H04N 19/59 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using predictive coding involving spatial sub-sampling or interpolation, e.g. alteration of picture size or resolution
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 19/00 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N 19/60 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using transform coding
    • H04N 19/625 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using transform coding using discrete cosine transform [DCT]
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 19/00 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N 19/80 Details of filtering operations specially adapted for video compression, e.g. for pixel interpolation
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 19/00 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N 19/85 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using pre-processing or post-processing specially adapted for video compression
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N 21/60 Network structure or processes for video distribution between server and client or between remote clients; Control signalling between clients, server and network components; Transmission of management data between server and client, e.g. sending from server to client commands for recording incoming content stream; Communication details between server and client
    • H04N 21/63 Control signaling related to video distribution between client, server and network components; Network processes for video distribution between server and clients or between remote clients, e.g. transmitting basic layer and enhancement layers over different transmission paths, setting up a peer-to-peer communication via Internet between remote STB's; Communication protocols; Addressing
    • H04N 21/637 Control signals issued by the client directed to the server or network components
    • H04N 21/6377 Control signals issued by the client directed to the server or network components directed to server
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N 21/60 Network structure or processes for video distribution between server and client or between remote clients; Control signalling between clients, server and network components; Transmission of management data between server and client, e.g. sending from server to client commands for recording incoming content stream; Communication details between server and client
    • H04N 21/63 Control signaling related to video distribution between client, server and network components; Network processes for video distribution between server and clients or between remote clients, e.g. transmitting basic layer and enhancement layers over different transmission paths, setting up a peer-to-peer communication via Internet between remote STB's; Communication protocols; Addressing
    • H04N 21/643 Communication protocols
    • H04N 21/64322 IP

Definitions

  • the present invention relates to data compression, coding and transmission.
  • the present invention relates to multimedia data compression, coding and transmission, such as video and audio multimedia data over a non-acknowledged and unreliable packet based transmission mechanism.
  • multimedia data Text, Picture, Audio and Video data
  • multimedia storage the long-term archival storage of this information
  • Coding and its inverse step, decoding describe how compressed information of video and audio data are stored, sequenced, and categorized.
  • Content creation is the step where the multimedia data is created from raw sources and placed into multimedia storage.
  • the apparatus that captures, encodes and compresses multimedia data is termed the content creation apparatus.
  • the decoding, decompressing, and display apparatus for multimedia data is termed display apparatus.
  • Transmission is the step where the multimedia data is transported via the communication medium from multimedia storage to the display apparatus.
  • Rendering is the step where the multimedia data is converted from the encoded form into a video and audio representation at the display apparatus.
  • the constant aim within video data compression and transmission systems is to achieve a high degree of data compression with a minimum degree of degradation in image quality upon decompression.
  • the higher the degree of compression used, the greater will be the degradation in the image when it is decompressed.
  • audio compression systems aim to store the best quality reproduction of an audio signal with the least amount of storage.
  • a model of human perception is used so that information that is lost during compression is ordered from least likely to most likely to be perceived by the end recipient.
  • the Internet is a network of machines where any data sent by a transmitter may or may not arrive in a timely fashion, if at all, at the intended receiver.
  • Current distribution of audio and video information using the Internet relies on pre-buffering a data stream prior to the presentation of the information. This is because the distribution of video over the Internet is currently reliant on the file formats used for compression and coding of the video information. Examples of such file formats are MPEG 1, MPEG 2, MPEG 4, QuickTime, AVI, and Real Media files. In all these cases, the portion of the file that is to be viewed must be sent and reassembled in its entirety at the receiver in order to correctly render the multimedia information.
  • TCP Transmission Control Protocol
  • RTP/ RTCP / RTSP Real Time Protocol / Real Time Control Protocol / Real Time Streaming Protocol
  • This current streaming methodology can be explained as downloading a file, and starting the playback of the file prior to the entire file being received. The pause at the start of streaming prior to playback commencing is called pre-buffering, and any extended stop for refilling the buffer again when the playback point overruns the end of currently received data is called re-buffering. It is this pre-buffering and re-buffering step that this invention aims to eradicate.
  • the conventional approach to coding information is to accommodate the lowest common denominator format, which most of the world can play using the hardware and infrastructure present.
  • This file format is optimized for lowest bit rate coding, so that the fewest bits are needed in order to reconstruct a representation of the original information.
  • These formats, such as MPEG1, MPEG2, and MPEG4, are designed for small bit rate coding errors, such as are found in any kind of radio transmission (microwave, satellite, terrestrial). These formats are not designed specifically for the Internet, which is unreliable on a packet by packet basis rather than on a bit by bit basis.
  • a disadvantage of MPEG type information is that it achieves the compression rate by inter-frame coding in which only differences between successive frames are recorded.
  • the encoding typically attempts to determine the differences from both earlier and later video frames.
  • the encoder then stores only a representation of the differences between the earlier frame and the later frame.
  • the problem is that the differences are meaningless without a reference frame. This means that the stream needs to be coded for the bit rate that it expects to have available.
  • the bit rates that are specified in MPEG documentation are continuous, reliable, and reproducible.
  • the Internet is far from any of these.
  • the audio information is typically interleaved within the same media storage file in order to correctly synchronize the audio playback with the video playback.
  • the multimedia data that is determined at the encoding step must be transmitted in its entirety to the decoding, decompressing apparatus.
  • motion prediction algorithms are used in the content creation step, there is a large amount of computation required at both content creation and rendering. This then means that it is more expensive in hardware costs to do real-time content creation, and rendering.
  • the conventional approach also starts with television, and tries to reproduce it using the Internet as the transport mechanism for the information.
  • the usual way is to use either TCP or RTP transports to send the MPEG coded information across the net.
  • TCP is a protocol heavy transport, because the aim is to have the exact copy transferred from sender to receiver as fast as possible, but with no guarantee of time.
  • the conventional approach is to use pre-buffering, but in some cases, tens of seconds to several minutes of time is spent collecting pre-buffering data instead of presenting pictures with sound to the viewing and listening audience. This delay before the appearance of images or even sound can be annoying or unacceptable to the user.
  • the sizes and the limits of the video data are typically limited to height to width ratios of standard NTSC and PAL video, or 16:9 Wide-screen movie, as this is the standard source of moving pictures with audio. This should not be the only possible size.
  • still picture compression such as JPEG, PNG, and CompuServe GIF
  • JPEG, PNG, and CompuServe GIF are fairly straightforward. These are usually symmetrical algorithms, so that the content creation step is of roughly equivalent complexity to the rendering step.
  • Motion JPEG MJPEG
  • MJPEG is a system used in non-linear video editing that does just that with still JPEG files. This is simply a video storage system, and does not encompass audio as well.
  • a motion picture with audio compression, coding and transmission system and method comprising: a transmission channel utilizing a single source and destination pair of addresses for both control and data, at least one transmission relay device, and at least one reception device, along with a method of coding and compressing multimedia data such that a redundant set of variably encoded audio and text information can be sent adaptively with the video in a minimally acknowledged transmission protocol.
  • a video and audio compression, coding and transmission method and apparatus comprising: a communication channel coupled to one transmitter device, at least one transmission relay device, and at least one reception device, along with a method of coding and compressing multimedia data, such that there are multiple levels of detail and reproducible coherence in multimedia data, and such that a redundant set of variably encoded audio and text information can be sent adaptively with the video in a minimally acknowledged transmission protocol.
  • a method of encoding a frame of multimedia data from a transmitter to a receiver comprising: encoding a portion of image data of the frame into a first data packet, the first packet for use by the receiver to generate a viewable image upon receipt thereof; and encoding the remainder of the image data into a second data packet, the second packet for use by the receiver to generate, in conjunction with the first data packet, an enhanced version of the viewable image.
  • a method of receiving multimedia data from a transmitter comprising: receiving a first data packet, the first data packet for use by the receiver to generate a viewable image; receiving a second data packet, the second data packet for use by the receiver, in conjunction with the first data packet, to generate an enhanced version of the viewable image; and sending to the transmitter a request for additional packets.
  • Figure 1 shows a system for the creation, transmission, and rendering of streamed multimedia data
  • Figure 2 shows a data format to store multimedia data
  • Figure 3 shows a method for creation of multimedia data
  • Figure 4 shows a method for front half compression and encoding
  • Figure 5 shows a method for the diagonal flip step
  • Figure 6 shows a method for front half data packing
  • Figure 7 shows a method for front half decompression and decoding
  • Figure 8 shows a method for Picture Substream sample
  • Figure 9 shows pixel numbering
  • Figure 10 shows a tiled representation.
  • the present invention is a system and method that caters to the digital convergence industry. It is a format and a set of tools for content creation, content transmission, and content presentation. Digital convergence describes how consumer electronics, computers, and communications will eventually come together into being part of the same thing. The key to this happening is that the digital media, the format for sound, video, and any other data related to such, are recorded, converted and transmitted from source to destination.
  • Video a series of frames that need to be shown at a certain rate to achieve the illusion of continuous motion.
  • MPEG Moving Picture Experts Group. This is an ISO standards workgroup that concentrates on moving images.
  • JPEG Joint Photographic Experts Group. Similar ISO standards workgroup that concentrates on still images.
  • Bit single unit of information comprising either a 1 or a 0.
  • Ethernet Frame IEEE 802.x based unit of data, which is maximum 1500 octets of information.
  • IP Packet a single unit of data that can be sent unreliably across the Internet.
  • An IP packet may be broken down into smaller IP packets by a process called IP fragmentation, which is undone by the receiver using a process called IP reassembly.
  • Image Frame a single full picture from the video stream.
  • Cinema motion picture comprises 24 image frames per second.
  • NTSC television is 29.97 image frames per second.
  • Image Frame is used in this document to differentiate it from an Ethernet Frame.
  • Image Frame Rate the rapid presentation of a succession of image frames to a viewer that achieves the illusion of motion.
  • the present invention comprises a different protocol that uses the base behaviour of the Internet fabric as the base of its design.
  • This new protocol uses a 'many data packets to one acknowledgement packet' protocol, unlike TCP, which uses a 'single data packet to one acknowledgement packet' protocol.
  • TCP which uses a 'single data packet to one acknowledgement packet' protocol.
  • a client might only send an acknowledgement and request for additional data to the server every second whereas many data packets would be received by the client during a one second period of time.
  • the client would also send information to the server based on current conditions. For example, if the screen displayed to the user is relatively small then there is a limit on the amount of resolution required and the server can be asked to send an image in an appropriate fashion.
  • Another example is the rate of information that can be successfully received by the client over the Internet at that point in time. For example, if the image frame rate used for encoding is 30 images per second but the connection only allows a maximum rate of 15 frames per second then the server will be asked to transmit at this lower rate. In addition, the client can ask the server to transmit every other frame so that there is no lag between the rate of display of the image to the user and the intended image rate.
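To make the transport behaviour concrete, the following is a minimal sketch of a client for this 'many data packets to one acknowledgement packet' scheme. The address, the one-second interval, and the textual request/acknowledgement format are illustrative assumptions; the patent does not fix a wire format at this level of detail.

```python
import socket
import time

# Minimal sketch (not the patented wire format): a client receives many
# data packets over a single UDP address/port pair and sends back one
# consolidated acknowledgement/request per second on the same socket.
SERVER = ("198.51.100.7", 5004)   # hypothetical relay address
ACK_INTERVAL = 1.0                # one request per second, per the text

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.settimeout(0.1)
sock.sendto(b"REQ stream=1 rate=30", SERVER)   # hypothetical request syntax

received = 0
last_ack = time.monotonic()
while True:
    try:
        packet, _ = sock.recvfrom(2048)
        received += 1                 # buffering/decoding would happen here
    except socket.timeout:
        pass
    if time.monotonic() - last_ack >= ACK_INTERVAL:
        # One acknowledgement summarises many received packets and doubles
        # as the next request; it supersedes any previous request.
        sock.sendto(f"ACK count={received} rate=30".encode(), SERVER)
        received = 0
        last_ack = time.monotonic()
```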
  • the protocol also uses only one Internet Address and Port pair per connected client rather than RTP/RTCP/RTSP, which uses one pair for setup, teardown, and control of a session, and an additional pair or pairs for the actual data transmission.
  • RTP/RTCP/RTSP uses one pair for setup, teardown, and control of a session, and an additional pair or pairs for the actual data transmission.
  • the use of a single set of addresses is faster and consumes fewer resources.
  • the client gets information from a server in a request that maximizes data efficiency, which we define as the ratio of the useful information received to the total information received.
  • data efficiency which we define as the ratio of the useful information received to the total information received.
  • Clients request data from a server, which in turn gets its information from the data source. Clients do the work in calculating the data that is required and in prioritizing the data it needs to request, which it sends to the server on a periodic schedule.
  • the server after it receives a request and has validated it as a legitimate request, simply fulfils the request and sends the data back along the same channel as the request came in on at the exact data rate that the client requested. Any newer request immediately supersedes an older request at the server.
  • the data requested by the client is prioritized in accordance with a model of human perception.
  • the prioritization method of the present invention comprises the following (a sketch of this ordering appears after the list): • Audio data is received before video data, because it is more important for the audio to be continuous even at the cost of losing an image frame of video.
  • Image frame rate may be scaled back to accommodate a smaller transmission channel.
  • Picture quality may be scaled back to accommodate a smaller transmission channel.
  • Audio quality in bit rate, or number of channels (stereo / mono) may be scaled back to accommodate a smaller transmission channel.
  • the request is tailored to current Internet conditions between the server and the client so that a complete stop in the playback of the program occurs as the last resort in the case of catastrophic and long-term stoppage in the continuity of the Internet fabric.
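The prioritization just listed can be summarized as a budgeted ordering. The sketch below is a hypothetical illustration: the item names and byte budget are ours, not the patent's, but the ordering mirrors the bullets above, with continuous audio surviving any cutback first.

```python
# Sketch of the request prioritization described above; names and the
# budget mechanism are illustrative assumptions.
def build_request(budget_bytes, needs):
    """Order candidate items so the least perceivable data is cut first."""
    priority = [
        ("audio_low",    needs["audio_low"]),     # continuous audio first
        ("front_half",   needs["front_half"]),    # a viewable image
        ("audio_high",   needs["audio_high"]),    # better audio quality
        ("extra_frames", needs["extra_frames"]),  # higher frame rate
        ("back_half",    needs["back_half"]),     # full picture quality
    ]
    request, used = [], 0
    for name, size in priority:
        if used + size <= budget_bytes:
            request.append(name)
            used += size
    return request
```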
  • the present invention also requires a new format for the multimedia data that has unique characteristics. It is desirable that the data be stored at the information distribution point (the server) in a form that has the following characteristics:
  • the compressed audio data can be quickly parsed so that the data corresponding to any particular time and duration in the program can be easily found.
  • This compressed audio data is packed into chunks of information that do not rely on any other data chunk being previously received.
  • the compressed video information can be quickly parsed so that compressed data corresponding to any image frame can be easily found.
  • the compressed image frame data can be further parsed so that a low quality representation of the image frame, or a smaller sized representation of the image frame can be reconstructed by selectively removing data.
  • the information required to modify this previously described image to be a full quality or a full sized image simply requires the previously removed data to be applied.
  • MPEG and H.320 / T.120 do not fit these criteria as they are inter-frame based, i.e. only the differences between successive image frames are recorded.
  • the information for a single random image frame would require that the full base frame (I frame) be found, decompressed, and all required inter-frame difference patches (P frames, B frames, and BP frames) be applied before a single random image frame can be made.
  • the full quality and full size is the only choice for the data for the whole program. Of course the entire program could be encoded small, or at a lower quality. The size and quality level is decided at encode time, and there is no provision to switch on the fly once a program is requested.
  • the stream is tolerant of bit errors, and is coded for the bit rate that it expects to be available in a continuous fashion. This is the case for terrestrial or satellite digital television transmission, but is not the case for the Internet, which is packet based, and is highly unpredictable on a packet-by-packet basis.
  • a moving still picture compression format is required for this to work.
  • Several examples of such a format and system for video and audio that fit these criteria are described, which are based on standard continuous tone compression algorithms such as JPEG, JPEG2000, PNG, and GIF89.
  • standard continuous tone compression algorithms such as JPEG, JPEG2000, PNG, and GIF89.
  • a further example data format is provided in the detailed description.
  • the system of the present invention encodes the information at the highest size and quality that will be requested by any client. It is then encoded in a way that makes it easy during the playback for a client to downscale the quality or size based on the conditions of the network.
  • the present invention relies on a transport mechanism having a single bidirectional channel with more or less uniform propagation time from sender to receiver, and the reverse.
  • Straight unicast and multicast UDP/IP are examples although we are not limited to these at all.
  • the present invention uses a moving still picture compression format.
  • any continuous tone image compression algorithm such as, but not limited to, the discrete cosine transform (DCT) in JPEG, or the Wavelet image compression algorithm in JPEG2000, can be used to compress the frames.
  • DCT discrete cosine transform
  • JPEG2000 Wavelet image compression algorithm
  • the audio corresponding to the film is captured separately, and then encoded within the frame data as comment fields.
  • the audio can be encoded by any audio compression algorithm.
  • a system has a multimedia content creation apparatus capable of receiving audio and video data.
  • Multimedia data created by the content creation apparatus can be transmitted to and stored in data storage devices.
  • the created multimedia data may be stored alongside data from other sources.
  • the multimedia data is accessible by a server apparatus or server having a multimedia transmission device, such as a modem, for transmitting data across a telecommunications network, such as the Internet, to a client apparatus or client.
  • a multimedia transmission device such as a modem
  • the data is transmitted over the Internet using one or more relays or routers. These relays, however, are not necessary and a direct connection including a wireless connection between the server and the client is possible as well.
  • the client includes a data receipt device, such as a modem, and a multimedia rendering apparatus that includes a display device such as a conventional monitor or a wide screen projector.
  • the source of the program is either the point where data is encoded in a live stream, or the point where the program data is controlled and then distributed to all relays, which then feed all reception devices.
  • Relay devices need to subscribe from either this source device or from another relay, and may, through a process of trans-coding, convert from a source format of data into the form of data required by this invention.
  • the relay device may also be the source device, but that is not required.
  • Relay devices listen for control requests from any display devices on a single UDP address/port destination pair. These requests may be authenticated and subsequently discarded if they do not pass the authentication. Validated requests are entered into a work queue for fulfillment.
  • Fulfillment of a request consists of retrieving the exact data that is requested by the display device, and sending it in a controlled schedule back through the same channel that the request was received on.
  • the data format should be such that another subset of this data can be obtained that can be applied in conjunction with the previously mentioned smaller, or lower quality image can result in a larger or higher quality image.
  • Faster frame rates can be achieved by simply requesting more images.
  • This initial smaller, lower quality image data subset is termed the "Front half data".
  • the data that needs to be applied to this to obtain the larger or higher quality image is termed the "Back half data".
  • This system requires the use of a still image continuous tone compression algorithm, such as JPEG, MPEG (I-frame only), or JPEG 2000.
  • JPEG 2000 has the capabilities inherent in order to obtain subsets of data at a reduced quality level or a smaller size, although further improvements can be made that use "hints" stored within the image as a series of binary comment fields.
  • the audio corresponding to the film is captured separately, but time coded to match, and then encoded using an audio compression algorithm such as MPEG 2 Layer 3.
  • the requested time period of audio data needs to be quickly obtained from within this set of audio data, and may be further transformed at transmission time with a forward error correction algorithm that allows for immediate recovery in the case of single audio packet loss.
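The text does not name the forward error correction algorithm, only that a single lost audio packet must be immediately recoverable. One standard construction with that property is an XOR parity packet over a small group of equal-length audio packets, sketched here as an assumed example:

```python
# Single-loss recovery with an XOR parity packet. The patent does not
# name its FEC scheme; XOR parity is one standard choice with exactly
# this property (any one lost packet in the group can be rebuilt).
def xor_bytes(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

def add_parity(packets):
    """Append one parity packet covering equal-length audio packets."""
    parity = packets[0]
    for p in packets[1:]:
        parity = xor_bytes(parity, p)
    return packets + [parity]

def recover(group, lost_index):
    """Rebuild the single missing packet from the survivors."""
    survivors = [p for i, p in enumerate(group) if i != lost_index]
    rebuilt = survivors[0]
    for p in survivors[1:]:
        rebuilt = xor_bytes(rebuilt, p)
    return rebuilt

group = add_parity([b"\x01\x02", b"\x03\x04", b"\x05\x06"])
assert recover(group, 1) == b"\x03\x04"   # lost packet restored
```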
  • Audio packets may be further encoded as comment fields within the previously mentioned image packets and sent packaged with the motion picture, but separate audio and picture packets may be used as well. All data received at the reception device is buffered in a method that allows for a periodic update control signal to be sent to the relay device, which supercedes any previously sent control signal. The reception device processes information regarding the received data, and decides the priority of the next required data packets.
  • This prioritization is done by the reception device in order to ensure that at all times the data playback will not run out of information.
  • the buffers at the reception device are purged. Only audio data is requested in a burst in order to prime the audio playback pump. Front half data at a low frame rate is then primed into the buffers so that at all times there is some movement that can be rendered without a stop in motion.
  • back half data is requested as well as front half data at a higher frame rate.
  • the basic aim of this prioritization scheme is to maximize the use of the buffers within the routing system of the Internet, in order to present an experience similar to standard television.
  • higher security can be obtained in a graded fashion, so that there may be encryption of the data at storage or during transmission, and the back half data may be set at a higher security level than the front half data.
  • a further innovation is to flip the image diagonally prior to compression.
  • This innovation is useful only because most image file formats store data sequentially in the file in raster order, i.e. rows, not columns.
  • the aim is to obtain the horizontal subset data without requiring a decompression and recompression of the full image.
  • a column of pixels is transformed into a row of pixels, which corresponds to a conventional row oriented storage of pixels in memory.
  • the image flipping is undone for the small subset image. This introduces further work at both the compression step and the decompression step, but it facilitates left to right reduction of the image without the server having to decompress the entire image.
  • This diagonal flip is optional and need not be present to practice the invention. It is, however, desirable as it permits data corresponding to a vertical slice of a wide horizontal image to be obtained more efficiently. For example, if the image is a panoramic view of nearly 360 degrees but only a 45 degree slice centred on the North is desired then this innovation permits its efficient extraction.
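A minimal sketch of the diagonal flip using NumPy: transposing the image makes each original column a contiguous row, so a vertical slice of the original (the panorama window mentioned above) becomes a run of whole rows in row-ordered storage. The image size follows the PAL example used later in this document.

```python
import numpy as np

# After transposing, each column of the original becomes a row, so the
# server can pull a vertical slice of the original image as a contiguous
# band of rows without touching the rest of the picture.
original = np.arange(576 * 720).reshape(576, 720)   # height x width
flipped = original.T                                # 720 x 576

cols = slice(320, 400)          # columns wanted from the original image
band = flipped[cols, :]         # contiguous rows in the flipped storage
assert (band == original[:, cols].T).all()

restored_slice = band.T         # rendering undoes the flip
```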
  • Another coding innovation is to encode a reduced resolution image with a low bit rate coded audio signal as the front half of the encoded frame data.
  • the information required to modify this image into a higher resolution and higher quality image, as well as the corresponding high frequency encoded audio is encoded as differences in a back half of the encoded data.
  • the back half components can be encrypted at high security levels, which allows for a lower quality rendition to be available for a lower price, etc.
  • a receiver receives from a transmitter multimedia data through an unacknowledged unreliable packet based transmission medium.
  • the image data can be provided by first and second packets.
  • the first packet contains data corresponding to a portion of an image.
  • the second packet contains data corresponding to the remainder of the image.
  • the receiver uses the first packet to generate the portion of the image, which is typically the originally captured image but reduced in quality, size, or resolution. If the receiver receives first and second packets then the receiver uses the data in the first and second packets to generate an enhanced image corresponding to the originally captured image.
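As a hedged illustration of this two-packet behaviour, the sketch below treats the first packet as a quarter-resolution image that is viewable on its own, and the second packet as a plane of per-pixel differences that upgrades it. The field names, the nearest-neighbour upscale, and the delta coding are assumptions, not the patent's wire format.

```python
import numpy as np

# First packet alone -> viewable reduced image; first + second -> the
# enhanced, full-resolution image. Reconstruction details are assumed.
def decode(first, second=None):
    base = first["pixels"]                        # viewable immediately
    if second is None:
        return base
    full = np.repeat(np.repeat(base, 2, axis=0), 2, axis=1).astype(np.int16)
    return np.clip(full + second["deltas"], 0, 255).astype(np.uint8)
```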
  • audio data or audio information can be transmitted by packets. Typically, audio data corresponding with the image data and for synchronous reproduction therewith is transmitted by a set of one or more packets.
  • This set of one or more packets can be distinct from the first and second packets for the image or the set of one or more packets can include the first packet, the second packet or both.
  • ancillary multimedia data and tertiary information can be sent by another set of one or more packets which may or may not be the same as the first set of one or more packets and may or may not include either of the first and second packets.
  • RGB Red, Green, and Blue channels
  • CCIR Digital TV industry committee
  • R red
  • G green
  • B blue
  • R11 represents the red component of the pixel in the upper right hand corner located at (1,1).
  • a standard movie encoded to 30 frames per second for television usually undergoes a process called 3/2 pulldown, which means that every fourth frame is doubled. This means that no extra information is being conveyed in that last frame, so we might as well capture only the unique frames.
  • a single frame of this information is referred to as the Raw Video Data (RVD), and all these frames collectively are referred to as the Raw Video Data Stream (RVDS).
  • RVD Raw Video Data
  • RVDS Raw Video Data Stream
  • each frame is noise filtered and diagonally flipped to become a new image where the horizontal lines correspond to the columns of the original image.
  • FVD Flipped Video Data
  • the FVD is converted into a new image that is half the width and half the height by a process of collecting every other pixel. It is important that this is collected and not averaged with adjoining pixels.
  • This frame of information is referred to as the Front Half Video Data (FHVD), and is then converted into YUV format. In this example it is the lower right pixel of each 2 by 2 block that is collected.
  • FHVD Front Half Video Data
  • the pixels that have not been collected into the FHVD are collected and encoded.
  • This new representation of the data is now referred to as the Back Half Video Data (BHVD), and consists of four planes: the delta left intensity plane (dLYP), the delta right intensity plane (dRYP), the delta U plane (dUP) and the delta V plane (dVP).
  • dLYP delta left intensity plane
  • dRYP delta right intensity plane
  • dUP delta U plane
  • dVP delta V plane
  • each plane has elements that have eight (8) bits of precision. That is for efficiency of implementation in software and should not be a restriction on hardware implementations.
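A minimal NumPy sketch of this split for the intensity plane follows. The lower right pixel of each 2 by 2 block is collected (not averaged) into the FHVD, and remaining pixels are expressed as differences against it; exactly which neighbours feed dLYP and dRYP is not spelled out above, so the assignment below is an assumption.

```python
import numpy as np

# Front half / back half split on the Y (intensity) plane. The patent
# stores the delta planes at 8-bit precision; int16 is kept here for
# clarity. The dLYP/dRYP neighbour assignment is our assumption.
def split_intensity(y):
    fhvd_y = y[1::2, 1::2].copy()                        # collected pixels
    d_left = y[1::2, 0::2].astype(np.int16) - fhvd_y     # assumed dLYP
    d_upper = y[0::2, 1::2].astype(np.int16) - fhvd_y    # assumed dRYP
    return fhvd_y, d_left, d_upper

y = np.random.randint(0, 256, (576, 720), dtype=np.uint8)  # PAL-sized FVD
fhvd_y, d_left, d_upper = split_intensity(y)
assert fhvd_y.shape == (288, 360)                          # FHVD size
```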
  • Each plane is put through a continuous tone grey scale compression algorithm, such as a single plane JPEG.
  • the FHVD, dLYP, dRYP, dUP, and dVP are divided into horizontal bands, which correspond to vertical bands of the original image.
  • the FVD of (576x720) becomes a FHVD of (288x360) consisting of four bands each sized (288x90). It is allowable to have a single band encompassing the entire image, and for efficiency it is suggested that a power of two number of bands be used.
  • the FHVD is compressed in the three equally sized component planes of YUV using a continuous tone image compression algorithm such as, but not limited to, JPEG. Each of these planes is (288x360).
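As a stand-in for the single-plane continuous tone compressor mentioned above, the following sketch pushes one 8-bit plane through Pillow's baseline JPEG encoder; any grey scale continuous tone codec could be substituted.

```python
import io
import numpy as np
from PIL import Image

# Compress one 8-bit plane with a continuous tone grey scale codec,
# here Pillow's baseline JPEG as an illustrative stand-in encoder.
plane = np.random.randint(0, 256, (288, 360), dtype=np.uint8)
buf = io.BytesIO()
Image.fromarray(plane, mode="L").save(buf, format="JPEG", quality=75)
compressed = buf.getvalue()

decoded = np.asarray(Image.open(io.BytesIO(compressed)))  # lossy round trip
```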
  • the FHVD and the FHAD (Front Half Audio Data) are interleaved with frame specific information such that the audio data, video data and padding are easily parsable by a server application.
  • This is referred to as the Front half data (FHDATA).
  • FHDATA Front half data
  • this FHDATA should be parsable by any standard JPEG image tool, and any padding, extra information, and audio are discarded.
  • This image is of course diagonally flipped, and needs to be flipped back.
  • the FHAD is duplicated across a range of successive corresponding frames. This is so that only one of a sequence of successive frames needs to be received in order to be able to reproduce a lower quality continuous audio representation.
  • the BHVD and BHAD (Back Half Audio Data) are stored following the FHDATA in a way so that the server can easily pull individual bands of the information out from the data.
  • the BHAD is duplicated in a successive range of corresponding frames. This is similar to the FHAD in the FHDATA but the difference is in how redundant the information is when dealing with high frequency data.
  • the aim is to have some form of audio available as the video is presented.
  • the BHVD and BHAD interleaved in this form is called the back half data (BHDATA).
  • a frame header (FRAMEHEADER), the FHDATA and the BHDATA put together is the complete frame data (FRAMEDATA).
  • a continuous stream of FRAMEDATA can be converted to audio and video. This is referred to as streamed data (STREAMDATA).
  • a subset of FRAMEDATA can be constructed by the video server device. This is referred to as subframe data (SUBFRAMEDATA) and a continuous stream of this information decimated accordingly is referred to as subsampled stream data (SUBSTREAMDATA).
  • a collection of FRAMEDATA with a file header is an unpacked media file (MEDIAFILE), and a packed compressed representative of a MEDIAFILE is a packed media file (PACKEDMEDIAFILE).
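One way to picture the container hierarchy just defined is the following hypothetical layout. The names FRAMEHEADER, FHDATA, BHDATA, FRAMEDATA, SUBFRAMEDATA, and MEDIAFILE come from the text, but the specific fields and the band-dropping rule for SUBFRAMEDATA are illustrative assumptions.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class FrameHeader:                 # FRAMEHEADER
    frame_number: int
    timestamp_ms: int

@dataclass
class FrameData:                   # FRAMEDATA = header + FHDATA + BHDATA
    header: FrameHeader
    fh_data: bytes                 # FHDATA: parsable as a standard JPEG
    bh_bands: List[bytes] = field(default_factory=list)  # BHDATA bands

@dataclass
class MediaFile:                   # MEDIAFILE: file header + frames
    file_header: bytes
    frames: List[FrameData] = field(default_factory=list)

def subframe(frame: FrameData, bands_wanted: int) -> FrameData:
    """Build SUBFRAMEDATA by decimation: drop trailing back half bands."""
    return FrameData(frame.header, frame.fh_data, frame.bh_bands[:bands_wanted])
```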
  • the server apparatus will either read a MEDIAFILE or capture from a live video source, and create a STREAMDATA that goes to a relay apparatus.
  • a client apparatus contacts a relay apparatus and requests a certain STREAMDATA.
  • the relay will customize a SUBSTREAMDATA based on the current instantaneous network conditions and the capabilities of the client apparatus, and by specific user request such as, but not limited to, pan and scan locations.
  • SUBFRAMEDATA is created from the FRAMEDATA by a process of decimation, which is the discarding of information selectively.
  • the algorithm for discarding is variable, but the essence is to discard unnecessary information, and least perceivable information first.
  • the audio data is pulled from the SUBFRAMEDATA. If BHAD exists, then it is stored accordingly. FHAD always exists in a SUBFRAMEDATA and is stored accordingly.
  • the FHVD, which is always available, is decompressed accordingly into its corresponding YUV planes. This is stored accordingly.
  • [Y11] [U1] [Y21] [V1] is the YUYV representation of the left two pixels; [Y12] [U2] [Y22] [V2] is the YUYV representation of the right two pixels.
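Written as code, the interleaving just given for one 2 by 2 pixel block is:

```python
# YUYV interleaving for one 2x2 block: the left pixel pair shares U1/V1
# and the right pair shares U2/V2, exactly as listed above.
def pack_yuyv(y11, y21, u1, v1, y12, y22, u2, v2):
    left_pair = bytes([y11, u1, y21, v1])     # [Y11][U1][Y21][V1]
    right_pair = bytes([y12, u2, y22, v2])    # [Y12][U2][Y22][V2]
    return left_pair + right_pair
```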
  • an optional filtering step can be done to further remove any visual artifacts introduced during compression and decompression.
  • the available image is displayed at the appropriate time in the sequence. If high quality audio is available, then it is played on the audio device, otherwise the lower quality audio sample is used.
  • the client monitors the number of frames that it managed to receive and it managed to decompress/process. This is reported back to the server which then scales up or down the rate and the complexity of the data that is sent. According to the present example, the client will send a request for additional information every second. Of course, other schemes can be used.
  • the data format in the above example used front half video data consisting of one quarter of the pixels in an image, with the back half video data comprising the remaining pixels.
  • a lower quality image (consisting of the front half video) can be extracted without decompressing the entire image.
  • a higher quality image can be displayed by adding the back half video data to the front half video data to recover the image as originally encoded.
  • This arrangement is an example of how a subset of the full image data can be extracted with a lower image quality or a smaller size without requiring a full decompression of the image as a preliminary step.
  • the remaining data may be requested later, which, when applied as differences to the earlier subset data, will result in the full quality, full sized image.
  • Another example can be found in JPEG 2000, where it is possible to extract a subset of the image data to obtain either a lower quality image, a smaller image, or a sequence at a lower frame rate.
  • This algorithm can be extended to 3 levels by having a front third, middle third, and back third.
  • the server can send the first third, the front two thirds, or the whole encoded frame, as desired.
  • Other variants, including additional levels, are also possible.
  • television variants such as 29.97 fps, 30 fps, and 25 fps can be downscaled to 24 frames per second by frame decimation (throwing away frames).
  • 30 fps is another suitable frame rate for storage, as it can easily be decimated to many lower frame rates, with very little difference in perception to the average human eye.
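For example, decimating a 30 fps stream to 24 fps means dropping one frame in five, since 24/30 = 4/5:

```python
# 30 fps -> 24 fps by frame decimation: drop every fifth frame.
def decimate_30_to_24(frames):
    return [f for i, f in enumerate(frames) if i % 5 != 4]

assert len(decimate_30_to_24(list(range(30)))) == 24
```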
  • Any continuous tone compression algorithm can be substituted for DCT in JPEG.
  • a suggested alternative is Wavelet Image compression, or fractal image compression.
  • Any audio rate and multispeaker/stereo/mono/subwoofer combination can be used for the high quality and low quality audio signal.
  • any rectangular picture size is possible.
  • 16:9 width to height picture ratios of theatrical releases can be captured using a square pixel or a squashed pixel.
  • Black band removal can be done either on a frame by frame basis, or across the whole video stream.
  • Postbuffering can be done by the relay, so that the last n FRAMEDATA elements are stored. Any new client can have these FRAMEDATA or SUBFRAMEDATA burst across the communication channel at maximum rate to show something while the rest of the information is being prepared.
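A minimal sketch of this post-buffering, using a fixed-length ring buffer of the last n FRAMEDATA elements (the value of n and the callback names are illustrative):

```python
from collections import deque

POSTBUFFER_FRAMES = 30                      # illustrative n
recent = deque(maxlen=POSTBUFFER_FRAMES)    # old frames fall off automatically

def on_frame(frame_data):
    recent.append(frame_data)               # relay keeps the last n frames

def on_new_client(send):
    for frame in recent:                    # burst at maximum channel rate
        send(frame)
```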
  • Other data that can be encoded using the present invention includes tertiary information and other multimedia types are available such as text, force feedback cues for e.g. selectively controlling an ancillary multimedia device, closed captioning, etc.
  • the client device can send multiple cues and requests. If the source is encoded appropriately, then multiple angle shots can be stored for either a client controlled pan, or as a client controlled position around a central action point. There is a mechanism for selectively requesting computer generated video streams to be created and presented based on user preferences.
  • a method of multimedia transmission comprises: sending a signal from client to server specifying the line conditions for multimedia rendering so that the multimedia data that is supplied can be modified as conditions change.
  • the signal specifies the method by which the full multimedia data is reduced into a fully presentable subset depending on line conditions, direct user control, and demographic positioning.
  • the signal also specifies the methods for direct user control of the requested multimedia data, so that audio modification via mixing, equalization, computer controlled text to voice additions, and language selection can be provided by the transmission server.
  • the signal can also specify a demographic of the audience.
  • the signal can also contain encryption and authentication data so that the client is identified and is provided multimedia data in accordance to the exact request of the audience.
  • the signal is transmitted through an unpredictable and unreliable communication channel in such a way that acknowledgement is required based on time elapsed rather than by amount of information received.
  • the signal is transmitted as full frames of video with sub-sampled redundant sets of audio and text information in such a way that the probability that some form of playable audio of some quality is available at all times is maximized.
  • the signal includes a decimated picture header so that a simplified rendering device can be constructed.
  • a multimedia compression and coding method comprising: capturing and compressing a video signal using a discrete cosine transform based video compression algorithm similar to JPEG, whereby the information is ordered in the multimedia data stream from top to bottom in sets of interleaved columns rather than left to right in sets of progressive rows.
  • the multimedia data stream has sets of columns interleaved into sparse tiles in a way that allows for fast parsing at the transmitter.
  • the multimedia data stream is also stored using interleaved luminance and chrominance values in YUV4:2:2 format in variably sized picture element processing tiles that are greater than 8 by 8 matrixes, in units that are powers of two, such as but not limited to 64 by 64 and 128 by 128 matrixes.
  • the multimedia data stream is also stored as a lower resolution decimated JPEG image as a header with the required information to reconstruct a higher resolution image stored as a secondary and tertiary set of information in comment blocks of the JPEG, or as additional data elements that may or may not be transmitted at the same time as the header; both put together are termed in this document comment type information.
  • the multimedia data stream has comment type information variably encrypted and authenticated in such a way that the origin of the source, and the legitimacy of the video requester can be controlled and regulated.
  • the multimedia data stream has audio, text, and force feedback information encoded as comment type information within the image file, so that standard picture editing software will parse the file, yet not store or extract the additional multimedia information.
  • the multimedia data stream has audio encoded with variable sampling rates and compression ratios, and then packaged as comment type information in such a way that a long time period of low quality audio and short periods of higher quality audio is redundantly transmitted.
  • the multimedia data stream has other types of multimedia information, such as but not limited to text and subtext, language and country specific cues, force feedback cues, control information, and client side 3- d surface model rendering and texture information, program flow elements, camera viewpoint information encoded as comment type information.
  • a multimedia content creation apparatus comprises software or hardware that can take an industry standard interface for capturing audio, video, and other types of multimedia information, such as but not limited to text and subtext, language and country specific cues, force feedback cues, control information, and client side 3-d surface model rendering and texture information, program flow elements, camera viewpoint information, and then compressing and encoding the information into a multimedia data stream format as described above and then storing the data into a multimedia data store.
  • multimedia information such as but not limited to text and subtext, language and country specific cues, force feedback cues, control information, and client side 3-d surface model rendering and texture information, program flow elements, camera viewpoint information, and then compressing and encoding the information into a multimedia data stream format as described above and then storing the data into a multimedia data store.
  • a multimedia transmission apparatus comprises a multimedia data store that will, on an authenticated or unauthenticated request, transmit the previously described multimedia data stream to another multimedia transmission apparatus in its entirety.
  • a multimedia transmission relay will, on an authenticated or unauthenticated request, set up a network point that one or many multimedia data stores can transmit to, and from which one or many multimedia rendering apparatuses can request said multimedia data.
  • the apparatus can, based on time specified acknowledgement information, modify the information that is presented by a process of parsing, merging, and filtering in such a way that required information is always sent redundantly, and less important information is removed first, based on a selection criteria specified by the multimedia rendering apparatus.
  • the apparatus can collect and store information based on the audience demographic, and may or may not modify the multimedia data stream to accommodate visual cues and market based product placement. Information that has already been sent is post-buffered so that at the request from the multimedia rendering apparatus, the missing information can be retransmitted at faster than real time rates.
  • a multimedia rendering apparatus comprises: a software program or hardware device that can receive, through some communication channel in a timely manner from reception time, the previously mentioned multimedia data stream and will produce a video picture stream and audio stream that can be presented to an audience. The multimedia rendering apparatus can present all other types of multimedia information, such as but not limited to text and subtext, language and country specific cues, force feedback cues, control information, and client side 3-d surface model rendering and texture information, program flow elements, camera viewpoint information.
  • the multimedia apparatus can be, but need not be, a stand-alone application, a plug in for an existing application, a standalone piece of hardware, or a component for an existing piece of hardware that may or may not have been originally intended for use as a multimedia rendering device, but can be easily modified to be such a device.
  • the video compression method and system according to the invention allows: • multimedia data to be requested by the display device and transmitted through an unpredictable transmission channel adapting to the capabilities of the display device and the reliability of the communication.
  • multimedia data that the system sends to adapt by reducing the amount of data selectively, in such a way that the least perceived data, such as high frequency audio, higher frame rate, and possibly even stereo separation, is selectively removed from the transmission first.
  • multimedia data to be encoded in such a way that multiple levels of audio and video can be reduced to the required level for that particular display device and the current communications capacity with minimal calculations.

Abstract

The invention is a method and system for delivering a television-like viewing experience with the ability to start and switch programs in a near-instantaneous fashion using any non-acknowledged and unreliable packet based transmission mechanism such as the Internet. The embodiment of this invention is a new data transmission protocol based on minimal acknowledgement, combined with unique multimedia data coding methods. The user experiences near instantaneous multimedia playback upon requesting a viewing channel, without the typical pre-buffering step and without any subsequent re-buffering step. Any change in the selected viewing channel can be done with sub-second response time, and at no time during the user experience, barring catastrophic extended breakdown of the Internet fabric, is the viewing experience stopped for re-buffering of information. Program channel changing, multiple camera angle programs, multiple audio track programs, pan and scan programs, and pointer device feedback are a few of the advanced delivery possibilities that can be achieved with this transmission protocol and data coding method, with a minimal cost in transmission overhead that adapts continuously to the bandwidth available between client and server.

Description

MULTIMEDIA COMPRESSION, CODING AND TRANSMISSION METHOD AND SYSTEM
FIELD OF THE INVENTION
The present invention relates to data compression, coding and transmission. In particular, the present invention relates to multimedia data compression, coding and transmission, such as video and audio multimedia data over a non-acknowledged and unreliable packet based transmission mechanism.
BACKGROUND OF THE INVENTION
Text, Picture, Audio and Video data are generically termed multimedia data, and the long-term archival storage of this information is termed multimedia storage.
Coding and its inverse step, decoding, describe how compressed information of video and audio data are stored, sequenced, and categorized. Content creation is the step where the multimedia data is created from raw sources and placed into multimedia storage. The apparatus that captures, encodes and compresses multimedia data is termed the content creation apparatus. The decoding, decompressing, and display apparatus for multimedia data is termed display apparatus. Transmission is the step where the multimedia data is transported via the communication medium from multimedia storage to the display apparatus. Rendering is the step where the multimedia data is converted from the encoded form into a video and audio representation at the display apparatus.
The constant aim within video data compression and transmission systems is to achieve a high degree of data compression with a minimum degree of degradation in image quality upon decompression. Generally speaking, the higher the degree of compression used, the greater will be the degradation in the image when it is decompressed. Similarly, audio compression systems aim to store the best quality reproduction of an audio signal with the least amount of storage. In both audio and video compression, a model of human perception is used so that information that is lost during compression is ordered from least likely to most likely to be perceived by the end recipient.
The Internet is a network of machines where any data sent by a transmitter may or may not arrive in a timely fashion, if at all, at the intended receiver. Current distribution of audio and video information using the Internet relies on pre-buffering a data stream prior to the presentation of the information. This is because the distribution of video over the Internet is currently reliant on the file formats used for compression and coding of the video information. Examples of such file formats are MPEG 1, MPEG 2, MPEG 4, QuickTime, AVI, and Real Media files. In all these cases, the portion of the file that is to be viewed must be sent and reassembled in its entirety at the receiver in order to correctly render the multimedia information. Current systems use either TCP (Transmission Control Protocol) or RTP/ RTCP / RTSP (Real Time Protocol / Real Time Control Protocol / Real Time Streaming Protocol) to ensure contiguous and error free data reception. This current streaming methodology can be explained as downloading a file, and starting the playback of the file prior to the entire file being received. The pause at the start of streaming prior to playback commencing is called pre-buffering, and any extended stop for refilling the buffer again when the playback point overruns the end of currently received data is called re-buffering. It is this pre-buffering and re-buffering step that this invention aims to eradicate.
The conventional approach to coding information is to accommodate the lowest common denominator format, which most of the world can play using the hardware and infrastructure present. This file format is optimized for lowest bit rate coding, so that the fewest bits are needed in order to reconstruct a representation of the original information. These formats, such as MPEG1, MPEG2, and MPEG4, are designed for small bit rate coding errors, such as are found in any kind of radio transmission (microwave, satellite, terrestrial). These formats are not designed specifically for the Internet, which is unreliable on a packet by packet basis rather than on a bit by bit basis.
Conventional video and audio compression algorithms and coding systems rely heavily on committee based standards work, such as MPEG2 from the MPEG committee, or H.261 from the ITU-T. These describe a multimedia data file in multimedia storage that can more or less be transmitted error-free and at a reliable rate to the decoding and decompression apparatus.
A disadvantage of MPEG type information is that it achieves the compression rate by inter-frame coding in which only differences between successive frames are recorded. The encoding typically attempts to determine the differences from both earlier and later video frames. The encoder then stores only a representation of the differences between the earlier frame and the later frame. The problem is that the differences are meaningless without a reference frame. This means that the stream needs to be coded for the bit rate that it expects to have available. The bit rates that are specified in MPEG documentation are continuous, reliable, and reproducible. The Internet is far from any of these. The audio information is typically interleaved within the same media storage file in order to correctly synchronize the audio playback with the video playback. In conventional systems, the multimedia data that is determined at the encoding step must be transmitted in its entirety to the decoding, decompressing apparatus. When motion prediction algorithms are used in the content creation step, there is a large amount of computation required at both content creation and rendering. This then means that it is more expensive in hardware costs to do real-time content creation and rendering.
The conventional approach also starts with television, and tries to reproduce it using the Internet as the transport mechanism for the information. In order to achieve this, the usual way is to use either TCP or RTP transports to send the MPEG coded information across the net. TCP is a protocol heavy transport, because the aim is to have an exact copy transferred from sender to receiver as fast as possible, but with no guarantee of time. In order to compensate for unreliable and unpredictable transmission channels, such as the Internet, the conventional approach is to use pre-buffering, but in some cases, tens of seconds to several minutes of time is spent collecting pre-buffering data instead of presenting pictures with sound to the viewing and listening audience. This delay before the appearance of images or even sound can be annoying or unacceptable to the user. It is generally preferable to have audio sound and visual images even of moderate quality, for example a slow image frame rate or even a still image with audio, rather than a blank screen while waiting for additional multimedia data to be received. Such an approach is based on a model of human perception and preferences. Consider the familiar example of changing channels on a television set: having to wait ten seconds before receiving any image and sound can seem quite a long time, whereas the wait is less noticeable and likely less irritating if there is near instantaneous audio sound and visual images available, albeit at reduced quality. Of course, the image and sound will increase in quality once any initial delays are over and the system adjusts to the conditions of the transmission medium (e.g. the Internet).
Conventional systems require that the width, height, compression level, and audio quality are all determined at the time of compression. Even if the decoding and decompression apparatus is capable of handling a much higher quality image, the resulting experience for the user has been limited to the least capable playback device. There is no mechanism for altering the experience on the fly based on the capabilities of the display apparatus. Accordingly, the creator of the MPEG type file needs to decide at creation time what the size and quality of the image is, and the quality of the audio. If a smaller, lower quality derivative of this image or this audio is required, the creator needs to make a separate file for each variant.
The size of the video data is typically limited to the height-to-width ratios of standard NTSC and PAL video, or 16:9 wide-screen movie, as these are the standard sources of moving pictures with audio. This should not be the only possible size. Parallel to the development of video and audio compression, still picture compression formats, such as JPEG, PNG, and CompuServe GIF, are fairly straightforward. These are usually symmetrical algorithms, so that the content creation step is of roughly equivalent complexity to the rendering step. When still pictures are displayed in succession at a high enough frame rate, the illusion of motion is created. Motion JPEG (MJPEG) is a system used in non-linear video editing that does just that with still JPEG files. This is simply a video storage system, and does not encompass audio as well. There is a need for a new type of video compression method that overcomes the above-noted deficiencies in the prior art methodologies.
SUMMARY OF THE INVENTION
It is an object of the present invention to mitigate or obviate at least one disadvantage of known multimedia coding and transmission methods.
According to an aspect of the invention, there is provided a motion picture with audio compression, coding and transmission system and method comprising: a transmission channel utilizing a single source and destination pair of addresses for both control and data, at least one transmission relay device, and at least one reception device, along with a method of coding and compressing multimedia data such that a redundant set of variably encoded audio and text information can be sent adaptively with the video in a minimally acknowledged transmission protocol.
According to a further aspect of the invention, there is provided a video and audio compression, coding and transmission method and apparatus comprising: a communication channel coupled to one transmitter device, at least one transmission relay device, and at least one reception device, along with a method of coding and compressing multimedia data, such that there are multiple levels of detail and reproducible coherence in the multimedia data, and such that a redundant set of variably encoded audio and text information can be sent adaptively with the video in a minimally acknowledged transmission protocol. According to a further aspect of the present invention there is provided a method of encoding a frame of multimedia data from a transmitter to a receiver comprising: encoding a portion of image data of the frame into a first data packet, the first packet for use by the receiver to generate a viewable image upon receipt thereof; and encoding the remainder of the image data into a second data packet, the second packet for use by the receiver to generate, in conjunction with the first data packet, an enhanced version of the viewable image.
According to a still further aspect of the present invention there is provided a method of receiving multimedia data from a transmitter comprising: receiving a first data packet, the first data packet for use by the receiver to generate a viewable image; receiving a second data packet, the second data packet for use by the receiver, in conjunction with the first data packet, to generate an enhanced version of the viewable image; and sending to the transmitter a request for additional packets.
BRIEF DESCRIPTION OF THE DRAWINGS
Preferred embodiments of the present invention will now be described, by way of example only, with reference to the attached Figures, wherein: Figure 1 shows a system for the creation, transmission, and rendering of streamed multimedia data;
Figure 2 shows a data format to store multimedia data;
Figure 3 shows a method for creation of multimedia data;
Figure 4 shows a method for front half compression and encoding;
Figure 5 shows a method for the diagonal flip step; Figure 6 shows a method for front half data packing; Figure 7 shows a method for front half decompression and decoding; Figure 8 shows a method for Picture Substream sample; Figure 9 shows pixel numbering; and Figure 10 shows a tiled representation.
DETAILED DESCRIPTION
The present invention is a system and method that caters to the digital convergence industry. It is a format and a set of tools for content creation, content transmission, and content presentation. Digital convergence describes how consumer electronics, computers, and communications will eventually come together into being parts of the same thing. The key to this happening is the digital media: the formats in which sound, video, and any other related data are recorded, converted, and transmitted from source to destination.
As used herein, the following terms are defined:
• Packet = a single unit of data that can be sent unreliably across the Internet. • Frame = one full picture from the video.
• Video = a series of frames that need to be shown at a certain rate to achieve the illusion of continuous motion.
• A/V = Audio / Video = Audio and video put together into a single stream.
• MPEG = Moving Picture Experts Group. This is an ISO standards workgroup that concentrates on moving images.
• JPEG = Joint Photographic Experts Group. A similar ISO standards workgroup that concentrates on still images.
• Bit = single unit of information consisting of either a 1 or a 0.
• Octet = Byte, or 8 bits of information • Ethernet Frame = IEEE 802.x based unit of data, which is maximum 1500 octets of information.
• IP Packet = a single unit of data that can be sent unreliably across the Internet. An IP packet may be broken down into smaller IP packets by a process called IP fragmentation, which is undone by the receiver using a process called IP reassembly.
• Image Frame = a single full picture from the video stream. Cinema motion picture is comprised of 24 image frames per second. NTSC television is 29.97 image frames per second. Image Frame is used in this document to differentiate it from an Ethernet Frame.
• Image Frame Rate = the rapid presentation of a succession of image frames to a viewer that achieves the illusion of motion.
In order to eradicate pre-buffering and re-buffering, the present invention comprises a different protocol that uses the base behaviour of the Internet fabric as the base of its design. This new protocol uses a 'many data packets to one acknowledgement packet' protocol, unlike TCP, which uses a 'single data packet to one acknowledgement packet' protocol. For example, a client might send an acknowledgement and request for additional data to the server only once every second, whereas many data packets would be received by the client during that one second period. The client would also send information to the server based on current conditions. For example, if the screen displayed to the user is relatively small then there is a limit on the amount of resolution required, and the server can be asked to send an image in an appropriate fashion. Another example is the rate of information that can be successfully received by the client over the Internet at that point in time. For example, if the image frame rate used for encoding is 30 images per second but the connection only allows a maximum rate of 15 frames per second, then the server will be asked to transmit at this lower rate. In addition, the client can ask the server to transmit every other frame so that there is no lag between the rate of display of the image to the user and the intended image rate.
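As a rough illustration of this 'many data packets to one acknowledgement packet' behaviour, the following Python sketch shows a client that drains incoming UDP packets continuously but contacts the server only once per second with a single consolidated request. All names, message fields, and the server address here are hypothetical illustrations, not taken from the specification.

```python
import json
import socket
import time

SERVER = ("127.0.0.1", 5000)   # hypothetical relay address, for illustration only
ACK_PERIOD = 1.0               # one consolidated request per second

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.settimeout(0.05)          # short timeout keeps the ack timer responsive
sock.sendto(b'{"cmd": "subscribe"}', SERVER)   # initial request opens the stream

received = 0
last_ack = time.monotonic()
while True:
    try:
        packet, _ = sock.recvfrom(2048)        # many data packets arrive here
        received += 1                          # (hand each one to the decoder)
    except socket.timeout:
        pass
    if time.monotonic() - last_ack >= ACK_PERIOD:
        # One acknowledgement summarizes a whole second of reception and
        # carries the client's current conditions (display size, usable rate).
        request = {"cmd": "ack", "packets_seen": received,
                   "max_width": 360, "max_fps": 15}
        sock.sendto(json.dumps(request).encode(), SERVER)
        received = 0
        last_ack = time.monotonic()
```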
The protocol also uses only one Internet Address and Port pair per connected client, unlike RTP/RTCP/RTSP, which uses one pair for setup, teardown, and control of a session, and an additional pair or pairs for the actual data transmission. The use of a single set of addresses is faster and consumes fewer resources.
The client gets information from a server in a request that maximizes data efficiency, which we define as the ratio of the useful information received to the total information received. In this regard, data that is too late to be rendered in time, partial information that cannot be rendered without some additional data, and redundant information that is received but is useless because it is already present at the receiver, will reduce data efficiency.
Clients request data from a server, which in turn gets its information from the data source. Clients do the work of calculating the data that is required and of prioritizing the data they need to request, which they send to the server on a periodic schedule. The server, after it receives a request and has validated it as a legitimate request, simply fulfils the request and sends the data back along the same channel that the request came in on, at the exact data rate that the client requested. Any newer request immediately supersedes an older request at the server.
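A minimal sketch of the 'newer request supersedes older' rule on the server side might look as follows; validate and slice_data are hypothetical placeholders for the authentication and data retrieval steps.

```python
import threading

pending = {}             # client address -> latest validated request
lock = threading.Lock()

def validate(request):
    # Placeholder authentication/sanity check; a real relay would do more.
    return isinstance(request, dict) and "want" in request

def slice_data(request):
    # Placeholder: pull the exact bytes the client asked for from storage.
    return [b"frame-data"] * request["want"]

def on_request(addr, request):
    if not validate(request):
        return                       # discard requests that fail validation
    with lock:
        pending[addr] = request      # a newer request overwrites an older one

def fulfil_once(send):
    # Fulfil only the single most recent request per client, then forget it.
    with lock:
        work = dict(pending)
        pending.clear()
    for addr, request in work.items():
        for chunk in slice_data(request):
            send(addr, chunk)        # data goes back on the same channel

on_request(("10.0.0.7", 6000), {"want": 3})
fulfil_once(lambda addr, chunk: print(addr, chunk))
```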
The data requested by the client is prioritized in accordance with a model of human perception. The prioritization method of the present invention comprises the following (an illustrative sketch follows this list): • Audio data is received before video data, because it is more important for the audio to be continuous, even at the cost of losing an image frame of video.
• Higher priority audio data is requested before lower priority audio data.
• Higher priority video data is requested before lower priority video data.
• Image frame rate may be scaled back to accommodate a smaller transmission channel. • Picture quality may be scaled back to accommodate a smaller transmission channel. • Audio quality in bit rate, or number of channels (stereo / mono) may be scaled back to accommodate a smaller transmission channel.
• The request is tailored to current Internet conditions between the server and the client, so that a complete stop in the playback of the program occurs only as a last resort, in the case of a catastrophic and long-term stoppage in the continuity of the Internet fabric.
• The system continues to function properly even if the occasional request packet or data packet is lost by the system.
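By way of illustration, the ordering implied by these rules can be expressed as a simple sort key, audio before video and higher priority before lower; the item names below are hypothetical.

```python
def priority_key(item):
    # Audio sorts before video; within each class, level 0 is highest priority.
    return (0 if item["kind"] == "audio" else 1, item["level"])

wanted = [
    {"kind": "video", "level": 1, "what": "back half, frame 42"},
    {"kind": "audio", "level": 1, "what": "BHAD, t = 1.0-2.0 s"},
    {"kind": "video", "level": 0, "what": "front half, frame 42"},
    {"kind": "audio", "level": 0, "what": "FHAD, t = 1.0-2.0 s"},
]
for item in sorted(wanted, key=priority_key):
    print(item["what"])   # prints FHAD, BHAD, front half, back half, in that order
```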
The present invention also requires a new format for the multimedia data that has unique characteristics. It is desirable that the data be stored at the information distribution point (the server) in a form that has the following characteristics:
• The compressed audio data can be quickly parsed so that the data corresponding to any particular time and duration in the program can be easily found.
• This compressed audio data is packed into chunks of information that do not rely on any other data chunk being previously received. • The compressed video information can be quickly parsed so that compressed data corresponding to any image frame can be easily found.
• In addition, the compressed image frame data can be further parsed so that a low quality representation of the image frame, or a smaller sized representation of the image frame, can be reconstructed by selectively removing data. • In addition, modifying this previously described image into a full quality or a full sized image simply requires the previously removed data to be applied.
MPEG and H.320 / T.120 do not fit these criteria as they are inter-frame based, i.e. only the differences between successive image frames are recorded. Producing a single random image frame would require that the full base frame (I frame) be found and decompressed, and all required inter-frame difference patches (P frames, B frames, and BP frames) be applied. Also, full quality at full size is the only choice for the data for the whole program. Of course the entire program could be encoded small, or at a lower quality, but the size and quality level is decided at encode time, and there is no provision to switch on the fly once a program is requested. The stream is tolerant of bit errors, and is coded for the bit rate that it expects to be available in a continuous fashion. This is the case for terrestrial or satellite digital television transmission, but is not the case for the Internet, which is packet based, and is highly unpredictable on a packet-by-packet basis.
A moving still picture compression format is required for this to work. Several examples of such a format and system for video and audio that fit these criteria are described, based on standard continuous tone compression algorithms such as JPEG, JPEG2000, PNG, and GIF89. A further example data format is provided in the detailed description.
Although we describe in the examples an implementation over the Internet communications fabric, this invention describes an approach applicable to any non-acknowledged and unreliable transmission mechanism. The system of the present invention encodes the information at the highest size and quality that will be requested by any client. It is then encoded in a way that makes it easy during the playback for a client to downscale the quality or size based on the conditions of the network.
In order to achieve this, the widely used conventional approach of inter-frame coded MPEG had to be abandoned. This instantly eliminates the requirement of a reliable transport mechanism such as TCP/IP.
Instead, the present invention relies on a transport mechanism having a single bidirectional channel with more or less uniform propagation time from sender to receiver, and the reverse. Straight unicast and multicast UDP/IP are examples, although we are not limited to these at all. Instead of an inter-frame approach, the present invention uses a moving still picture compression format. As most motion pictures are originally on film stock, any continuous tone image compression algorithm, such as, but not limited to, the discrete cosine transform (DCT) in JPEG, or the wavelet image compression algorithm in JPEG2000, can be used to compress the frames. The audio corresponding to the film is captured separately, and then encoded within the frame data as comment fields. The audio can be encoded by any audio compression algorithm. Referring to Figure 1, a system according to an embodiment of the present invention has a multimedia content creation apparatus capable of receiving audio and video data. Multimedia data created by the content creation apparatus can be transmitted to and stored in data storage devices. The created multimedia data may be stored alongside data from other sources.
The multimedia data is accessible by a server apparatus or server having a multimedia transmission device, such as a modem, for transmitting data across a telecommunications network, such as the Internet, to a client apparatus or client. According to the present embodiment, the data is transmitted over the Internet using one or more relays or routers. These relays, however, are not necessary, and a direct connection, including a wireless connection, between the server and the client is possible as well. The client includes a data receipt device, such as a modem, and a multimedia rendering apparatus that includes a display device such as a conventional monitor or a wide screen projector. The source of the program is either the point where data is encoded in a live stream, or the point where the program data is controlled and then distributed to all relays, which then feed all reception devices. Relay devices need to subscribe to either this source device or to another relay, and may, through a process of trans-coding, convert from a source format of data into the form of data required by this invention. In a special case, the relay device may also be the source device, but that is not required.
Relay devices listen for control requests from any display devices on a single UDP address/port destination pair. These requests may be authenticated and subsequently discarded if they do not pass the authentication. Validated requests are entered into a work queue for fulfillment.
Fulfillment of a request consists of retrieving the exact data that is requested by the display device, and sending it in a controlled schedule back through the same channel that the request was received on.
This utilization of a single channel for control in one direction and data in the opposite direction is different from RTP/RTSP/RTCP and FTP, which require a separate control channel apart from any data channels. HTTP, which is used by the "World Wide Web" (www), uses one channel for both control in one direction and data in the opposite direction; however, HTTP uses TCP instead of UDP. In order to make such a system efficient, we need to store the information at these relay devices in a format that makes it possible to obtain a subset of the full data such that a smaller image, a lower quality image, or a reduced frame rate can be rendered efficiently with minimal calculations at the server. In addition, the data format should be such that another subset of this data can be obtained that, applied in conjunction with the previously mentioned smaller or lower quality image, results in a larger or higher quality image. Faster frame rates can be achieved by simply requesting more images. This initial smaller, lower quality image data subset is termed the "Front half data", and the data that needs to be applied to this to obtain the larger or higher quality image is termed the "Back half data".
This system requires the use of a still image continuous tone compression algorithm, such as
JPEG, MPEG (I-FRAME ONLY), or JPEG 2000. Of these, only JPEG 2000 has the inherent capability to obtain subsets of data at a reduced quality level or of a smaller size, although further improvements can be made that use "hints" stored within the image as a series of binary comment fields.
The audio corresponding to the film is captured separately, but time coded to match, and then encoded using an audio compression algorithm such as MPEG-2 Layer 3. The requested time period of audio data needs to be quickly obtainable from within this set of audio data, and may be further transformed at transmission time with a forward error correction algorithm that allows for immediate recovery in the case of single audio packet loss. Audio packets may be further encoded as comment fields within the previously mentioned image packets and sent packaged with the motion picture, but separate audio and picture packets may be used as well. All data received at the reception device is buffered in a manner that allows for a periodic update control signal to be sent to the relay device, which supersedes any previously sent control signal. The reception device processes information regarding the received data, and decides the priority of the next required data packets. This prioritization is done by the reception device in order to ensure that at all times the data playback will not run out of information. Immediately at the start of playback, or after a change in channel, the buffers at the reception device are purged. Only audio data is requested in a burst in order to prime the audio playback pump. Front half data at a low frame rate is then primed into the buffers so that at all times there is some movement that can be rendered without a stop in motion. When a long enough buffer of small, low quality pictures at a slow frame rate is achieved, then back half data is requested as well as front half data at a higher frame rate. The basic aim of this prioritization scheme is to maximize the use of the buffers within the routing system of the Internet, in order to present an experience similar to standard television.
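The specification does not name a particular forward error correction scheme; one minimal scheme with the stated property, immediate recovery from a single lost packet in a group, is an XOR parity packet, sketched here under that assumption.

```python
from functools import reduce

def xor_bytes(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

def add_parity(group):
    # Pad packets to equal length, then append one XOR parity packet.
    size = max(len(p) for p in group)
    padded = [p.ljust(size, b"\x00") for p in group]
    return padded + [reduce(xor_bytes, padded)]

def recover(group):
    # XOR of all surviving packets (parity included) equals the lost one.
    return reduce(xor_bytes, (p for p in group if p is not None))

pkts = add_parity([b"audio-0", b"audio-1", b"audio-2"])
pkts[1] = None              # simulate losing one audio packet in transit
print(recover(pkts))        # b'audio-1'
```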
Optionally, higher security can be obtained in a graded fashion, so that there may be encryption of the data at storage or during transmission, and the back half data may be set at a higher security level than the front half data.
A further innovation is to flip the image diagonally prior to compression. This innovation is useful only because most image file formats store data sequentially in the file in raster order, i.e. rows, not columns. The aim is to obtain the horizontal subset data without requiring a decompression and recompression of the full image. The effect of this is that a column of pixels is transformed into a row of pixels, which corresponds to a conventional row oriented storage of pixels in memory. What this means is that instead of the sequential information in the file corresponding to the picture information from top left to bottom right of the image by rows, the information corresponds to picture data from top left to bottom right by columns. Immediately prior to final rendering, the image flipping is undone for the small subset image. This introduces further work at both the compression step and the decompression step, but it facilitates left to right reduction of the image without the server having to decompress the entire image.
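In array terms the diagonal flip is simply a transpose, as this small sketch shows:

```python
import numpy as np

image = np.arange(12).reshape(3, 4)   # toy image: 3 rows x 4 columns
flipped = image.T                     # diagonal flip: columns become rows
print(flipped[0])                     # first row of flipped = first column of image
assert (flipped.T == image).all()     # the flip is its own inverse
```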
This diagonal flip is optional and need not be present to practice the invention. It is, however, desirable as it permits data corresponding to a vertical slice of a wide horizontal image to be obtained more efficiently. For example, if the image is a panoramic view of nearly 360 degrees but only a 45 degree slice centred on north is desired, then this innovation permits its efficient extraction.
Another coding innovation is to encode a reduced resolution image with a low bit rate coded audio signal as the front half of the encoded frame data. The information required to modify this image into a higher resolution and higher quality image, as well as the corresponding high frequency encoded audio is encoded as differences in a back half of the encoded data. The back half components can be encrypted at high security levels, which allows for a lower quality rendition to be available for a lower price, etc.
According to an aspect of the present invention, a receiver receives from a transmitter multimedia data through an unacknowledged unreliable packet based transmission medium. The image data can be provided by first and second packets. The first packet contains data corresponding to a portion of an image. The second packet contains data corresponding to the remainder of the image. The receiver uses the first packet to generate the portion of the image which is typically the originally captured image but reduced in quality or size or resolution. If the receiver receives first and second packets then the receiver uses the data in the first and second packets to generate an enhanced image corresponding to the originally captured image. In addition, audio data or audio information can be transmitted by packets. Typically, audio data corresponding with the image data and for synchronous reproduction therewith is transmitted by a set of one or more packets. This set of one or more packets can be distinct from the first and second packets for the image or the set of one or more packets can include the first packet, the second packet or both. Similarly, ancillary multimedia data and tertiary information can be sent by another set of one or more packets which may or may not be the same as the first set of one or more packets and may or may not include either of the first and second packets.
Example:
The following detailed example illustrates an embodiment of the invention and includes a possible data format scheme. Consider a motion picture with audio signal at 24 frames a second. The multimedia data format of these frames is illustrated in Figure 2.
Capture the audio at 44.1 KHz Stereo PCM digital data format. This is referred to as the Raw Audio Data (RAD). Convert the audio signal into a 44.1KHz Stereo PCM audio data file. This is referred to as the Back Half Audio Data (BHAD). Convert the audio signal into an 11KHz Mono PCM audio data file. This is referred to as the Front Half Audio Data (FHAD).
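A minimal sketch of deriving the FHAD from the RAD, assuming the 11KHz figure means a simple decimation by four of the 44.1 kHz signal (a production encoder would low-pass filter first to avoid aliasing):

```python
import numpy as np

rad = np.zeros((44100, 2), dtype=np.int16)   # one second of 44.1 kHz stereo PCM
mono = rad.astype(np.int32).mean(axis=1)     # mix the two channels down to mono
fhad = mono[::4].astype(np.int16)            # keep every 4th sample: ~11.025 kHz
print(fhad.shape)                            # (11025,)
```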
Capture the video at 24 frames per second into 720 x 576 pixel progressive picture files. Each pixel is RGB888 (CCIR601), which means that there are 8 bits of precision in the capture of each of the Red, Green, and Blue channels according to a color profile supplied by the Digital TV industry committee (CCIR). This is chosen because it is one of the standard MPEG 2 video profile sizes and formats. Referring to Figure 10, the image consisting of RGB pixels is depicted using a tiled representation. This will be referred to below in computing the YUV representation of the images. Note that each pixel is represented by a red (R), green (G) and blue (B) component. For example, in Figure 10, R11 represents the red component of the pixel in the upper left hand corner located at (1,1). A standard movie encoded to 30 frames per second for television usually goes through a process called 3/2 pulldown, which means that every fourth frame is doubled. This means that no extra information is being conveyed in that repeated frame, so we might as well capture only the 24 unique frames. A single frame of this information is referred to as the Raw Video Data (RVD), and all these frames collectively are referred to as the Raw Video Data Stream (RVDS). Referring to Figure 5, each frame is noise filtered and diagonally flipped to become a new image where the horizontal lines correspond to the columns of the original image. Referring to Figure 8, if there is any black band removal on a frame by frame basis, it is done at the same time as this step. This is referred to as the Flipped Video Data (FVD).
Referring to Figures 4 and 9, the FVD is converted into a new image that is half the width and half the height by a process of collecting every other pixel. It is important that each pixel is collected and not averaged with adjoining pixels. This frame of information is referred to as the Front Half Video Data (FHVD), and is then converted into YUV format. In this example it is the lower right pixel of each 2 by 2 block that is collected.
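As a sketch, the collection step is a strided slice; numpy shapes are (rows, columns), so the flipped 576x720 image of this example is a (720, 576) array, and with 0-based indexing the '22' pixel of each 2 by 2 block sits at an odd row and odd column:

```python
import numpy as np

fvd = np.zeros((720, 576, 3), dtype=np.uint8)   # flipped RGB frame (rows, cols, RGB)
fhvd_rgb = fvd[1::2, 1::2, :]                   # lower-right pixel of each 2x2 block
print(fhvd_rgb.shape)                           # (360, 288, 3): half of each dimension
```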
The pixels that have not been collected into the FHVD are collected and encoded separately. This new representation of the data is referred to as the Back Half Video Data (BHVD), and consists of four planes: the delta left intensity plane (dLYP), the delta right intensity plane (dRYP), the delta U plane (dUP) and the delta V plane (dVP).
These last two steps are detailed as follows.
(a) Divide the FVD into 2 by 2 blocks. These are pixels with identifiers 11, 12, 21 and 22 based on their Cartesian coordinates. The RGB values of each pixel are R(xy), G(xy), and B(xy), where xy is the pixel identifier.
(b) Compute the YUV representation of all four pixels into Y(xy), U(xy) and V(xy) using the matrix conversion formula as follows:

[ Y ]   [ 0   ]   [ +0.29  +0.59  +0.14 ]   [ R ]
[ U ] = [ 50% ] + [ -0.14  -0.29  +0.43 ] x [ G ]
[ V ]   [ 50% ]   [ +0.36  +0.29  -0.07 ]   [ B ]
(c) Calculate the delta values of the YUV data values with respect to pixel 22, the bottom right pixel. This gives the following delta values:

dY(11) = Y(11) - Y(22)
dY(12) = Y(12) - Y(22)
dY(21) = Y(21) - Y(22)
dU(11) = U(11) - U(22)
dU(12) = U(12) - U(22)
dU(21) = U(21) - U(22)
dV(11) = V(11) - V(22)
dV(12) = V(12) - V(22)
dV(21) = V(21) - V(22)
(d) Average the delta U values to get dUavg:

dUavg = [ dU(11) + dU(12) + dU(21) ] / 3
(e) Average the delta V values to get dVavg:

dVavg = [ dV(11) + dV(12) + dV(21) ] / 3
(f) Collect all left side Y pixel delta values, dY(11) and dY(21), into a plane, and refer to it as the delta left intensity plane (dLYP),
(g) Collect all upper right Y pixel delta values dY(12) into a plane, and refer to it as the delta right intensity plane (dRYP),
(h) Collect all dUavg values into a plane and refer to it as the delta U plane (dUP), (i) Collect all dVavg values into a plane and refer to it as the delta V plane (dVP). Using our original (720x576) pixel picture size, the flipped image FVD would be (576x720).
This would mean that dLYP is (288x720), dRYP is (288x360), dUP is (288x360) and dVP is (288x360) in planar image size. In this example each plane has elements that have eight (8) bits of precision. That is done for efficiency of implementation in software and should not be a restriction on hardware implementations. Each plane is put through a continuous tone grey scale compression algorithm, such as a single plane JPEG. Prior to this, though, the FHVD, dLYP, dRYP, dUP, and dVP are divided into horizontal bands, which correspond to vertical bands of the original image. If there were four bands with our (720x576) example, then the FVD of (576x720) becomes a FHVD of (288x360) consisting of four bands each sized (288x90). It is allowable to have a single band encompassing the entire image, and for efficiency it is suggested that a power-of-two number of bands be used. The FHVD is compressed in the three equally sized component planes of YUV using a continuous tone image compression algorithm such as, but not limited to, JPEG. Each of these planes is (288x360).
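Steps (a) through (i) can be sketched in vectorized form as follows. The conversion matrix copies the coefficients exactly as printed above (standard CCIR601 coefficients differ slightly), the '50%' offsets are taken as 128 for 8-bit data, and the plane shapes below are numpy (rows, columns) counterparts of the (width x height) sizes given in the text.

```python
import numpy as np

M = np.array([[+0.29, +0.59, +0.14],    # coefficients as printed in the text
              [-0.14, -0.29, +0.43],
              [+0.36, +0.29, -0.07]])
OFFSET = np.array([0.0, 128.0, 128.0])  # '50%' offsets on U and V for 8-bit data

def delta_planes(rgb):
    """rgb: (H, W, 3) float array of the flipped frame, H and W even."""
    yuv = rgb @ M.T + OFFSET            # per-pixel matrix conversion (step b)
    y, u, v = yuv[..., 0], yuv[..., 1], yuv[..., 2]
    # Block corners: 11 top-left, 12 top-right, 21 bottom-left, 22 bottom-right.
    y22, u22, v22 = y[1::2, 1::2], u[1::2, 1::2], v[1::2, 1::2]
    dY11, dY12, dY21 = y[0::2, 0::2] - y22, y[0::2, 1::2] - y22, y[1::2, 0::2] - y22
    dUP = (u[0::2, 0::2] + u[0::2, 1::2] + u[1::2, 0::2] - 3 * u22) / 3   # step d
    dVP = (v[0::2, 0::2] + v[0::2, 1::2] + v[1::2, 0::2] - 3 * v22) / 3   # step e
    dLYP = np.empty((dY11.shape[0] * 2, dY11.shape[1]))                   # step f
    dLYP[0::2], dLYP[1::2] = dY11, dY21   # both left-column deltas, interleaved
    return dLYP, dY12, dUP, dVP           # dLYP, dRYP (step g), dUP, dVP

planes = delta_planes(np.zeros((720, 576, 3)))
print([p.shape for p in planes])   # [(720, 288), (360, 288), (360, 288), (360, 288)]
```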
The FHVD and the FHAD are interleaved with frame specific information such that the audio data, video data and padding are easily parsable by a server application. This is referred to as the Front half data (FHDATA). In the case that JPEG was used, this FHDATA should be parsable by any standard JPEG image tool, with any padding, extra information, and audio discarded. This image is of course diagonally flipped, and needs to be flipped back. The FHAD is duplicated across a range of successive corresponding frames. This is so that only one of a sequence of successive frames needs to be received in order to be able to reproduce a lower quality continuous audio representation. The BHVD and BHAD are stored following the FHDATA in a way that lets the server easily pull individual bands of the information out from the data. The BHAD is duplicated across a successive range of corresponding frames. This is similar to the FHAD in the FHDATA, but the difference is in how redundant the information is when dealing with high frequency data. The aim is to have some form of audio available as the video is presented. The BHVD and BHAD interleaved in this form is called the back half data (BHDATA).
A frame header (FRAMEHEADER), the FHDATA and the BHDATA put together is the complete frame data (FRAMEDATA).
A continuous stream of FRAMEDATA can be converted to audio and video. This is referred to as streamed data (STREAMDATA). A subset of FRAMEDATA can be constructed by the video server device. This is referred to as subframe data (SUBFRAMEDATA), and a continuous stream of this information, decimated accordingly, is referred to as subsampled stream data (SUBSTREAMDATA).
A collection of FRAMEDATA with a file header (FILEHEADER) is an unpacked media file (MEDIAFILE), and a packed compressed representation of a MEDIAFILE is a packed media file (PACKEDMEDIAFILE). The server apparatus will read a MEDIAFILE, or capture from a live video source, and create a STREAMDATA that goes to a relay apparatus.
A client apparatus contacts a relay apparatus and requests a certain STREAMDATA. Through a continuous feedback process, the relay will customize a SUBSTREAMDATA based on the current instantaneous network conditions and the capabilities of the client apparatus, and by specific user request such as, but not limited to, pan and scan locations.
SUBFRAMEDATA is created from the FRAMEDATA by a process of decimation, which is the discarding of information selectively. The algorithm for discarding is variable, but the essence is to discard unnecessary information, and least perceivable information first.
Only complete SUBFRAMEDATA elements that are reliably received in their entirety are rendered. All others are discarded and ignored. The rendering step is as follows:
The audio data is pulled from the SUBFRAMEDATA. If BHAD exists, then it is stored accordingly. FHAD always exists in a SUBFRAMEDATA and is stored accordingly.
Referring to Figure 7, the FHVD, which is always available, is decompressed accordingly into its corresponding YUV planes. This is stored accordingly.
The BHVD, if it is available, is used to create a decompressed full size image using the following algorithm:
(a) Reverse the continuous tone compression algorithm so that there are reconstructed [dRYP], [dLYP], [dUP], and [dVP] (square braces are used to indicate reconstructed values). (b) Values from each plane and from the FHVD are interleaved into a YUYV data block:

From [FHVD]: [Y22] [U22] [V22]
From [dLYP]: [dY11] [dY21]
From [dRYP]: [dY12]
From [dUP]: [dU]
From [dVP]: [dV]
[Y11] = [dY11] + [Y22]
[Y12] = [dY12] + [Y22]
[Y21] = [dY21] + [Y22]
[U1] = [dU] + [U22]
[U2] = [U22]
[V1] = [dV] + [V22]
[V2] = [V22]
[Y11] [U1] [Y21] [V1] is the YUYV representation of the left two pixels; [Y12] [U2] [Y22] [V2] is the YUYV representation of the right two pixels.
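A per-block sketch of these additions, using plain floats for one 2 by 2 block (the values are hypothetical):

```python
def rebuild_block(y22, u22, v22, dY11, dY12, dY21, dU, dV):
    # Add the decoded deltas back onto the collected front-half pixel.
    y11, y12, y21 = dY11 + y22, dY12 + y22, dY21 + y22
    u1, u2 = dU + u22, u22          # left / right chroma samples (4:2:2)
    v1, v2 = dV + v22, v22
    left = (y11, u1, y21, v1)       # YUYV for the two left pixels
    right = (y12, u2, y22, v2)      # YUYV for the two right pixels
    return left, right

print(rebuild_block(100.0, 128.0, 128.0, 5.0, -3.0, 2.0, 4.0, -1.0))
# ((105.0, 132.0, 102.0, 127.0), (97.0, 128.0, 100.0, 128.0))
```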
All pixels are collected together into an intermediate frame. This intermediate frame is then put through the final reconstruction step of reversing the diagonal flip with another diagonal flip of the picture elements. Following this step, the columns of YUYV data calculated above are now rows of YUYV data, in the exact format that computer video card overlay surfaces require.
During this reverse diagonal flip, an optional filtering step can be done to further remove any visual artifacts introduced during compression and decompression.
The available image is displayed at the appropriate time in the sequence. If high quality audio is available, then it is played on the audio device, otherwise the lower quality audio sample is used.
The client monitors the number of frames that it managed to receive and to decompress and process. This is reported back to the server, which then scales up or down the rate and the complexity of the data that is sent. According to the present example, the client will send a request for additional information every second. Of course, other schemes can be used.
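One plausible shape for this feedback loop (the thresholds and step sizes here are invented for illustration):

```python
def adapt_rate(sent_fps, report, max_fps=30, min_fps=1):
    # 'report' is the client's once-per-second summary of frames received
    # and frames actually decoded/processed.
    delivered = min(report["received"], report["decoded"])
    if delivered < 0.9 * sent_fps:           # channel or client is struggling
        return max(min_fps, sent_fps // 2)   # back off quickly
    if delivered >= sent_fps:                # everything arrived and decoded
        return min(max_fps, sent_fps + 2)    # probe upward gently
    return sent_fps

fps = 30
for report in [{"received": 14, "decoded": 14}, {"received": 15, "decoded": 15}]:
    fps = adapt_rate(fps, report)
    print(fps)   # 15, then 17
```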
Note that in the example, even if a data packet is corrupt, the user might not necessarily notice. For example, if the data packet contains only back half data then the front half data can be shown to the user even though the quality of the image is reduced. Even if a packet containing front half data is corrupt, it may be possible to obtain usable audio sound by performing forward error correction on the audio data.
The previous example illustrated a possible embodiment of the invention. The invention, however, is not limited to the example embodiment as discussed. Some of the possible variations are now presented.
The data format illustrated in the above example had front half video data consisting of one quarter of the pixels in an image, with the back half video data comprising the remaining pixels. Thus a lower quality image (consisting of the front half video) can be extracted without decompressing the entire image. Alternatively, a higher quality image can be displayed by adding the back half video data to the front half video data to recover the image as originally encoded. This arrangement is an example of how a subset of the full image data can be extracted, with a lower image quality or a smaller size, without requiring a full decompression of the image as a preliminary step. Through similar calculations, the remaining data may be requested later, which when applied as differences to the earlier subset data will result in the full quality, full sized image. Another example can be found in JPEG 2000, where it is possible to extract a subset of the image data to obtain either a lower quality image, a smaller image, or a sequence at a lower frame rate.
There are two levels of audio and video in the example. This algorithm can be extended to three levels by having a front third, middle third, and back third. The server can send either the front third, the front two thirds, or the whole encoded frame, as desired. Other variants, including additional levels, are also possible. As will be apparent to those of skill in the art, television variants, such as 29.97 fps, 30 fps, and 25 fps, can be downscaled to 24 frames per second by frame decimation (throwing away frames). 30 fps is another ideal frame rate for storage, as it can easily be decimated to many lower frame rates, and there is very little difference in perception to the average human eye.
Any continuous tone compression algorithm can be substituted for the DCT in JPEG. Suggested alternatives are wavelet image compression or fractal image compression.
Any audio rate and multispeaker/stereo/mono/subwoofer combination can be used for the high quality and low quality audio signal.
Any rectangular picture size is possible. In particular, 16:9 width-to-height picture ratios of theatrical releases can be captured using a square pixel or a squashed pixel. Black band removal can be done either on a frame by frame basis, or across the whole video stream.
Any capture rate is possible.
Postbuffering can be done by the relay, so that the last n FRAMEDATA elements are stored. Any new client can have these FRAMEDATA or SUBFRAMEDATA burst across the communication channel at maximum rate, to show something while the rest of the information is being prepared. Other data that can be encoded using the present invention includes tertiary information and other multimedia types, such as text, force feedback cues (e.g. for selectively controlling an ancillary multimedia device), closed captioning, etc.
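The postbuffering described at the start of this paragraph amounts to a bounded ring buffer at the relay; a sketch, with the depth chosen arbitrarily:

```python
from collections import deque

POSTBUFFER_FRAMES = 48                        # assumed depth: 2 s at 24 fps
postbuffer = deque(maxlen=POSTBUFFER_FRAMES)  # oldest frames fall off automatically

def on_frame(framedata):
    postbuffer.append(framedata)

def on_new_client(send):
    for framedata in postbuffer:              # burst recent frames at maximum rate
        send(framedata)

on_frame(b"FRAMEDATA-1")
on_frame(b"FRAMEDATA-2")
on_new_client(print)
```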
Multiple languages can be encoded and stored within this format, and selectively requested.
Even higher audio representations, such as Dolby Surround sound 5.1 channel AC3 format encoding can be selectively requested as well if high enough bandwidth and audio processing facilities exist at the client end.
The client device can send multiple cues and requests. If the source is encoded appropriately, then multiple angle shots can be stored for either a client controlled pan, or as a client controlled position around a central action point. There is a mechanism for selectively requesting computer generated video streams to be created and presented based on user preferences.
According to another embodiment of the present invention, a method of multimedia transmission comprises: sending a signal from client to server specifying the line conditions for multimedia rendering, so that the multimedia data that is supplied can be modified as conditions change. The signal specifies the method by which the full multimedia data is reduced into a fully presentable subset depending on line conditions, direct user control, and demographic positioning. Direct user control of the requested multimedia data can also be provided by the transmission server, so that the audio can be modified via mixing, equalization, computer controlled text to voice additions, and language selection. The signal can also specify a demographic of the audience. The signal can also contain encryption and authentication data so that the client is identified and is provided multimedia data in accordance with the exact request of the audience.
The signal is transmitted through an unpredictable and unreliable communication channel in such a way that acknowledgement is required based on time elapsed rather than on the amount of information received. The signal is transmitted as full frames of video with sub-sampled redundant sets of audio and text information in such a way that, at any time, the probability that there is some form of playable audio of some quality available is maximized. The signal includes a decimated picture header so that a simplified rendering device can be constructed.
According to a further embodiment of the present invention, a multimedia compression and coding method comprises: capturing and compressing a video signal using a discrete cosine transform based video compression algorithm similar to JPEG, whereby the information is ordered in the multimedia data stream from top to bottom in sets of interleaved columns rather than left to right in sets of progressive rows. The multimedia data stream has sets of columns interleaved into sparse tiles in a way that allows for fast parsing at the transmitter. The multimedia data stream is also stored using interleaved luminance and chrominance values in YUV 4:2:2 format, in variably sized picture element processing tiles that are greater than 8 by 8 byte matrixes, in units that are powers of two, such as but not limited to 64 by 64 matrixes and 128 byte by 128 byte matrixes.
The multimedia data stream is also stored as a lower resolution decimated JPEG image as a header with the required information to reconstruct a higher resolution image stored as a secondary and tertiary set of information in comment blocks of the JPEG, or as additional data elements that may or may not be transmitted at the same time as the header, both put together are termed for this documents as comment type information.
The multimedia data stream has comment type information variably encrypted and authenticated in such a way that the origin of the source and the legitimacy of the video requester can be controlled and regulated. The multimedia data stream has audio, text, and force feedback information encoded as comment type information within the image file, so that standard picture editing software will parse the file, yet not store or extract the additional multimedia information.
The multimedia data stream has audio encoded with variable sampling rates and compression ratios, and then packaged as comment type information in such a way that a long time period of low quality audio and short periods of higher quality audio are redundantly transmitted. In addition, the multimedia data stream has other types of multimedia information, such as but not limited to text and subtext, language and country specific cues, force feedback cues, control information, client side 3-d surface model rendering and texture information, program flow elements, and camera viewpoint information, encoded as comment type information.
According to a still further embodiment of the present invention, a multimedia content creation apparatus comprises software or hardware that can take an industry standard interface for capturing audio, video, and other types of multimedia information, such as but not limited to text and subtext, language and country specific cues, force feedback cues, control information, and client side 3-d surface model rendering and texture information, program flow elements, camera viewpoint information, and then compressing and encoding the information into a multimedia data stream format as described above and then storing the data into a multimedia data store.
According to yet another embodiment of the present invention, a multimedia transmission apparatus comprises a multimedia data store that will, on an authenticated or unauthenticated request, transmit the previously described multimedia data stream to another multimedia transmission apparatus in its entirety. A multimedia transmission relay will, on an authenticated or unauthenticated request, set up a network point that one or many multimedia data stores can transmit to, and from which one or many multimedia rendering apparatus can request said multimedia data. The apparatus can, based on time specified acknowledgement information, modify the information that is presented by a process of parsing, merging, and filtering in such a way that required information is always sent redundantly, and less important information is removed first, based on selection criteria specified by the multimedia rendering apparatus. In addition, the apparatus can collect and store information based on the audience demographic, and may or may not modify the multimedia data stream to accommodate visual cues and market based product placement. Information that has already been sent is post-buffered so that, at the request of the multimedia rendering apparatus, missing information can be retransmitted at faster than real time rates. According to yet still another embodiment of the present invention, a multimedia rendering apparatus comprises: a software program or hardware device that can receive, through some communication channel and in a timely manner, the previously mentioned multimedia data stream, and that will produce a video picture stream and audio stream that can be presented to an audience. The multimedia rendering apparatus can present all other types of multimedia information, such as but not limited to text and subtext, language and country specific cues, force feedback cues, control information, client side 3-d surface model rendering and texture information, program flow elements, and camera viewpoint information.
The multimedia rendering apparatus can be, but need not be, a stand alone application, a plug-in for an existing application, a standalone piece of hardware, or a component of an existing piece of hardware that may or may not have been originally intended for use as a multimedia rendering device but can be easily modified to be such a device.
Advantages of the video compression method and system of the present invention include:
• Allowing motion picture data to be requested by the display device, and said request fulfilled from a source device through an unpredictable and unreliable transmission channel, adapting to the capabilities of the display device and the current conditions of the transmission channel.
• Allowing motion picture playback to begin at the earliest possible time, which is the time required for one round trip of data from the reception device to the data fulfillment device.
• Allowing audio content and video content "channels" to be selected and switched in the time required for a single successful round trip of data requesting the change in "channel" and the fulfillment of the same request.
• Allowing a user selected horizontal subset of a wide horizontal scene to be panned, with only the data pertaining to the horizontal section of interest selectively parsed out from the full data, so that only this subset of data is transmitted.
Advantageously, the video compression method and system according to the invention allows: • multimedia data to be requested by the display device and transmitted through an unpredictable transmission channel, adapting to the capabilities of the display device and the reliability of the communication.
• multimedia data to be encoded in such a way that the rendering of audio can continue in some capacity for a short period of time, at a reduced level in the case when information is sent but not received.
• the multimedia data that the system sends to adapt, by selectively reducing the amount of data in such a way that the least perceived data, such as high frequency audio, higher frame rate, and possibly even stereo separation, is removed from the transmission first.
• multimedia data to be encoded in such a way that multiple levels of audio and video can be reduced to the required level for that particular display device and the current communications capacity with minimal calculations.
• multimedia data to be encoded such that long term archival storage of the full highest quality video and audio is protected by multiple levels of encryption, in a way that the lowest representation of audio and video has minimal or no protection and the highest representation of audio and video has maximum protection.
Although illustrative embodiments of the invention have been described in detail herein with reference to the accompanying drawings, it is to be understood that the invention is not limited to those precise embodiments, and that various changes and modifications can be effected therein by one skilled in the art without departing from the scope and spirit of the invention as defined by the appended claims.

Claims

We claim:
1. A method of encoding a frame of multimedia data from a transmitter to a receiver comprising: encoding a portion of image data of the frame into a first data packet, the first packet for use by the receiver to generate a viewable image upon receipt thereof; and encoding the remainder of the image data into a second data packet, the second packet for use by the receiver to generate, in conjunction with the first data packet, an enhanced version of the viewable image.
2. A method of receiving multimedia data from a transmitter comprising: receiving a first data packet, the first data packet for use by the receiver to generate a viewable image; receiving a second data packet, the second data packet for use by the receiver, in conjunction with the first data packet, to generate an enhanced version of the viewable image; and sending to the transmitter a request for additional packets.
3. The method of claim 2 wherein the request further comprises transmission parameters.
4. The method of claim 3 wherein the transmission parameters are dependent upon the conditions of the medium of transmission.
5. The method of claim 4 wherein the medium of transmission is the Internet.
6. The method of claim 3 wherein the transmission parameters include the conditions of the display surface in size and in visibility.
7. The method of claim 3 wherein the transmission parameters include user selected feedback to select areas, indicate directions, or indicate decisions.
8. A method of encoding a frame of multimedia data from a transmitter to a receiver comprising: encoding a portion of image data of the frame into a first data packet, the first packet for use by the receiver to generate a viewable image upon receipt thereof; encoding the remainder of the image data into a second data packet, the second packet for use by the receiver to generate, in conjunction with the first data packet, an enhanced version of the viewable image; and encoding audio data into a set of packets which are capable of being used by the receiver to generate an audio signal corresponding to an audio track of said multimedia data.
9. The method of claim 8 further comprising encoding tertiary information in the set of packets.
10. The method of claim 8 further comprising encoding tertiary information in a second set of packets.
11. The method of claim 8 wherein the set of packets includes at least one of the first and second data packets.
12. The method of claim 8 wherein the tertiary information comprises force feedback cues for selectively controlling an ancillary multimedia device.
13. A method of receiving and rendering multimedia data by a receiver from a transmitter through an unacknowledged unreliable packet based transmission medium, comprising: receiving at least one data packet containing audio data and image data corresponding to at least one image; reproducing required audio information that is encoded by the audio data; displaying the at least one image at the correct time to synchronize with the audio data; and sending to the transmitter a periodic request for additional packets based on a prioritised list of required packets as part of a set of transmission parameters.
14. The method of claim 13 wherein at least one data packet comprises a first packet containing data corresponding to a portion of at least one image, the first packet for use by the receiver to generate a viewable image and wherein the displayed at least one image is the viewable image.
15. The method of claim 13 wherein at least one data packet comprises a first packet containing data corresponding to a portion of at least one image and a second packet containing data corresponding to the remainder of the at least one image, the first packet for use by the receiver to generate a viewable image and the second packet for use by the receiver to generate, in conjunction with the first packet, an enhanced viewable image, and wherein the displayed at least one image is the enhanced viewable image.
16. The method of claim 15 wherein the enhanced viewable image is larger than the viewable image.
17. The method of claim 15 wherein the enhanced viewable image is better quality than the viewable image.
18. The method of claim 13 wherein the at least one data packet further comprises ancillary multimedia data and the method further comprises reproducing any ancillary multimedia data.
19. The method of claim 13 wherein the request comprises an acknowledgement.
20. The method of claim 13 wherein the request calculates the prioritised list of required packets to display a presentation at uniform quality, uniform framerate, and uniform resolution, but not necessarily the best quality, the best framerate, or the best resolution.
21. The method of claim 13 wherein the prioritised list of required packets favours a continuous supply of audio information before having the best quality or highest framerate, or largest resolution image display.
22. The method of claim 13 wherein the prioritised list of required packets is dependent upon the conditions of the medium of transmission.
23. A method of multimedia compression, comprising the steps of, for each frame: capturing image data for a frame; capturing audio data for the frame; compressing the image data; and encoding the audio data within the comment field of the compressed image data.

24. A method of multimedia compression, comprising: capturing raw audio data for a frame; converting the raw audio data to provide back half audio data and front half audio data; capturing raw video data for a frame; flipping the frame diagonally to provide flipped video data; collecting every other pixel in the flipped video data; encoding the remaining uncollected pixels to provide back half video data; converting the collected pixels to YUV space to provide front half video data; and compressing and storing the back half data using a continuous tone compression algorithm.

25. A system for implementing the method according to claim 23.

26. A system for implementing the method according to claim 24.
PCT/CA2001/000893 2000-06-21 2001-06-21 Audio/video coding and transmission method and system WO2001099430A2 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
AU67228/01A AU6722801A (en) 2000-06-21 2001-06-21 Multimedia compression, coding and transmission method and system

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CA 2312333 CA2312333A1 (en) 2000-06-21 2000-06-21 Multimedia compression, coding and transmission method and apparatus
CA2,312,333 2000-06-21

Publications (2)

Publication Number Publication Date
WO2001099430A2 true WO2001099430A2 (en) 2001-12-27
WO2001099430A3 WO2001099430A3 (en) 2003-02-13

Family

ID=4166574

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CA2001/000893 WO2001099430A2 (en) 2000-06-21 2001-06-21 Audio/video coding and transmission method and system

Country Status (3)

Country Link
AU (1) AU6722801A (en)
CA (1) CA2312333A1 (en)
WO (1) WO2001099430A2 (en)

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5477542A (en) * 1993-03-30 1995-12-19 Hitachi, Ltd. Method and apparatus for controlling multimedia information communication
EP0735774A2 (en) * 1995-03-31 1996-10-02 AT&T IPM Corp. Transmission method and system for JPEG-coded images
WO1998037699A1 (en) * 1997-02-25 1998-08-27 Intervu, Inc. System and method for sending and receiving a video as a slide show over a computer network
EP0884850A2 (en) * 1997-04-02 1998-12-16 Samsung Electronics Co., Ltd. Scalable audio coding/decoding method and apparatus
US6014694A (en) * 1997-06-26 2000-01-11 Citrix Systems, Inc. System for adaptive video/audio transport over a network
WO2000035201A1 (en) * 1998-12-04 2000-06-15 Microsoft Corporation Multimedia presentation latency minimization
WO2000065837A1 (en) * 1999-04-26 2000-11-02 Telemedia Systems Limited Networked delivery of profiled media files to clients

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
ANONYMOUS: "Method to Deliver Scalable Video across a Distributed Computer System" IBM TECHNICAL DISCLOSURE BULLETIN, IBM CORP. NEW YORK, US, vol. 37, no. 5, 1 May 1994 (1994-05-01), pages 251-256, XP002079392 ISSN: 0018-8689 *
KARLSSON G ET AL: "SUBBAND CODING OF VIDEO FOR PACKET NETWORKS" OPTICAL ENGINEERING, SOC. OF PHOTO-OPTICAL INSTRUMENTATION ENGINEERS. BELLINGHAM, US, vol. 27, no. 7, 1 July 1988 (1988-07-01), pages 574-586, XP000069873 ISSN: 0091-3286 *
RADHA H ET AL: "Scalable Internet video using MPEG-4" SIGNAL PROCESSING. IMAGE COMMUNICATION, ELSEVIER SCIENCE PUBLISHERS, AMSTERDAM, NL, vol. 15, no. 1-2, September 1999 (1999-09), pages 95-126, XP004180640 ISSN: 0923-5965 *
SANTA CRUZ D ET AL: "REGION OF INTEREST CODING IN JPEG2000 FOR INTERACTIVE CLIENT/SERVER APPLICATIONS" IEEE WORKSHOP ON MULTIMEDIA SIGNAL PROCESSING. PROCEEDINGS OF SIGNAL PROCESSING SOCIETY WORKSHOP ON MULTIMEDIA SIGNAL PROCESSING, XX, XX, 13 September 1999 (1999-09-13), pages 389-394, XP000925189 *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
SG148844A1 (en) * 2001-02-08 2009-01-29 Nokia Corp Method and system for buffering streamed data
WO2004102949A1 (en) * 2003-05-13 2004-11-25 Medical Insight A/S Method and system for remote and adaptive visualization of graphical image data
EP2472867A1 (en) * 2010-12-30 2012-07-04 Advanced Digital Broadcast S.A. Coding and decoding of multiview videos

Also Published As

Publication number Publication date
CA2312333A1 (en) 2001-12-21
WO2001099430A3 (en) 2003-02-13
AU6722801A (en) 2002-01-02

Similar Documents

Publication Publication Date Title
JP4405689B2 (en) Data transmission
Apostolopoulos et al. Video streaming: Concepts, algorithms, and systems
RU2385541C2 (en) Variation of buffer size in coder and decoder
US7116714B2 (en) Video coding
US7003794B2 (en) Multicasting transmission of multimedia information
US8055974B2 (en) Content distribution method, encoding method, reception/reproduction method and apparatus, and program
US20060146934A1 (en) Video coding
US20180077385A1 (en) Data, multimedia & video transmission updating system
Aksay et al. End-to-end stereoscopic video streaming with content-adaptive rate and format control
US20210352347A1 (en) Adaptive video streaming systems and methods
WO2001099430A2 (en) Audio/video coding and transmission method and system
JP4010270B2 (en) Image coding and transmission device
Cheng et al. The Analysis of MPEG-4 Core Profile and its system design
Lee Scalable video
MING Adaptive network abstraction layer packetization for low bit rate H.264/AVC video transmission over wireless mobile networks under cross layer optimization
Lam Error reduction and control algorithms for MPEG-2 video transport over IP networks
Onifade et al. Guaranteed QoS for Selective Video Retransmission
KR20080027622A (en) Apparatus and method for video on demand service of duplex communication television
MXPA06009109A (en) Resizing of buffer in encoder and decoder

Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A2

Designated state(s): AE AG AL AM AT AU AZ BA BB BG BR BY BZ CA CH CN CO CR CU CZ DE DK DM DZ EC EE ES FI GB GD GE GH GM HR HU ID IL IN IS JP KE KG KP KR KZ LC LK LR LS LT LU LV MA MD MG MK MN MW MX MZ NO NZ PL PT RO RU SD SE SG SI SK SL TJ TM TR TT TZ UA UG US UZ VN YU ZA ZW

AL Designated countries for regional patents

Kind code of ref document: A2

Designated state(s): GH GM KE LS MW MZ SD SL SZ TZ UG ZW AM AZ BY KG KZ MD RU TJ TM AT BE CH CY DE DK ES FI FR GB GR IE IT LU MC NL PT SE TR BF BJ CF CG CI CM GA GN GW ML MR NE SN TD TG

121 Ep: the epo has been informed by wipo that ep was designated in this application
REG Reference to national code

Ref country code: DE

Ref legal event code: 8642

122 Ep: pct application non-entry in european phase
NENP Non-entry into the national phase in:

Ref country code: JP