WO2005041455A1

WO2005041455A1 - Video content detection

Info

Publication number: WO2005041455A1
Application number: PCT/IB2003/050031
Authority: WO
Inventors: Jozef P. Van Gassel; Declan P. Kelly; Jan A. D. Nesvadba
Original assignee: Koninklijke Philips Electronics N.V.
Priority date: 2002-12-20
Filing date: 2003-12-03
Publication date: 2005-05-06
Also published as: AU2003283783A1

Abstract

A method of detecting content in a multimedia signal, the method comprising the steps of providing (201) a multimedia signal comprising video data and corresponding audio data; determining (202) an audio fingerprint data item from a predetermined part of the audio data; comparing (203) the determined audio fingerprint data item with at least one of a number of reference audio fingerprint data items each related to a corresponding content element; and if at least a first one of the number of reference audio fingerprint data items is identified to correspond to the determined audio fingerprint data item, identifying (205) the corresponding related content element as detected.

Description

Video content detection

[1] The invention relates to the detection of content, e.g. commercials, in multimedia signals such as multimedia data streams.

[2] Personal video receivers/recorders, devices that receive, process and/or record the content of broadcast video, are becoming increasingly popular. Modern hard disk based personal video recorder (PVR) devices that are currently available in the market (e.g. TiVo, UltimateTV, EchoStar and ReplayTV) use so-called electronic program guides.

[3] An electronic program guide (EPG) is an application used with digital set-top boxes, modern television sets, video recorders, etc. in order to list current and scheduled programs that are or will be available on each channel together with a short summary or commentary for each program. Hence, an EPG is the electronic equivalent of a printed television program guide.

[4] Typically, an EPG is accessed using a remote control device. Menus are provided that allow the user to view a list of programs scheduled for the next few hours up to the next seven days. A typical EPG includes options to set parental controls, to order pay-per-view programming, to search for programs based on theme or category, and to set up a VCR to record programs.

[5] In the context of video recording, an EPG may be used, sometimes in combination with the personal preferences or profile of the user, to (automatically) schedule recordings of programs selected from a wide range of television channels. This approach of recording television broadcasts has emerged, because of the random accessibility of hard disc drives, thereby creating a number of interesting possibilities, such as the simultaneous recording and playback of programs, the simultaneous recording of multiple programs, etc. Furthermore, the huge storage capacity of current hard disk drives and the availability of consumer priced video encoders are of key importance as well.

[6] These systems, however, suffer from a number of shortcomings inherent to their way of operation. If an EPG is available on a video recording device, the scheduling of programmed recordings relies heavily on the accuracy of the EPG. In the situation where scheduled programs are delayed or broadcast earlier than advertised in the EPG, the programmed recording is disrupted, unless the EPG information consulted by the recording device is updated in due time. Another cause of annoyance to the user is the interruption of an intended recording by commercial blocks or other inserted programs, e.g. weather forecasts or news bulletins.

[7] Furthermore, if for some reason the broadcaster decides to broadcast a program of interest to the user, e.g. as defined by a stored preference list, a user profile, or the like, the PVR is not able to record it unless it is specifically added to the EPG. However, it is a general problem that EPGs are rarely up-to-date to the last minute, because they are often compiled by a third party, i.e. not necessarily the broadcaster or the PVR box manufacturer. The updating process typically takes place via a dial-up connection, i.e. the EPG is only updated periodically. Another complication is the fact that many broadcasters are still broadcasting their programs using analog technologies. Consequently, the start and end of programs are either not explicitly signaled or are signaled incorrectly by the broadcaster.

[8] One of the features under investigation for such systems is content detection. For example, a system that can detect commercials may allow substitute advertisements to be inserted in a video stream ("commercial swapping") or temporary halting of the video at the end of a commercial to prevent a user, momentarily distracted during a commercial, from missing any of the main program content.

[9] There are known methods for detecting commercials. One method is the detection of high cut rate due to a sudden change in the scene with no fade or movement transition between temporally-adjacent frames. Cuts can include fades so the cuts do not have to be hard cuts. A more robust criterion may be high transition rates. Another indicator is the presence of a black frame (or monochrome frame) coupled with silence, which may indicate the beginning of a commercial break. Another known indicator of commercials is high activity, an indicator derived from the observation/assumption that objects move faster and change more frequently during commercials than during the feature (non-commercial) material. These methods show somewhat promising results, but reliability is still wanting.

[10] Hence, it is an object of the present invention to solve the problem of providing an accurate method of content detection.

[11] The above and other problems are solved by a method of detecting content in a multimedia signal, the method comprising:

[12] - providing a multimedia signal comprising video data and corresponding audio data;

[13] - determining an audio fingerprint data item from a predetermined part of the audio data;

[14] - comparing the determined audio fingerprint data item with at least one of a number of reference audio fingerprint data items each related to a corresponding content element; and

[15] - if at least a first one of the number of reference audio fingerprint data items is identified to correspond to the determined audio fingerprint data item, identifying the corresponding related content element as detected.

[16] Hence, by identifying content in a multimedia signal based on a comparison of audio fingerprints with previously determined and stored reference audio fingerprints, a robust and efficient method of identifying content in a multimedia signal is provided. Hence, a method is provided for robustly recognizing the specific content of the received multimedia signal, e.g. the start or finish of a specific television program, such as a specific television show, a specific news program, the start or finish of commercials, or the like.

[17] It is a further advantage of the invention that it provides a computationally efficient method of identifying content in a multimedia signal, since it is based on the processing of the audio part of a multimedia signal. Audio processing, e.g. the calculation and comparison of audio fingerprints, is less computationally complex than the processing of the actual video part of a multimedia signal.

[18] For the purpose of the current description, multimedia signals comprise a video part representing (moving) pictures and an audio part representing the associated audio content. The visual content and the audio content may be encoded in a multimedia signal by a number of different encoding schemes. For example, the multimedia signal may represent a sequence of picture frames that are grouped into blocks, where each block comprises a number of frames and the associated audio data. Known encoding schemes for multimedia signals include the schemes provided by the Moving Pictures Experts Group (MPEG).

[19] For the purpose of the present description, the term multimedia signal refers to an analog or digital signal or data comprising the actual video content and the associated audio content to be presented together, preferably synchronized, with the video content. The data representing the video content alone will be referred to as video data, while the data representing the audio content will be referred to as audio data.

[20] It has been realized by the inventors that audio fingerprints provide a particularly reliable method of recognizing specific content in a multimedia signal. For example, it has been observed that audio trailers of news programs, television shows, etc. remain unchanged over long periods of time and, therefore, provide a reliable source for recognizing these programs.

[21] Here, the term audio fingerprint comprises any suitable method of extracting robust features from audio data indicative of the audio content in such data, and storing the extracted features in a compact form. Hence, an audio fingerprint is a representation of the corresponding audio content in question. Preferably, the fingerprint is shorter than the original audio signal. Furthermore, preferably the fingerprint represents the most relevant perceptual features of the audio signal in question. Such fingerprints are sometimes also known as "robust hashes". The term robust hashes refers to a hash function which, to a certain extent, is robust with respect to data processing and signal degradation, e.g. due to compression/decompression, coding, AD/DA conversion, etc. Robust hashes are sometimes also referred to as robust summaries, robust signatures, or perceptual hashes.
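
The notion of a robust hash can be illustrated by the following minimal sketch. It is not the fingerprinting method of the invention or of the cited reference; it is a toy scheme, assumed here purely for illustration, that keeps one bit per pair of adjacent frames (whether the frame energy rises or not) and compares two fingerprints by their bit error rate. The names toy_fingerprint, bit_error_rate, make_signal and the frame length of 64 samples are hypothetical.

```python
import math
from typing import List, Sequence


def toy_fingerprint(samples: Sequence[float], frame_len: int = 64) -> List[int]:
    """Return one bit per pair of adjacent frames: 1 if the energy rises, else 0."""
    # Split the signal into non-overlapping frames and compute each frame's energy.
    energies = [
        sum(x * x for x in samples[i:i + frame_len])
        for i in range(0, len(samples) - frame_len + 1, frame_len)
    ]
    # Keep only the sign of the energy change between consecutive frames.
    return [1 if b > a else 0 for a, b in zip(energies, energies[1:])]


def bit_error_rate(fp_a: Sequence[int], fp_b: Sequence[int]) -> float:
    """Fraction of differing bits between two equal-length fingerprints."""
    if len(fp_a) != len(fp_b) or not fp_a:
        raise ValueError("fingerprints must be non-empty and of equal length")
    return sum(a != b for a, b in zip(fp_a, fp_b)) / len(fp_a)


def make_signal(levels: Sequence[float], frame_len: int = 64) -> List[float]:
    """Hypothetical test signal: one sine burst per frame with the given amplitude."""
    return [lvl * math.sin(0.3 * n) for lvl in levels for n in range(frame_len)]


original = make_signal([1, 3, 2, 5, 4, 4.5, 1, 2, 6])
quieter = [0.5 * x for x in original]  # simulated degradation: a gain change

fp_original = toy_fingerprint(original)
fp_quieter = toy_fingerprint(quieter)
```

In this sketch the fingerprint is much shorter than the signal (8 bits for 576 samples), and the gain change leaves it unchanged because only the direction of the energy change is kept. A practical fingerprint would need to tolerate many other kinds of degradation as well.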

[22] The term content element comprises any fragment of a video program, e.g. a trailer, a leader, a jingle, or the like. A content element may further comprise the entire content of a video program, e.g. an entire commercial, or the like.

[23] According to the invention, the audio fingerprints of a large number of video programs are stored, e.g. in a database. Hence, the content in a multimedia signal is recognized by computing an audio fingerprint of the associated audio content and by performing a lookup or query in the database using the computed fingerprint as a lookup key or query parameter. It is understood that more than one fingerprint may be associated with a given content element.

[24] The reference audio fingerprints may be stored in a database locally in the device, e.g. a PVR, thereby allowing efficient content identification by the device without the need for establishing a communications link to a remote fingerprint server.

[25] Alternatively, the matching of the computed fingerprints against reference fingerprints may be done at a remote location, for example on a server connected to the Internet. In this embodiment, the client device computes the fingerprint and sends it to the server, which returns a content identifier. It is understood, that a combination of a remote and a local database may be used too, e.g. for supplementing a remote database with a personal local database of fingerprints related to content of personal interest of the user.
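
The combination of a local and a remote database may be sketched as follows. The sketch assumes, for illustration only, that the local database is consulted first and that the remote server is queried only when no local match is found; the text above does not prescribe this order. The function identify and the callable query_remote are hypothetical stand-ins, and no real network protocol is implied.

```python
from typing import Callable, Mapping, Optional, Tuple

Fingerprint = Tuple[int, ...]


def identify(
    fingerprint: Fingerprint,
    local_refs: Mapping[Fingerprint, str],
    query_remote: Callable[[Fingerprint], Optional[str]],
) -> Optional[str]:
    """Return a content identifier, preferring the personal local database."""
    content_id = local_refs.get(fingerprint)
    if content_id is not None:
        return content_id  # recognized without any communications link
    # Assumption: the server returns a content identifier, or None if unknown.
    return query_remote(fingerprint)


# Hypothetical data: one personal entry locally, one entry known only remotely.
local_refs = {(1, 0, 1, 1): "favourite-show-leader"}
remote_refs = {(0, 1, 1, 0): "commercial-block-leader"}
remote_calls = []


def fake_remote(fp: Fingerprint) -> Optional[str]:
    remote_calls.append(fp)  # record each simulated server query
    return remote_refs.get(fp)


local_hit = identify((1, 0, 1, 1), local_refs, fake_remote)
remote_hit = identify((0, 1, 1, 0), local_refs, fake_remote)
miss = identify((1, 1, 1, 1), local_refs, fake_remote)
```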

[26] In one embodiment, the extracted fingerprint data items may be added to the set of reference audio fingerprint data items, e.g. subject to an approval by a user, thereby gradually extending the set of reference audio fingerprints.

[27] In a preferred embodiment, the method further comprises:

[28] - storing each of the number of reference audio fingerprint data items in relation to a corresponding video content identifier; and

[29] - if at least a first one of the number of reference audio fingerprint data items is identified to correspond to the determined audio fingerprint data item, retrieving the video content identifier corresponding to the identified first reference audio fingerprint data item.

[30] Hence, the audio fingerprint data items are stored along with a video content identifier, i.e. an identification of their respective content, e.g. the title of the program, a number identifying the program, or the like. Accordingly, a lookup in the database returns the stored video content identifier indicative of the content of the recorded video program.
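
A minimal sketch of such a store is given below, assuming that fingerprints are short bit tuples and that a reference corresponds to a determined fingerprint when at most a small number of bits differ. The class ReferenceStore, the tolerance max_bit_errors and the identifiers used are hypothetical; a practical database would use an index rather than the linear scan shown here.

```python
from typing import Dict, Optional, Sequence, Tuple

Fingerprint = Tuple[int, ...]


class ReferenceStore:
    """Reference fingerprints stored in relation to a video content identifier."""

    def __init__(self, max_bit_errors: int = 2) -> None:
        self.max_bit_errors = max_bit_errors  # hypothetical matching tolerance
        self._items: Dict[Fingerprint, str] = {}

    def add(self, fingerprint: Sequence[int], content_id: str) -> None:
        # Several fingerprints may map to the same content identifier.
        self._items[tuple(fingerprint)] = content_id

    def lookup(self, fingerprint: Sequence[int]) -> Optional[str]:
        key = tuple(fingerprint)
        if key in self._items:  # exact match: use the fingerprint as lookup key
            return self._items[key]
        # Otherwise accept the closest reference within the tolerance.
        best_id, best_dist = None, self.max_bit_errors + 1
        for ref, content_id in self._items.items():
            if len(ref) != len(key):
                continue
            dist = sum(a != b for a, b in zip(ref, key))
            if dist < best_dist:
                best_id, best_dist = content_id, dist
        return best_id


store = ReferenceStore()
store.add((1, 0, 1, 1, 0, 0, 1, 0), "evening-news-leader")
store.add((0, 1, 0, 0, 1, 1, 0, 1), "commercial-block-leader")

exact = store.lookup((1, 0, 1, 1, 0, 0, 1, 0))
one_bit_off = store.lookup((1, 0, 1, 1, 0, 0, 1, 1))
unknown = store.lookup((1, 1, 1, 1, 1, 1, 1, 1))
```

Here a fingerprint that differs from a stored reference in a single bit still retrieves the stored identifier, while a fingerprint that is far from every reference yields no identifier.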

[31] According to another preferred embodiment of the invention, the method further comprises:

[32] - recording a video program resulting in a multimedia signal;

[33] - identifying at least a predetermined part of the video program corresponding to a predetermined content element; and

[34] - generating at least one audio fingerprint data item corresponding to the predetermined part of the video program.

[35] It is understood that a predetermined part of a video program may comprise the entire program.

[36] According to a further preferred embodiment, the method further comprises:

[37] - comparing the generated audio fingerprint data item with at least one previously generated audio fingerprint data item to generate viewing frequency information indicative of a number of times a corresponding content element has previously been presented to a user; and

[38] - storing the generated audio fingerprint data item in relation to the generated viewing frequency information.

[39] Hence, it is an advantage of the invention, that information about how often a given video content has been watched by a user may be stored. Based on this data, the PVR can compile a 'hot list' of frequently occurring audio fingerprint information derived from the recorded programs that are previously stored/recorded onto the hard disk of the PVR. Using this hot list, the PVR can automatically record such programs in the future, even for programs and channels that do not have an entry associated with them in the EPG. At the same time this information as derived from the contents stored on the hard disk of the PVR can be used to improve the profile of the user.
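
The compilation of such a hot list may be sketched as follows, assuming for simplicity that identical content yields identical fingerprints, so that counting exact repeats gives the viewing frequency. The function build_hot_list, the threshold min_views and the fingerprint labels are hypothetical.

```python
from collections import Counter
from typing import Hashable, Iterable, List


def build_hot_list(viewed: Iterable[Hashable], min_views: int = 3) -> List[Hashable]:
    """Return fingerprints seen at least min_views times, most frequent first."""
    counts = Counter(viewed)  # fingerprint -> viewing frequency information
    return [fp for fp, n in counts.most_common() if n >= min_views]


# Hypothetical viewing history: one label per presented content element.
history = ["fp_news"] * 4 + ["fp_quiz"] * 3 + ["fp_film"]
hot_list = build_hot_list(history)
```

A recorder could then use the resulting list either to schedule recordings of the corresponding content or, as described next, to exclude it.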

[40] Alternatively or additionally, such a list of frequently occurring fingerprints may be used to trigger other types of decisions, e.g. to control the PVR not to record the video content corresponding to selected ones of the frequent fingerprints, as they may relate to commercials or the like.

[41] There are several advantages in storing audio fingerprints in a database instead of the multimedia signal itself. To name a few:

[42] - The memory/storage requirements for the database are reduced.

[43] - The comparison of fingerprints is more efficient than the comparison of the multimedia signal, as fingerprints are substantially shorter than the signals they are calculated from.

[44] - Searching in a database for a matching fingerprint is more efficient than searching for a complete video signal, since it involves matching shorter items.

[45] - Searching for a matching fingerprint is more likely to be successful, as small changes to a video signal (such as encoding in a different format or changing the bit rate) do not affect the fingerprint.

[46] An example of a method of generating an audio fingerprint is described in Jaap Haitsma, Ton Kalker and Job Oostveen, "Robust Audio Hashing For Content Identification", International Workshop on Content-Based Multimedia Indexing, Brescia, September 2001, which discloses the computation of audio fingerprints and the obtaining of identifiers from them as such.

[47] Preferably, the audio fingerprints are calculated for certain characteristic parts of a video program only, e.g. leaders and/or trailers of video programs, thereby reducing the amount of data that has to be calculated, stored, and compared.

[48] According to a preferred embodiment of the invention, the method further comprises controlling a video recording device in response to the retrieved video content identifier.

[49] It is an advantage of the invention that it provides an improved control of a video recording device. In particular, it is an advantage of the invention that it provides an improved accuracy of commercial detection. In a further preferred embodiment, the method of identifying content in a multimedia signal according to the invention is combined with other commercial detection algorithms, thereby significantly improving the reliability and accuracy of existing algorithms. Usually commercial blocks are separated from the normal program by a leader and trailer that signal the start and end of these blocks, respectively. Since these leaders and trailers only rarely change (typically at most once a season) they are well suited for identification by the audio fingerprinting according to the invention.

[50] According to another preferred embodiment of the invention, the method further comprises communicating information about the detected content element to a remote data processing system. Hence, the invention provides a mechanism for providing feedback information about the viewed programs by the user of a PVR to a third party.

[51] The present invention can be implemented in different ways including the method described above and in the following, further methods and arrangements, a video recorder, and further product means, each yielding one or more of the benefits and advantages described in connection with the first-mentioned method, and each having one or more preferred embodiments corresponding to the preferred embodiments described in connection with the first-mentioned method and disclosed in the dependent claims.

[52] A second aspect of the invention relates to a method of recording a video program by a video recording device. One of the interesting features of modern PVRs is the possibility of a pre-programmed recording of a series of programs, e.g. a complete set of episodes of a television series. However, such programs do not always start and stop exactly at the times indicated in the (electronic) program guide. This can be caused by a number of different reasons, such as cancellation, delays or radio interference. Furthermore, programs may be longer than anticipated, e.g. due to inserted news flashes or extra time in sports broadcasts.

[53] The above problem is solved by a method of recording a video program by a video recording device, the method comprising:

[54] - detecting a content element in a multimedia signal corresponding to a predetermined part of the video program by performing the steps of the first-mentioned method;

[55] - controlling a recording operation of the video recording device in response to the detected content element.

[56] Hence, by detecting the content corresponding to a predetermined part of a video program, such as leaders and trailers according to the invention, recorded programs can be assured to be complete, and they do not take up more disc space than strictly necessary. Hence, the accuracy of the start and end times of programmed recordings is improved significantly.
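
As a minimal sketch of this aspect, the following function derives the recorded time spans from a stream of detection events, assuming that each event carries a timestamp in seconds and the identifier of the detected content element. The function recording_spans and the identifiers are hypothetical, and whether the leader and trailer themselves are included in the recording is left open here.

```python
from typing import Iterable, List, Optional, Tuple


def recording_spans(
    events: Iterable[Tuple[float, str]],
    leader_id: str,
    trailer_id: str,
) -> List[Tuple[float, float]]:
    """Start recording at the program leader and stop at the program trailer."""
    spans: List[Tuple[float, float]] = []
    start: Optional[float] = None
    for timestamp, content_id in events:
        if content_id == leader_id and start is None:
            start = timestamp  # leader detected: begin recording
        elif content_id == trailer_id and start is not None:
            spans.append((start, timestamp))  # trailer detected: stop recording
            start = None
    return spans


# Hypothetical detections for a program announced for 1800 s but starting late.
events = [
    (1795.0, "other-program-trailer"),
    (1812.0, "series-leader"),
    (3590.0, "commercial-block-leader"),
    (4510.0, "series-trailer"),
]
spans = recording_spans(events, "series-leader", "series-trailer")
```

The recording follows the detected leader and trailer rather than the announced times, so a delayed start does not truncate the program.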

[57] It is a further advantage of the invention that it facilitates the recording of programs that are not scheduled and, consequently, not announced in the (Electronic) Program Guide at all. Such unannounced content can nevertheless be very interesting to the users. Examples of such unannounced content include news flashes, weather forecasts and stock market updates that might be broadcast at random intervals during the day (without explicitly being listed in the EPG).

[58] According to yet another aspect of the invention, the first-mentioned method may be used in a PVR as a tool to provide feedback information about the viewed programs by the user of the PVR to a third party. For example, the method may be applied as a tool to market products offered in commercials or during regular programs, thereby enabling e-commerce applications.

[59] Accordingly, the invention relates to a method of communicating information about a content element of a video program, the method comprising:

[60] - receiving an audio fingerprint data item generated by a device for presenting multimedia content, the audio fingerprint data item representing a predetermined content element of the presented multimedia content;

[61] - providing a number of reference audio fingerprint data items each related to a corresponding content element;

[62] - comparing the received audio fingerprint data item with at least one of the number of reference audio fingerprint data items; and

[63] - if at least a first one of the number of reference audio fingerprint data items is identified to correspond to the determined audio fingerprint data item, identifying the corresponding related content element as detected.

[64] For example, as soon as an item of interest is shown to a user within a video program by a device for presenting multimedia content, such as a PVR, a television set, or the like, the user can cause an audio fingerprint data item to be sent back to the provider of the television program or to a third party, thereby showing his/her interest in this item. No additional data needs to be sent with the audio/video stream to identify the product. Consequently, it is an advantage that no hardware is required at the head-end, broadcaster, or other provider of the video program to create such data and insert it into the audio/video stream. Furthermore, there is no longer a need to involve the broadcaster in this e-commerce chain. The transaction can be limited to the end-user and the third party, e.g. via the Internet or another communications channel, e.g. a telecommunications link.

[65] It is noted that the features of the methods described above and in the following may be implemented in software and carried out in a data processing system or other processing means caused by the execution of computer-executable instructions. The instructions may be program code means loaded in a memory, such as a RAM, from a storage medium or from another computer via a computer network. Alternatively, the described features may be implemented by hardwired circuitry instead of software or in combination with software.

[66] The invention further relates to an arrangement for detecting content in a multimedia signal, the arrangement comprising:

[67] - means for providing a multimedia signal, the multimedia signal comprising video data and corresponding audio data;

[68] - processing means for determining an audio fingerprint data item from a predetermined part of the audio data;

[69] - processing means adapted to compare the determined audio fingerprint data item with at least one of a number of reference audio fingerprint data items each related to a corresponding content element and, if at least a first one of the number of reference audio fingerprint data items is identified to correspond to the determined audio fingerprint data item, to identify the corresponding related content element as detected.

[70] Here and in the following, the term processing means comprises general- or special-purpose programmable microprocessors, Digital Signal Processors (DSP), Application Specific Integrated Circuits (ASIC), Programmable Logic Arrays (PLA), Field Programmable Gate Arrays (FPGA), special purpose electronic circuits, etc., or a combination thereof.

[71] Examples of storage means include magnetic tape, optical disc, digital video disk (DVD), compact disc (CD or CD-ROM), mini-disc, hard disk, floppy disk, ferroelectric memory, electrically erasable programmable read only memory (EEPROM), flash memory, EPROM, read only memory (ROM), static random access memory (SRAM), dynamic random access memory (DRAM), synchronous dynamic random access memory (SDRAM), ferromagnetic memory, optical storage, charge coupled devices, smart cards, PCMCIA card, etc.

[72] The arrangement may further comprise means for providing a number of reference audio fingerprint data items. Such means may comprise storage means for storing such data items, communications means for receiving such data items, or any other circuitry or device suitable for providing such data items.

[73] The means for providing a multimedia signal may comprise any circuitry or device for receiving a multimedia signal, such as a receiver, e.g. a television receiver, a satellite receiver, a storage means for multimedia signals, or any other suitable communications means.

[74] The arrangement may further comprise communications means for communicating information about the detected content element, e.g. a retrieved video content identifier, to a remote data processing system.

[75] Here and in the following, the term communications means comprises circuitry and/or devices suitable for enabling the communication of data, e.g. via a wired or a wireless data link. Examples of such communications means include a network interface, a network card, a radio transmitter/receiver, a cable modem, a telephone modem, an Integrated Services Digital Network (ISDN) adapter, a Digital Subscriber Line (DSL) adapter, a satellite transceiver, an Ethernet adapter, or the like.

[76] The invention further relates to a video recorder comprising such an arrangement.

[77] The invention further relates to a system for communicating information about a content element of a video program, the system comprising:

[78] a device for presenting a multimedia signal, the multimedia signal comprising video data and corresponding audio data, the device for presenting a multimedia signal comprising

[79] - processing means for determining an audio fingerprint data item from a predetermined part of the presented audio data;

[80] - communications means for transmitting the determined audio fingerprint data item;

[81] a data processing system comprising

[82] - communications means for receiving the transmitted audio fingerprint data item;

[83] - storage means having stored thereon a number of reference audio fingerprint data items each related to a corresponding content element;

[84] - processing means adapted to compare the received audio fingerprint data item with at least one of the number of reference audio fingerprint data items, and, if at least a first one of the number of reference audio fingerprint data items is identified to correspond to the determined audio fingerprint data item, to identify the corresponding related content element as detected.

[85] The invention further relates to a device for presenting a multimedia signal in such a system and a data processing system in such a system.

[86] These and other aspects of the invention will be apparent and elucidated from the embodiments described in the following with reference to the drawing in which:

[87] fig. 1 shows a block diagram of a video recorder including an arrangement for detecting content in a multimedia signal according to an embodiment of the invention;

[88] fig. 2 shows a flow diagram of a method of detecting content in a multimedia signal according to an embodiment of the invention;

[89] fig. 3 shows a block diagram of a video recorder comprising an arrangement for generating reference audio fingerprint data items according to an embodiment of the invention;

[90] fig. 4 schematically shows a fingerprint database module according to an embodiment of the invention; and

[91] fig. 5 shows a system for communicating information about multimedia content according to an embodiment of the invention.

[92] Fig. 1 shows a block diagram of a video recorder comprising an arrangement for detecting content in a multimedia signal according to an embodiment of the invention. The video recorder 101 receives a multimedia data stream 107. The multimedia data stream comprises video data representing moving pictures and audio data representing the corresponding audio content. The multimedia data may be received by a receiver, e.g. a digital or analog receiver for receiving television programs or other multimedia content, e.g. via an antenna, a cable network, a satellite, or another communications network, such as the Internet. The multimedia data may be coded according to any suitable coding scheme, e.g. an MPEG scheme. It is noted that the multimedia data may comprise a plurality of parallel data streams corresponding to a plurality of video channels or the like.

[93] The video recorder 101 comprises a recorder block 102 which stores received multimedia data 107 onto a storage medium 103. The storage medium 103 may be a hard disk or any other suitable storage means.

[94] It is noted that the recorder may record one or more video programs at the same time.

[95] According to the invention, the video recorder 101 further comprises an audio fingerprint calculation block 104 which receives the input data and computes one or more audio fingerprints from the audio component of the input data.

[96] The video recorder further comprises a fingerprint database module 105 which receives the calculated audio fingerprint item(s) from the audio fingerprint calculation block 104. The fingerprint database module 105 has further access to a fingerprint database 106 which comprises a number of reference fingerprint data items. Based on a comparison of the calculated audio fingerprint item(s) and the reference fingerprint data items in the database 106, the fingerprint database module 105 controls the recording operations of the recorder block 102, e.g. by starting a recording, stopping, pausing, resuming a recording, or the like.

[97] It is noted that, instead of a database 106, the reference fingerprints may be stored in a different way, e.g. as files in a file system. It is an advantage of a database system that it allows an efficient search when a large number of reference fingerprints are stored.

[98] It is understood that, in one embodiment, the multimedia input data 107 may originate from the storage medium 103, i.e. the multimedia data may be previously stored program material which is to be re-edited, e.g. in order to remove commercials or other unwanted material, and the processed data is stored on the storage medium 103.

[99] Fig. 2 shows a flow diagram of a method of detecting content in a multimedia signal according to an embodiment of the invention. In an initial step 201 a recording device receives a segment of an input signal representing multimedia data comprising video data and audio data. For example, the received segment may comprise a predetermined number of video frames and the corresponding audio data. In step 202, a fingerprint H is calculated for a segment of the audio data of the received multimedia data.

[100] In step 203, the calculated fingerprint is compared to the reference fingerprints stored in a database 106, e.g. by performing a database query using the calculated fingerprint H as a key. If no matching reference fingerprint is found, i.e. no content is recognized in the multimedia data, the process returns to step 201 receiving a next segment of the input signal. If a match is found, a predetermined content corresponding to the matching reference audio fingerprint is determined to be detected (step 204) in the multimedia data, and the process continues at step 205.

[101] In step 205, additional data corresponding to the identified content is retrieved from the database 106. In one embodiment this information comprises a content identifier, e.g. a title of a video program or another suitable identifier identifying a specific program, a specific series or type of programs, e.g. identifying the content as an episode of a specific series or show, as a news flash of a specific news program, as a specific commercial for a specific product, or the like.

[102] Alternatively or additionally, the additional data comprises control code information indicative of a predetermined operation to be initiated in response to the detected content, such as displaying additional information, starting or stopping a recording, etc. The operation may further be conditioned on an explicit acknowledgment by a user.

[103] In step 206, the operation of the video recorder is controlled according to the detected content. Examples of the control of operations include the pausing of a recording at the beginning of a detected commercial, news program, etc., which interrupts the current program, and resuming the recording at the end of the commercial or the like. This skipping of (single) commercials can also be applied in live viewing situations. For fine tuning, i.e. determining a frame-accurate in- and out-point to pause and resume the recording, additional commercial detection, scene-change and/or genre-change detection algorithms can be used in combination with the present invention. It is understood that a commercial enforcement may also be implemented.

[104] By detecting the leaders and trailers of the program to be recorded and of the program material to be skipped, it can be assured that the program to be recorded is recorded in full, even if the start and/or finish times differ from the announced times. Furthermore, unwanted program material may efficiently be excluded from the recording. Hence, recorded programs do not take up more disc space than strictly necessary.

[105] If the process is not stopped (step 207), the process continues at step 201 and receives the next signal segment.
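
The flow of steps 201 to 207 can be summarized in a short sketch. It is an illustration only: the fingerprint function is passed in as a parameter, the reference database is modelled as a plain dictionary, and the entry fields `content_id` and `control_code` as well as the control code strings used with it are hypothetical names.

```python
from typing import Callable, Dict, Hashable, Iterable, List


def detection_loop(
    segments: Iterable[str],
    fingerprint_of: Callable[[str], Hashable],
    reference_db: Dict[Hashable, Dict[str, str]],
    should_stop: Callable[[], bool] = lambda: False,
) -> List[str]:
    """Run the detection process over signal segments; return issued control codes."""
    issued: List[str] = []
    for segment in segments:                   # step 201: receive next segment
        fingerprint = fingerprint_of(segment)  # step 202: compute fingerprint
        entry = reference_db.get(fingerprint)  # step 203: compare with references
        if entry is None:
            continue                           # no match: back to step 201
        # Step 204: content detected. Step 205: additional data is in `entry`.
        issued.append(entry["control_code"])   # step 206: control the recorder
        if should_stop():                      # step 207: stop if requested
            break
    return issued
```

In a real device step 206 would drive the recorder block directly; collecting the control codes in a list merely keeps the sketch self-contained.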

[106] Fig. 3 shows a block diagram of a video recorder comprising an arrangement for generating reference audio fingerprint data items according to an embodiment of the invention. As described above, the recognition of content in a multimedia signal is based on a comparison with a number of reference audio fingerprints, e.g. of leaders and trailers of commercial blocks and/or programs of interest, available in the database 106. The video recorder 101 of fig. 3 comprises an arrangement for acquiring such information. The video recorder 101 receives input multimedia data 107, e.g. during recordings or during a separate configuration or training session. The video recorder comprises a recorder block 102 that stores a recording of a predetermined program material on a storage medium 103. The video recorder further comprises a fingerprint calculation block 104 that receives the input data and computes audio fingerprints from the audio component of the input data. The generated audio fingerprint information is written as a separate stream or file to the storage medium 103. Alternatively, the fingerprint information may be embedded in private data of the encoded multimedia stream. The video recorder further comprises a fingerprint management block 301 which retrieves previously recorded multimedia data and the corresponding fingerprints from the storage medium 103 and identifies the fingerprint information associated with leaders and trailers of commercials and other programs. The identified fingerprint information is stored in the fingerprint database 106.

[107] In one embodiment, this identification of relevant fingerprint information is performed by comparing the beginning and ending of different recordings of the same program, e.g. of different episodes of a television series or show or of different news bulletins. In one embodiment, the fingerprint management block 301 provides a user interface allowing a user to select a number of stored recordings to be used as a basis for identifying fingerprint data. The identification of key program segments providing recognizable program content which is indicative of a given program may be performed automatically, e.g. by correlating the fingerprint information of different episodes. In another embodiment, the fingerprint management block 301 may provide a user interface allowing a user to explicitly select such key fragments of a program. In yet a further embodiment, an automatic and a manual identification of key fragments are combined, e.g. by requesting a confirmation from the user on fragments identified by the video recorder. It is noted that the present invention may be combined with other automatic commercial detectors in order to identify the boundaries of the commercials and/or the boundaries of the single commercial clips to be stored in the fingerprint database.
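
A minimal sketch of the automatic identification follows. It assumes that each recording is available as a sequence of per-block fingerprints and that identical material yields identical fingerprints, which is a simplification of the correlation mentioned above. The function name and the `window` parameter are hypothetical.

```python
from typing import List, Sequence, Set


def common_leader_fingerprints(
    recordings: Sequence[Sequence[int]], window: int
) -> List[int]:
    """Fingerprints found in the first `window` blocks of every recording.

    The result keeps the order in which the fingerprints occur in the
    first recording.
    """
    if not recordings:
        return []
    shared: Set[int] = set(recordings[0][:window])
    for recording in recordings[1:]:
        shared &= set(recording[:window])
    return [fp for fp in recordings[0][:window] if fp in shared]
```

Applying the same function to the reversed sequences would give candidate trailer fingerprints. The candidates could then be presented to the user for confirmation, as in the combined embodiment.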

[108] The fingerprint management block 301 further provides a user interface allowing a user to input additional data, such as a content identifier or other descriptive information about the content of the program material. As mentioned above, the additional data may further comprise control code information, e.g. an indication whether the identified content is to be recorded whenever detected, whether the identified content is to be skipped during recordings, etc.

[109] For example, in one embodiment, the method according to the invention may be used to implement a commercial skip on a commercial-by-commercial basis. Some people like to watch commercials and only tend to get bored after repeated viewings. Hence, a commercial 'thumbs down' button may be supplied allowing the user to disqualify a specific commercial that will be skipped automatically in the future.

[110] In one embodiment, the fingerprint management block 301 can compile a 'hot list' of frequently occurring audio fingerprints derived from the recorded programs that are previously stored on the storage medium 103. Using this hot list, a number of decisions are facilitated. For example, the video recorder may be controlled to automatically record such programs in the future, even programs and channels that do not have an entry associated with them in the EPG. At the same time this information as derived from the hard disk contents can be used to improve the profile of the user.
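
Compiling such a hot list can be sketched as a simple frequency count. The sketch assumes that each stored recording is represented by the list of its fingerprints; the function name and the threshold `min_recordings` are hypothetical.

```python
from collections import Counter
from typing import Iterable, List, Sequence


def hot_list(recordings: Iterable[Sequence[int]], min_recordings: int = 2) -> List[int]:
    """Fingerprints occurring in at least `min_recordings` recordings.

    The result is sorted by decreasing frequency, ties by fingerprint value.
    """
    counts: Counter = Counter()
    for fingerprints in recordings:
        # Count each fingerprint at most once per recording.
        counts.update(set(fingerprints))
    frequent = [fp for fp, n in counts.items() if n >= min_recordings]
    return sorted(frequent, key=lambda fp: (-counts[fp], fp))
```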

[111] It is noted that alternative methods of acquiring reference fingerprint data may be employed. For example, the fingerprint information may be made available by service providers, e.g. on a web-site of the Internet. Thus, the reference fingerprint information may be accessed on-line via the Internet, or the fingerprint information may be downloaded and stored in the video recorder, e.g. embedded into the EPG. Alternatively or additionally, fingerprints may be distributed on a storage medium, such as on a disc, e.g. a CD, DVD, etc.

[112] Fig. 4 schematically shows a fingerprint database module according to an embodiment of the invention. The fingerprint database module 105 comprises an input module 401, a Database Management System (DBMS) backend module 403, and a response module 404.

[113] The input module 401 receives an audio fingerprint and supplies the fingerprint to the DBMS backend module 403. The DBMS backend module 403 performs a query on the database 106 to identify any matching reference fingerprints and to retrieve any additional data associated with the matching reference fingerprint. As shown in Fig. 4, the database 106 comprises fingerprints FP1, FP2, FP3, FP4 and FP5 and respective associated sets of additional information D1, D2, D3, D4 and D5. International patent application WO 02/065782, which is included herein by reference in its entirety, describes various matching strategies for matching fingerprints computed for an audio clip with fingerprints stored in a database as well as an efficient method of matching a fingerprint representing an unknown information signal with a plurality of fingerprints of identified information signals stored in a database to identify the unknown signal. This method uses reliability information of the extracted fingerprint bits. The fingerprint bits are determined by computing features of an information signal and thresholding said features to obtain the fingerprint bits. If a feature has a value very close to the threshold, a small change in the signal may lead to a fingerprint bit with opposite value. The absolute value of the difference between feature value and threshold is used to mark each fingerprint bit as reliable or unreliable. The reliabilities are subsequently used to improve the actual matching procedure.
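
The thresholding and the reliability marking can be illustrated as follows. The sketch shows one simple way of using the reliabilities, namely ignoring mismatches at unreliable bit positions; this is an assumption made for illustration and not necessarily the procedure of WO 02/065782. The function names and the `margin` parameter are hypothetical.

```python
from typing import List, Sequence, Tuple


def extract_bits(
    features: Sequence[float], threshold: float = 0.0, margin: float = 0.1
) -> Tuple[List[int], List[bool]]:
    """Threshold the features into bits and mark each bit reliable or not."""
    bits = [1 if f > threshold else 0 for f in features]
    # A bit is unreliable when its feature lies closer to the threshold than `margin`.
    reliable = [abs(f - threshold) >= margin for f in features]
    return bits, reliable


def reliable_bit_errors(
    bits: Sequence[int], reliable: Sequence[bool], reference: Sequence[int]
) -> int:
    """Count mismatches with `reference`, skipping unreliable bit positions."""
    return sum(
        1 for bit, ok, ref in zip(bits, reliable, reference) if ok and bit != ref
    )
```

In this sketch a feature value of 0.02 against a threshold of 0 yields an unreliable bit, so a mismatch at that position does not count against a candidate reference fingerprint.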

[114] The database 106 can be organized in various ways to optimize query time and/or data organization. The output from the input module 401 should be taken into account when designing the tables in the database 106. In the embodiment shown in Fig. 4, the database 106 comprises a single table with entries (records) comprising respective fingerprints and sets of additional data. The DBMS backend module 403 feeds the results of the query to the response module 404, which returns the results to a requesting application or directly generates a control signal for controlling a device, e.g. a video recorder, in response to the retrieved additional information.

[115] In one embodiment, each reference audio fingerprint data item is stored together with an associated control code, where each control code causes a video recorder to perform a specific action, such as starting a recording, stopping an ongoing recording, etc, thereby allowing a control of a video recorder based on detected content in a current video program.
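
The single-table organization with associated control codes can be sketched using an in-memory SQLite database. The table name, the column names and the control code strings are hypothetical; the description does not prescribe a particular DBMS or schema.

```python
import sqlite3
from typing import Iterable, Optional, Tuple


def build_database(rows: Iterable[Tuple[int, str, str]]) -> sqlite3.Connection:
    """Create a single table of (fingerprint, content_id, control_code) records."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE reference ("
        "fingerprint INTEGER PRIMARY KEY, "
        "content_id TEXT NOT NULL, "
        "control_code TEXT NOT NULL)"
    )
    conn.executemany("INSERT INTO reference VALUES (?, ?, ?)", rows)
    return conn


def query(conn: sqlite3.Connection, fingerprint: int) -> Optional[Tuple[str, str]]:
    """Return (content_id, control_code) for the fingerprint, or None."""
    return conn.execute(
        "SELECT content_id, control_code FROM reference WHERE fingerprint = ?",
        (fingerprint,),
    ).fetchone()
```

Declaring the fingerprint as the primary key lets the DBMS index it, which matches the idea of using the fingerprint as the query key.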

[116] Fig. 5 shows a system for communicating information about multimedia content according to an embodiment of the invention. The system comprises a set-top box 501 which receives a multimedia data stream 502, e.g. via a cable network, a satellite, a communications network, the Internet, or the like. The set-top box comprises a control unit 503 which feeds the multimedia data to a television set 512 for presentation to a user. The set-top box further comprises a user interface module 505 for providing a user interface allowing a user to select programs to be viewed, etc.

[117] According to the invention, the set-top box 501 further comprises a fingerprint calculation module 504 which receives the input multimedia data stream 502 and computes audio fingerprint information from the audio data of the received input. As soon as an item of interest is shown in the program, the user can, via the user interface 505, initiate the transmission of the computed fingerprint data for a selected program fragment to a service provider or other third party, thereby indicating his/her interest in the displayed item.

[118] Consequently, according to this embodiment, the set-top box 501 comprises a communications interface 506, e.g. a modem, a network adapter, or the like, for transmitting the computed and selected fingerprint data to a service provider system 507, e.g. a network server or other data processing system. Additionally, further information may be transmitted, such as the identification of the sending set-top box, an indication of the type of interest, e.g. a request for further information, a purchase order, etc.

[119] The service provider system 507 comprises a corresponding communications interface 508 for receiving the fingerprint data along with any additional information transmitted by the set-top box 501. The service provider system further comprises a fingerprint database module 509 and a fingerprint database 510 as described above. The fingerprint database module 509 compares the received fingerprint data item with reference fingerprints in the database 510. If a matching reference fingerprint is found, the provider system 507 initiates a suitable transaction in response to the recognized item viewed by the user of the set-top box. For example, the provider system may send additional information, initiate a purchase transaction, send a control signal back to the set-top box, thereby causing the set-top box to display a suitable menu, or the like.
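
The handling of a received fingerprint by the service provider system 507 can be sketched as follows. The message fields, the interest values `"info"` and `"purchase"` and the action names are hypothetical; the description only states that such information may accompany the fingerprint data.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class Request:
    """Message sent by the set-top box (field names are assumptions)."""
    device_id: str
    fingerprint: int
    interest: str  # e.g. "info" or "purchase"


def handle_request(
    request: Request, reference_db: Dict[int, str]
) -> Optional[Dict[str, str]]:
    """Match the received fingerprint and choose a transaction, or return None."""
    item = reference_db.get(request.fingerprint)
    if item is None:
        return None  # no matching reference fingerprint
    action = "start_purchase" if request.interest == "purchase" else "send_info"
    return {"device_id": request.device_id, "item": item, "action": action}
```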

[120] It is an advantage of this embodiment that third-party product vendors or service providers can offer an extra service to the end-user of the set-top box 501 without having to establish a costly infrastructure. This extra service is based on the recognition of content being broadcast by the broadcaster during regular broadcasts on existing channels without the broadcaster having to include any special metadata in the broadcast. In fact the broadcaster is not involved in the resulting e-commerce transaction, which is limited to the end-user and the service provider. The content is recognized by the service provider, e.g. set-top-box provider, based on audio fragments, thereby simplifying the required infrastructure. In particular, since no additional data needs to be sent with the multimedia stream 502, no hardware is required at the broadcaster of the multimedia data stream for creating such data and inserting it into the multimedia stream.

[121] Based on the content the user is watching, the service provider may cause e.g. an e-commerce application to be launched by the set-top box, thereby providing the user with the option of buying an item that is featured in a commercial or other broadcast, or of engaging in some other e-commerce transaction.

[122] It is noted that, alternatively, the above functionality may be implemented by a video recorder or a television set instead of a set-top box.

[123] It is noted that the above arrangements may be implemented as general- or special-purpose programmable microprocessors, Digital Signal Processors (DSP), Application Specific Integrated Circuits (ASIC), Programmable Logic Arrays (PLA), Field Programmable Gate Arrays (FPGA), special purpose electronic circuits, etc., or a combination thereof.

[124] It should be noted that the above-mentioned embodiments illustrate rather than limit the invention, and that those skilled in the art will be able to design many alternative embodiments without departing from the scope of the appended claims.

[125] For example, the invention is not limited to video recorders but may also be implemented in other devices or systems for processing multimedia data, such as set-top boxes, television sets, multimedia data viewers implemented in software or hardware, or the like.

[126] Furthermore, the invention is not limited to commercials but can easily be applied to other program material to be recorded and/or skipped from recording, in particular program material comprising identifiable fragments, such as leaders and trailers. Examples of such program material comprise inserted news programs, weather forecasts, episodes of television shows, etc.

[127] In the claims, any reference signs placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps other than those listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements.

[128] The invention can be implemented by means of hardware comprising several distinct elements, and by means of a suitably programmed computer. In the device claim enumerating several means, several of these means can be embodied by one and the same item of hardware. The mere fact that certain measures are recited in mutually different dependent claims does not indicate that a combination of these measures cannot be used to advantage.

[129] Hence, in the above, methods and systems for the detection of content in a multimedia signal have been disclosed that significantly improve the commercial detection, programming accuracy and overall functionality of PVRs. These advantages may be achieved even in the absence of EPGs, using analog broadcasts and/or on conventional devices like Video Cassette Recorders (VCRs). Furthermore, it should be noted that the above applications of content detection may also be applied to audio broadcasts, e.g. radio broadcasts, since the content detection is done on the basis of the audio signal.

Claims

[1] A method of detecting content in a multimedia signal, the method comprising: - providing a multimedia signal comprising video data and corresponding audio data; - determining an audio fingerprint data item from a predetermined part of the audio data; - comparing the determined audio fingerprint data item with at least one of a number of reference audio fingerprint data items each related to a corresponding content element; and - if at least a first one of the number of reference audio fingerprint data items is identified to correspond to the determined audio fingerprint data item, identifying the corresponding related content element as detected.

[2] A method according to claim 1, further comprising: storing each of the number of reference audio fingerprint data items in relation to a corresponding video content identifier; and - if at least a first one of the number of reference audio fingerprint data items is identified to correspond to the determined audio fingerprint data item, retrieving the video content identifier corresponding to the identified first reference audio fingerprint data item.

[3] A method according to claim 1 or 2, further comprising - recording a video program resulting in a multimedia signal; - identifying at least a predetermined part of the video program corresponding to a predetermined content element; and - generating at least one audio fingerprint data item corresponding to the predetermined part of the video program.

[4] A method according to claim 3, further comprising - comparing the generated audio fingerprint data item with at least one previously generated audio fingerprint data item to generate viewing frequency information indicative of a number of times a corresponding content element has previously been presented to a user; and - storing the generated audio fingerprint data item in relation to the generated viewing frequency information.

[5] A method according to any one of claims 1 through 4, wherein the predetermined part of the provided multimedia signal corresponds to at least one of a leader and a trailer of a predetermined video program.

[6] A method according to any one of claims 1 through 5, wherein the step of determining an audio fingerprint data item comprises calculating a robust hash value from a predetermined part of the audio content represented by the provided multimedia signal.

[7] A method according to any one of claims 1 through 6, wherein the method further comprises controlling a video recording device in response to the retrieved video content identifier.

[8] A method according to any one of claims 1 through 7, wherein the method further comprises communicating information about the detected content element to a remote data processing system.

[9] A method of recording a video program by a video recording device, the method comprising: - detecting a content element in a multimedia signal corresponding to a predetermined part of the video program by performing the steps of the method according to any one of claims 1 through 6; - controlling a recording operation of the video recording device in response to the detected content element.

[10] A method of communicating information about a content element of a video program, the method comprising: - receiving an audio fingerprint data item generated by a device for presenting multimedia content, the audio fingerprint data item representing a predetermined content element of the presented multimedia content; - providing a number of reference audio fingerprint data items each related to a corresponding content element; - comparing the received audio fingerprint data item with at least one of the number of reference audio fingerprint data items; and - if at least a first one of the number of reference audio fingerprint data items is identified to correspond to the determined audio fingerprint data item, identifying the corresponding related content element as detected.

[11] An arrangement for detecting content in a multimedia signal, the arrangement comprising: means for providing a multimedia signal, the multimedia signal comprising video data and corresponding audio data; processing means for determining an audio fingerprint data item from a predetermined part of the audio data; processing means adapted to compare the determined audio fingerprint data item with at least one of a number of reference audio fingerprint data items each related to a corresponding content element and, if at least a first one of the number of reference audio fingerprint data items is identified to correspond to the determined audio fingerprint data item, to identify the corresponding related content element as detected.

[12] A recorder for recording video program material, the recorder comprising an arrangement according to claim 10.

[13] A system for communicating information about a content element of a video program, the system comprising: a device for presenting a multimedia signal, the multimedia signal comprising video data and corresponding audio data, the device for presenting a multimedia signal comprising - processing means for determining an audio fingerprint data item from a predetermined part of the presented audio data; - communications means for transmitting the determined audio fingerprint data item; a data processing system comprising - communications means for receiving the transmitted audio fingerprint data item; - storage means having stored thereon a number of reference audio fingerprint data items each related to a corresponding content element; - processing means adapted to compare the received audio fingerprint data item with at least one of the number of reference audio fingerprint data items, and, if at least a first one of the number of reference audio fingerprint data items is identified to correspond to the determined audio fingerprint data item, to identify the corresponding related content element as detected.