WO2011121318A1 - Method and apparatus for determining playback points in recorded media content - Google Patents

Info

Publication number
WO2011121318A1
Authority
WO
WIPO (PCT)
Prior art keywords
features
media content
timings
determining
content
Prior art date
Application number
PCT/GB2011/000514
Other languages
French (fr)
Inventor
Christopher Edward Poole
Roderick Hodgson
Jeff Hunter
Original Assignee
British Broadcasting Corporation
Priority date
Filing date
Publication date
Application filed by British Broadcasting Corporation filed Critical British Broadcasting Corporation
Publication of WO2011121318A1 publication Critical patent/WO2011121318A1/en

Classifications

    • GPHYSICS
    • G11INFORMATION STORAGE
    • G11BINFORMATION STORAGE BASED ON RELATIVE MOVEMENT BETWEEN RECORD CARRIER AND TRANSDUCER
    • G11B27/00Editing; Indexing; Addressing; Timing or synchronising; Monitoring; Measuring tape travel
    • G11B27/02Editing, e.g. varying the order of information signals recorded on, or reproduced from, record carriers
    • G11B27/031Electronic editing of digitised analogue information signals, e.g. audio or video signals
    • G11B27/034Electronic editing of digitised analogue information signals, e.g. audio or video signals on discs
    • GPHYSICS
    • G11INFORMATION STORAGE
    • G11BINFORMATION STORAGE BASED ON RELATIVE MOVEMENT BETWEEN RECORD CARRIER AND TRANSDUCER
    • G11B27/00Editing; Indexing; Addressing; Timing or synchronising; Monitoring; Measuring tape travel
    • G11B27/02Editing, e.g. varying the order of information signals recorded on, or reproduced from, record carriers
    • G11B27/031Electronic editing of digitised analogue information signals, e.g. audio or video signals
    • GPHYSICS
    • G11INFORMATION STORAGE
    • G11BINFORMATION STORAGE BASED ON RELATIVE MOVEMENT BETWEEN RECORD CARRIER AND TRANSDUCER
    • G11B27/00Editing; Indexing; Addressing; Timing or synchronising; Monitoring; Measuring tape travel
    • G11B27/10Indexing; Addressing; Timing or synchronising; Measuring tape travel
    • G11B27/11Indexing; Addressing; Timing or synchronising; Measuring tape travel by using information not detectable on the record carrier
    • GPHYSICS
    • G11INFORMATION STORAGE
    • G11BINFORMATION STORAGE BASED ON RELATIVE MOVEMENT BETWEEN RECORD CARRIER AND TRANSDUCER
    • G11B27/00Editing; Indexing; Addressing; Timing or synchronising; Monitoring; Measuring tape travel
    • G11B27/10Indexing; Addressing; Timing or synchronising; Measuring tape travel
    • G11B27/19Indexing; Addressing; Timing or synchronising; Measuring tape travel by using information detectable on the record carrier
    • G11B27/28Indexing; Addressing; Timing or synchronising; Measuring tape travel by using information detectable on the record carrier by using information signals recorded by the same method as the main recording
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/433Content storage operation, e.g. storage operation in response to a pause request, caching operations
    • H04N21/4334Recording operations
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/44Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream, rendering scenes according to MPEG-4 scene graphs
    • H04N21/44008Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream, rendering scenes according to MPEG-4 scene graphs involving operations for analysing video streams, e.g. detecting features or characteristics in the video stream
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/47End-user applications
    • H04N21/472End-user interface for requesting content, additional data or services; End-user interface for interacting with content, e.g. for content reservation or setting reminders, for requesting event notification, for manipulating displayed content
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/47End-user applications
    • H04N21/488Data services, e.g. news ticker
    • H04N21/4884Data services, e.g. news ticker for displaying subtitles
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/80Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
    • H04N21/83Generation or processing of protective or descriptive data associated with content; Content structuring
    • H04N21/845Structuring of content, e.g. decomposing content into time segments
    • H04N21/8455Structuring of content, e.g. decomposing content into time segments involving pointers to the content, e.g. pointers to the I-frames of the video stream
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/80Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
    • H04N21/83Generation or processing of protective or descriptive data associated with content; Content structuring
    • H04N21/845Structuring of content, e.g. decomposing content into time segments
    • H04N21/8456Structuring of content, e.g. decomposing content into time segments by decomposing the content in the time domain, e.g. in time segments
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N9/00Details of colour television systems
    • H04N9/79Processing of colour television signals in connection with recording
    • H04N9/80Transformation of the television signal for recording, e.g. modulation, frequency changing; Inverse transformation for playback
    • H04N9/82Transformation of the television signal for recording, e.g. modulation, frequency changing; Inverse transformation for playback the individual colour picture signal components being recorded simultaneously only
    • H04N9/8205Transformation of the television signal for recording, e.g. modulation, frequency changing; Inverse transformation for playback the individual colour picture signal components being recorded simultaneously only involving the multiplexing of an additional signal and the colour video signal
    • GPHYSICS
    • G11INFORMATION STORAGE
    • G11BINFORMATION STORAGE BASED ON RELATIVE MOVEMENT BETWEEN RECORD CARRIER AND TRANSDUCER
    • G11B27/00Editing; Indexing; Addressing; Timing or synchronising; Monitoring; Measuring tape travel
    • G11B27/10Indexing; Addressing; Timing or synchronising; Measuring tape travel
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04HBROADCAST COMMUNICATION
    • H04H60/00Arrangements for broadcast applications with a direct linking to broadcast information or broadcast space-time; Broadcast-related systems
    • H04H60/27Arrangements for recording or accumulating broadcast information or broadcast-related information
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04HBROADCAST COMMUNICATION
    • H04H60/00Arrangements for broadcast applications with a direct linking to broadcast information or broadcast space-time; Broadcast-related systems
    • H04H60/35Arrangements for identifying or recognising characteristics with a direct linkage to broadcast information or to broadcast space-time, e.g. for identifying broadcast stations or for identifying users
    • H04H60/37Arrangements for identifying or recognising characteristics with a direct linkage to broadcast information or to broadcast space-time, e.g. for identifying broadcast stations or for identifying users for identifying segments of broadcast information, e.g. scenes or extracting programme ID
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04HBROADCAST COMMUNICATION
    • H04H60/00Arrangements for broadcast applications with a direct linking to broadcast information or broadcast space-time; Broadcast-related systems
    • H04H60/56Arrangements characterised by components specially adapted for monitoring, identification or recognition covered by groups H04H60/29-H04H60/54

Definitions

  • This invention relates to determining playback points within a recording of media content.
  • Audio, video and audio-video are all forms of media content which may be broadcast, downloaded or supplied on a pre-recorded medium.
  • for downloaded or pre-recorded media content, the start and end of a given programme are easily defined.
  • downloaded audio-video data will usually comprise only the requested programme content, without any additional material beyond the requested programme.
  • a personal video recorder (PVR) typically uses DVB EITp/f asynchronous signalling in the broadcast stream to start and end the recording.
  • operational practice means that this signalling will typically precede the start of the programme by anything from a few tens of seconds to a few minutes.
  • the resulting recording may therefore include content such as a continuity announcement and perhaps some adverts or trailers before the desired programme begins. Use of this signalling to drive acquisition therefore assumes that the user will not mind the recording starting earlier than the start of the programme.
  • WO 01/41145 describes identifying first and second index marks on a video recording and comparing to first and second index marks of another recording to identify the recording and provide information of the recording to a user.
  • a similar technique is disclosed in WO 01/45130 which uses signatures of key frames within video content and compares these against previously stored signature and time codes to identify the recorded content.
  • EP 1827018 describes a method for identifying identical sections of video content contained within separate recordings.
  • US 2007/0058949 discloses a system that attempts to start recording a broadcast programme earlier than the scheduled time if a comparison of features from a broadcast signal suggests that a programme has started early.
  • WO 01/60061 discloses using signatures of key frames to try and determine start and stop times of a recording device for recording a broadcast programme in spite of a change of the broadcast time.
  • EP 1099220 discloses a system in which playback points are created during a recording, so that content may be located within the recording.
  • a consumer device such as a "set-top-box" can achieve better accuracy using a traditionally-recorded programme by providing access to additional information regarding the programme timing.
  • a set-top-box recording device can record a broadcast channel and start and stop a recording at rough points either side of a given programme, but then automatically start play at the beginning of the programme accurately when play of the programme is selected by a user.
  • the invention is defined in the independent claims to which reference is now directed with preferred features set out in the dependent claims.
  • the invention provides a method and apparatus for accurately determining a point within a recording by extracting and matching features in the recording with timed reference data.
  • features are extracted from a recorded broadcast of audio-video content and compared against reference data.
  • the reference data comprises a collection of distinctive pieces of data each representing a feature which can be found within the recorded content.
  • the reference data includes timing information for the features and can be created separately from the content or extracted from the content as broadcast. Each item of reference data has a corresponding time relative to the start of the programme.
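The reference data described above can be modelled as a list of feature records, each carrying a time relative to the start of the programme. A minimal Python sketch; the field names and example values are illustrative assumptions, not taken from the patent:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ReferenceFeature:
    """One distinctive feature with its time relative to the programme start."""
    kind: str    # e.g. "subtitle" or "shot_change"
    value: float # e.g. the subtitle's duration in seconds
    time: float  # seconds from the start of the programme

# Example reference set: three subtitles with their durations and start times.
reference = [
    ReferenceFeature("subtitle", value=2.4, time=0.0),
    ReferenceFeature("subtitle", value=1.8, time=5.2),
    ReferenceFeature("subtitle", value=3.1, time=9.0),
]
```

A per-programme set like this could be published by the broadcaster and looked up by programme identifier when a match is requested.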
  • the recorded content is analysed and data representing features of one or more types are extracted from the recording. A subset or all of the data representing the features is then matched with the reference data.
  • the timing information carried by the reference can then be used to determine a position within the recorded content, referred to as a playback point, such as the start of the programme within the recording, the beginning of a section, an interactive point or other such position.
  • Figure 1 is a schematic diagram showing broadcast media content, a recording and the desired portion for playback;
  • Figure 2 is a flow diagram of a process embodying the invention;
  • Figure 3 shows the matching process in a set-top-box;
  • Figure 4 shows the matching process at a separate web service;
  • Figure 5 shows an extraction and matching process for subtitles;
  • Figure 6 shows a matching correlation process;
  • Figure 7 shows how timing data may be extracted from a transport stream; and
  • Figure 8 shows the process for identifying start and stop points of subtitles within a transport stream.
  • Embodiments of the present invention provide an apparatus and method for accurately determining a playback point within recorded media content.
  • the embodiments will be primarily described with reference to broadcast audio-video television programmes, but could apply equally to audio-only, video-only or other media content having a timeline.
  • the main application of the embodiments of the invention is to identify the start point of a television programme, or parts of a programme, within content recorded from a broadcast stream.
  • no new equipment is required to be installed in the broadcast encoding and distribution system.
  • a consumer device is arranged to extract existing latent features from the recorded media content itself for comparison with an "original" or "reference".
  • the solution is independent of the entire broadcast encoding and distribution system and requires no modifications to it.
  • the use of an "original" or "reference" of the same programme against which a recording is compared allows the start of a programme within a recording to be accurately determined using the timings of features within the recorded content.
  • the comparison process may be very efficient, as the comparison of timing information need only be made against the corresponding reference data for the given programme.
  • the identity of the programme itself is already known and recorded in the broadcast signal, such as using EITp/f signalling.
  • the system may then compare timings of features in the recorded content against timings in the corresponding reference data only, rather than attempting to search in data relating to many possible programmes.
  • Figure 1 is a schematic diagram showing a broadcast stream of a television channel including three programmes, as well as the existing timing signalling within the available data, the portion recorded as a result and the actual desired start and finish points of the recorded content (being the television programme actually required).
  • a broadcast stream of audio-video data 10 comprises television programmes broadcast one after another with additional material broadcast in between, such as adverts, trailers and the like.
  • the start and stop points of television programmes can be broadly identified, but only to an accuracy of tens of seconds, and possibly even minutes.
  • the stream of data 30 actually acquired and recorded therefore includes a desired programme "programme Y" but also trailers in advance of the programme and adverts after the programme.
  • the start and stop points are not sufficiently accurate to identify just the programme Y 40 being the desired playback of the recorded material.
  • the acquired asset 30 being the recorded media content includes additional material beyond the programme actually desired by the user. This is because recordings are typically created by a user selecting a programme to be recorded from an electronic programme guide and the recording is then started using the signalling as described above.
  • a user may have simply recorded a desired programme manually, but will typically press the start button in advance of the broadcast of the desired programme to ensure that the entire programme is captured.
  • This inaccuracy is a result of the lack of accuracy of the signalling or manual control, rather than inaccuracy in the recording device itself.
  • Recording devices such as personal video recorders, set-top-boxes, hard disk storage devices, personal computers and the like have the inherent ability to start playback with frame accuracy if the desired start point is known.
  • the embodying technique makes use of the fact that a recording device can extract features from recorded content either by use of functions implemented in software or dedicated hardware modules within the recording device.
  • the features may be any identifiable aspects of recorded content which have timings within the content. The timings may be start times, stop times, durations or the like, so that the temporal position of such features within recorded content can be identified.
  • Such features may be features that are explicitly included within a content stream, such as subtitles and shot changes, or could be features that are extracted from the content, such as video frame complexity, video motion or audio pitch or amplitude. Common to all such features is that they can be identified at particular temporal positions within a content stream and compared to reference data. The choice of features to be extracted depends upon the capabilities of the recording device.
  • the features may be ones that are easily identified such as subtitles, for which it is easy to measure start and stop points and duration. This information is available from the typical way in which a broadcast stream is encoded, and so little processing power is required to extract this information.
  • a recording device having greater processing power can extract features that are not explicitly stated in the encoded content stream.
  • the features so derived may also be an indirect measure of some visible or audible characteristic, for example a probability of the existence of an aspect within the recording, such as a probability of a shot change. Such more complex features may be compared against a reference in exactly the same way as features that are explicitly encoded in the recorded content.
  • the main use of the embodying technique is to find the start point of a given programme.
  • the user's set-top-box or similar recording device may then automatically play from the start of a programme when the programme is selected from a menu of recorded programmes, instead of playing from the start of the recording which may include extraneous material.
  • the recorded content may be in a variety of formats, but the main example is the MPEG2 Transport Stream.
  • the MPEG2 Transport Stream is a container which allows the combination of audio, video and subtitles in a manner which is easy for the client to decode when streamed.
  • a Transport Stream consists of a series of fixed-size packets. Each of these Transport Stream (TS) packets belongs to one of the components (such as audio or video) of the stream.
  • a header indicates which stream it belongs to and may be followed by an adaptation field optionally carrying a clock (the Programme Clock Reference - PCR).
  • the source media (audio or video) is encoded and packaged into PES packets.
  • each PES packet contains a frame of media: in the case of video, one frame; for audio, a number of samples.
  • Each PES packet has a timestamp which indicates to the client the time at which it is to be presented to the viewer.
  • PES packets are split up into smaller pieces and placed within the payload of TS packets.
  • the start of a PES packet always begins in a new TS packet.
  • the TS packet containing the end of a PES packet has padding bytes added to the header as necessary.
  • TS packets from the different streams are multiplexed together, creating one stream containing all the packets, with the timing between the streams aligned so that players do not have to buffer too much information.
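The TS packet layout described above (fixed-size packets, a header identifying the stream, an optional adaptation field, then payload) can be walked with simple byte arithmetic. A minimal Python sketch of a PID filter, with error handling and resynchronisation omitted:

```python
TS_PACKET_SIZE = 188  # DVB Transport Stream packets are 188 bytes
SYNC_BYTE = 0x47

def ts_packets(data: bytes, pid: int):
    """Yield (payload_unit_start, payload) for packets on stream `pid`."""
    for off in range(0, len(data) - TS_PACKET_SIZE + 1, TS_PACKET_SIZE):
        pkt = data[off:off + TS_PACKET_SIZE]
        if pkt[0] != SYNC_BYTE:
            continue  # lost sync; a real demultiplexer would resynchronise
        pkt_pid = ((pkt[1] & 0x1F) << 8) | pkt[2]
        if pkt_pid != pid:
            continue
        pusi = bool(pkt[1] & 0x40)  # payload_unit_start_indicator
        afc = (pkt[3] >> 4) & 0x3   # adaptation_field_control
        payload_start = 4
        if afc in (2, 3):           # adaptation field present: skip it
            payload_start += 1 + pkt[4]
        if afc in (1, 3):           # payload present
            yield pusi, pkt[payload_start:]
```

A recording device would run a filter like this over the recording, keeping only packets of the subtitle stream's PID.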
  • the embodiments may apply equally to recordings which comprise separate streams of data which are then presented to a user based on timing information within each of the separate streams.
  • features may be extracted from one component, for example the video component, and used in the comparison to establish playback points for all components.
  • the embodiments also allow for the features to be extracted from one of the components and used to establish the playback points for a different one of the components.
  • the components used for finding the timing information need not even be presented during playback of the recording - this is the case if a subtitle stream is used to identify playback points, but the subtitle stream itself is not necessarily presented to a user on playback of the recording.
  • the embodying technique provides an offset for the recording as a whole and so can be used to identify the offset for a single multiplex stream of components or for separate streams of components.
  • the recorded media content comprises the components that make up a given audio, video or audio video data stream whether multiplexed or not.
  • the reference data may itself be a recording of the broadcast content.
  • the features used for comparison may be extracted from the reference content in the same or similar manner to the extraction from the recorded content.
  • the reference data comprises timing information for features that will be assembled into a broadcast stream.
  • subtitle information inherently has timing information when created and prior to insertion into a broadcast stream. This timing information may be used as the reference data.
  • both embodiments may use all the techniques described.
  • One approach is to have the broadcaster provide the reference information through an online service, and the recording device can choose which information it would like to correlate with the recording based on what is available. It can then make use of its own algorithms to try and match the features and calculate a start offset.
  • a more flexible approach is once again to have the broadcaster provide the reference information through an online service. Once the reference has been obtained by the device, it submits this, along with all the key data it extracted from the recording, to another online service. This service may be hosted by the broadcaster or a separate service provider.
  • the service can return an offset to the device allowing the device to calculate a start point.
  • the advantage of this method is that the subset-selection and matching algorithms can be altered and tweaked later without any change required on the device. Therefore, if a channel changes the type of data it broadcasts (adding new subtitles, etc.), the system can be updated easily.
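Under this second approach, the device only needs to package its programme identifier and extracted subtitle timings for the matching service and read an offset back. A sketch of that exchange, assuming an illustrative JSON payload shape (the field names `programmeId` and `offsetSeconds` are assumptions, not defined by the patent):

```python
import json

def build_match_request(programme_id: str, subtitle_timings) -> str:
    """Package the programme ID and extracted (start, stop) subtitle
    timings as JSON for submission to the matching web service."""
    return json.dumps({
        "programmeId": programme_id,
        "subtitles": [
            {"start": s, "stop": e, "duration": round(e - s, 3)}
            for s, e in subtitle_timings
        ],
    })

def apply_match_response(response: str) -> float:
    """Extract the offset (in seconds) returned by the matching service."""
    return float(json.loads(response)["offsetSeconds"])
```

Because the matching logic lives behind the service, the payload can grow new feature types without any change on the device, which is the flexibility argued for above.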
  • media content 101 is broadcast at step 102 as part of a broadcast channel and includes one or more programmes for which the start and stop points are desired.
  • the broadcast includes now/next triggers 103 (as already described in relation to Figure 1) and is then recorded at step 104 to a store at a user's recording device.
  • an acquired asset 30 has been recorded, as shown in Figure 1.
  • at an extraction step 105, features within the recording are extracted.
  • the media content 101 is held in a store 106 as a set of records, including things such as subtitles, bit rate and so on.
  • Timing reference data is then provided at a step 108 and then is matched at a comparison step 109 with features extracted from the recording.
  • an offset of the recorded broadcast programme can be calculated in relation to the timing reference data at a step 110, which also determines one or more playback points as a result of the calculation of the offset.
  • an advantage of comparing features extracted from the recording against timings of features from the reference data from which the original broadcast programme was derived is that the timing offset may be rapidly and accurately determined without the need for multiple comparisons against non-relevant data.
  • the comparison is directly between an identified recorded programme and the corresponding reference data.
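Once features in the recording have been paired with their reference counterparts, the offset calculation at step 110 reduces to a robust average of the per-pair time differences. A minimal sketch, assuming the pairing has already been done; the median is used so that a few wrongly matched pairs do not skew the result:

```python
from statistics import median

def recording_offset(matched_pairs):
    """Given (time_in_recording, time_in_reference) pairs for matched
    features, return the offset of the programme start within the
    recording. The median tolerates a few wrongly matched pairs."""
    if not matched_pairs:
        raise ValueError("no matched features")
    return median(rec - ref for rec, ref in matched_pairs)
```

For example, if three subtitles occur 93 s later in the recording than in the reference, the programme starts 93 s into the recording, and playback should begin there.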
  • a recording device such as the set-top-box 200, comprises functional blocks implemented as software or hardware, including means 203 for detecting now and next triggers and means 204 for recording the broadcast media to a store.
  • a means 205 for extracting features from the recording is provided, along with means 208 for receiving timing information from a reference source.
  • the set-top-box also includes means 209 for matching the features extracted from the recording with timing information from the reference source and means 210 for determining one or more playback points by calculating an offset of the recording from the reference timing data.
  • a broadcast media 301 is packaged for delivery as a stream 302, including a programme and is also provided as a recording of reference data 306 from which features may be extracted by an extraction block 307.
  • the extraction block may be provided by the broadcaster, but may also be provided by a separate entity such as a web service.
  • the set-top-box may store timing information for multiple programmes at block 208 and, in the matching process, may select the timing information for the corresponding reference source from a store.
  • the means for providing the timing information for the corresponding reference source may download the information at that time. Either way, the programme for which the start time is requested is identified so that the appropriate corresponding reference timing information is selected.
  • the identification of the programme is by extracting a programme identifier from the programme recorded, either as part of the extraction or by a separate means for providing the data identifying the content.
  • a second embodiment, in which the comparison is performed outside of the set-top-box, is shown in Figure 4. Like components are labelled with like numbers for convenience.
  • the set-top-box 200 has a now and next trigger module 203 and means 204 for recording media and means 205 for extracting features from the completed recording.
  • a broadcaster 300 has a recording of the media 301 which is assembled as a broadcast programme 302, as well as provided as a set of records 306 which is provided to the module 307 for extracting features from the records.
  • the timing information from the reference data is not provided to the set-top-box, but instead is provided to a means 408 for providing the reference timing information, such as a web service.
  • This is provided to a matching block 409 which provides the means for comparing timings.
  • the set-top-box means 205 for extracting features includes means for delivering the timings of the features to the web service feature-matching block 409, which then provides the comparison to a calculation block 410 providing means for determining an offset and the playback points.
  • the means 409 for comparing and the means 410 for calculating/determining the offset together may be a web service 411, separate from the web service 408, which provides back to the set-top-box timing information sufficient for the set-top-box to determine the start point of the required television programme.
  • the web service 411 can either provide an offset from which the set-top-box 200 can then determine the start point, or could provide an explicit start point to the set-top-box.
  • the features that can be matched should be easily extractable from the recorded media and be quite easily compared to the original media.
  • the same principle could also be applied to other features that can be measured on both the receiver and from the original media.
  • Suitable features could be:
  • the consumer submits the programme ID and the extracted feature(s) to a broadcaster-specific web service.
  • the web service looks up the predetermined records of features (subtitles, shot changes, histograms, etc.) for that programme.
  • the programme ID is one example of data identifying the content for which the comparison is required. Other identifiers are possible, but the programme ID is preferred as it explicitly indicates the programme for which the start time is desired.
  • the web service finds the set of reference features that most likely correspond to the set of features extracted, and returns the best fit time offset between the timings extracted from the recording and the reference timings.
  • the correlation technique should preferably be selected to tolerate small differences between the feature sequences.
  • the set-top-box can determine accurately the programme's start time within the recording. Knowing the duration of the programme, an accurate end time can also be calculated.
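A correlation that tolerates small differences between the feature sequences, as suggested above, can be realised by trying each candidate offset implied by pairing a recorded feature with a reference feature of similar duration, and keeping the offset that aligns the most features. A brute-force Python sketch; the tolerance values are illustrative:

```python
def best_fit_offset(recorded, reference, dur_tol=0.1, time_tol=0.5):
    """recorded/reference: lists of (start_time, duration) for features.
    Return the offset (recorded start minus reference start) that aligns
    the largest number of features, or None if nothing matches."""
    best_offset, best_score = None, 0
    for r_start, r_dur in recorded:
        for f_start, f_dur in reference:
            if abs(r_dur - f_dur) > dur_tol:
                continue  # durations too different to be the same feature
            candidate = r_start - f_start
            # score: how many feature pairs align under this offset
            score = sum(
                1
                for a, da in recorded
                for b, db in reference
                if abs(da - db) <= dur_tol
                and abs((a - b) - candidate) <= time_tol
            )
            if score > best_score:
                best_offset, best_score = candidate, score
    return best_offset
```

The tolerances absorb small timing drift between broadcast and reference; a production service would also reject best fits whose score is too low to be trustworthy.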
  • the preferred features used in either of the embodiments of the invention described are subtitle start, stop and duration timing information. This will now be described in relation to the second approach as shown in Figure 5.
  • the general process is as follows.
  • the set-top-box extracts the start, change or stop signals of all or a subset of the subtitles.
  • the subset could be the first 30 or so subtitles, or at least the first 5 minutes' worth, to ensure that the start of the desired programme is included.
  • the set-top-box then submits these times to a web service operated by the content provider.
  • the web service downloads a representation of the subtitle reference file for that programme, and parses the file to determine the timing of the subtitles.
  • the web service attempts to match the subtitles in a way that minimizes the error between the two data sets.
  • the matching process attempts to match the subtitles by correlating their durations. An alternative is correlating the times between one subtitle and the next.
  • the web service returns the best estimate of the time offset between the sets of timings. This gives the time offset within the recording for the start of the programme.
  • the set-top-box plays back the file from the resulting time offset. If the calculated time offset is negative, this indicates that the first part of the programme is missing in the recording, e.g. because the recording started late. In this case, the set-top-box plays back the file from the beginning.
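The client-side portion of the process above can be sketched as follows. This is a minimal illustration; the function names and the tuple layout are assumptions, not taken from the document, and a real set-top-box would extract the timings from the broadcast stream and call the broadcaster's web service.

```python
# Sketch of the client-side steps: derive on-screen durations from the
# extracted subtitle timings, and clamp the offset returned by the web
# service. Names and data layout are illustrative only.

def subtitle_durations(events):
    """events: list of (start_pts_seconds, stop_pts_seconds) pairs
    extracted from the recorded subtitle stream."""
    return [round(stop - start, 3) for start, stop in events]

def resolve_play_point(offset_seconds):
    """A negative offset indicates the first part of the programme is
    missing (the recording started late), so play from the beginning."""
    return max(0.0, offset_seconds)
```

The duration sequence is what would be submitted for matching; the clamp implements the negative-offset rule stated above.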
  • the specific operation in relation to MPEG2 is as follows:
  • the recording device parses 505 the Program Map Table data in the recording to identify on what stream the subtitle data can be found.
  • the device parses the recording byte-by-byte, processing packets of the subtitle stream only. When it finds a payload unit start indicator, signifying the start of a PES packet, it begins to extract the PES packet.
  • the device parses the PES packet taking note of the presentation timestamp for that packet.
  • the device then examines the contents of the PES packet to extract the subtitle segments contained within.
  • the device determines whether the PES packet contains the beginning of a new subtitle, an update of subtitles on screen, or the end of a subtitle by identifying how many regions are listed in the page composition segment, which of these contain subtitle objects, and whether there is a version change for the page or for any region.
  • the device continues parsing the subtitle stream until it has extracted the required number of subtitles for comparison.
  • the equivalent reference is found by parsing 507 a subtitle file 506.
  • This can be in an EBU STL format or a plain-text format such as DFXP.
  • the web service providing the reference can parse the file and extract the on-screen duration of each subtitle frame.
  • the subset of subtitles in each data set are extracted 513, 514 from the larger set. Each is provided as a sequence of durations 515, 516.
  • the two subsets of data are matched 509 using a global alignment algorithm on the two sequences, such as the Needleman/Wunsch algorithm.
  • the best fit time offset between the two sequences is then determined by calculating the mean of the time offsets between the recording and the reference for all matching subtitles.
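The align-then-average matching just described can be sketched as follows. The Needleman/Wunsch scoring values, the duration tolerance and all function names are illustrative assumptions; the document names the algorithm but not its parameters.

```python
# Hedged sketch of the server-side matching: globally align the two
# subtitle duration sequences with Needleman/Wunsch, then return the
# mean time offset over the matched subtitles.

def align(a, b, gap=-1.0, tol=0.08):
    """Return (i, j) index pairs matching durations a[i] to b[j]."""
    def s(x, y):                      # near-equal durations match well
        return 2.0 if abs(x - y) <= tol else -2.0
    n, m = len(a), len(b)
    score = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(score[i - 1][j - 1] + s(a[i - 1], b[j - 1]),
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    pairs, i, j = [], n, m            # traceback from the bottom-right
    while i > 0 and j > 0:
        if score[i][j] == score[i - 1][j - 1] + s(a[i - 1], b[j - 1]):
            pairs.append((i - 1, j - 1)); i -= 1; j -= 1
        elif score[i][j] == score[i - 1][j] + gap:
            i -= 1
        else:
            j -= 1
    return list(reversed(pairs))

def best_fit_offset(rec_times, rec_durs, ref_times, ref_durs):
    """Mean offset between recording and reference over matched pairs."""
    pairs = [(i, j) for i, j in align(rec_durs, ref_durs)
             if abs(rec_durs[i] - ref_durs[j]) <= 0.08]
    if not pairs:
        return None
    return sum(rec_times[i] - ref_times[j] for i, j in pairs) / len(pairs)
```

With a recording whose first subtitle belongs to a trailer, the alignment skips it and the matched subtitles all yield the same offset, which the mean preserves.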
  • Another solution could be interpreting the broadcast subtitles with optical character recognition and including the recognised characters in the data used for the correlation. Using this data, alignment can be done with a much smaller subset, comparing the text within the subtitles, and using edit distance as a measure of similarity.
Shot changes
  • An alternative technique which may be used instead of subtitle detection or in combination with subtitle detection is to use shot changes as identifiable features within the recorded content.
  • a video shot change is when a continuous sequence of frames ends and a new scene is abruptly presented in the next frame. Shot changes can be identified within a broadcast recording by a variety of techniques.
  • in MPEG2, continuous sequences of frames are compressed using inter-frame algorithms.
  • at a shot change, inter-frame coding cannot be used and so the presence of a shot change is inherently shown by a difference in the coding technique.
  • the set-top-box may analyse the video stream and determine either the definitive existence of a shot change or a measure of the likelihood of a shot change along with timing information for the shot changes. This can be provided to a web service or reference data received from a web service in the same manner as described above, to establish a timing offset and consequently playback points within the video stream.
  • the device parses the Program Map Table data in the recording to identify on what stream the video data can be found.
  • the device parses through the file, byte-by-byte, identifying intra-coded macro-blocks in B-frames. It takes note of the presentation timestamp values of any B-frames which have a high number of intra-coded macro-blocks (corresponding to a likely shot change).
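The B-frame heuristic above can be sketched as follows. The data layout and the 50% threshold are illustrative assumptions; the text only says "a high number of intra-coded macro-blocks".

```python
# Inter-frame prediction fails across a shot boundary, so a B-frame
# with a high proportion of intra-coded macro-blocks marks a likely cut.

def likely_shot_changes(b_frames, ratio_threshold=0.5):
    """b_frames: list of (pts_seconds, intra_coded_blocks, total_blocks)
    tuples, one per B-frame. Returns the timestamps of likely cuts."""
    return [pts for pts, intra, total in b_frames
            if total and intra / total > ratio_threshold]
```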
  • a corresponding reference data set is created either by processing the original media using a standard shot change detection algorithm, or by extracting shot change information from the edit decision list used to create the programme.
  • a subset of each of the two data sets is produced, with a series of timestamp values presented as a sequence.
  • An alignment algorithm is used, similar to that of the subtitles (Needleman/Wunsch or other).
  • the start time is provided as an offset calculated by the presentation timestamp of a shot change, minus the media time of the equivalent reference shot change.
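The "standard shot change detection algorithm" applied to the original media is not specified. One common choice, a luminance histogram difference between consecutive frames, can be sketched as follows; the bin count, threshold and data layout are illustrative assumptions.

```python
# Reference-side shot change detection sketch: flag a cut when the
# normalised histogram difference between consecutive frames is large.

def histogram(frame, bins=8):
    """frame: flat list of 8-bit luma samples for one frame."""
    h = [0] * bins
    for v in frame:
        h[v * bins // 256] += 1
    return h

def shot_change_times(frames, timestamps, threshold=0.5):
    """Return the timestamps where the histogram difference (0..1)
    between a frame and its predecessor exceeds the threshold."""
    changes = []
    prev = histogram(frames[0])
    for frame, t in zip(frames[1:], timestamps[1:]):
        cur = histogram(frame)
        diff = sum(abs(a - b) for a, b in zip(prev, cur)) / (2 * len(frame))
        if diff > threshold:
            changes.append(t)
        prev = cur
    return changes
```

The resulting timestamp sequence plays the same role as the reference subtitle timings in the earlier embodiment.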
  • a further approach for use in any of the previous embodiments is to use motion vectors as a characteristic.
  • the presence of motion vectors can be used as a characteristic in its own right irrespective of whether or not shot changes are identified. This is an example of features that are not explicitly stated within a transport stream but are inherent and not visibly presented to a viewer, but nonetheless have accompanying timing data to allow comparison to a reference source.
  • a specific approach is as follows.
  • the device parses the Program Map Table data in the recording to identify on what stream the video data can be found.
  • the device parses through the video stream within the file, taking note of the motion vectors it finds on each predicted frame.
  • the motion vectors for each block where motion exists are then averaged into a single motion vector for each final frame.
  • This same process is performed by the broadcaster on a similar encoding of the original media, providing a reference data set.
  • a subset of each of the two data sets is produced, with a series of vectors presented as a sequence.
  • An alignment algorithm is used, similar to that of the subtitles (Needleman/Wunsch or other).
  • the start time is provided as an offset calculated by the presentation timestamp of the start of the matched sequence, minus the media time of the equivalent.
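The per-frame averaging described above can be sketched as follows; the data layout is a hypothetical simplification of a decoded motion vector field.

```python
# Reduce the block motion vectors of each predicted frame to a single
# mean vector, giving the sequence that is aligned against the reference.

def frame_motion(block_vectors):
    """block_vectors: list of (dx, dy) vectors for the blocks of one
    predicted frame; blocks with no motion are omitted."""
    if not block_vectors:
        return (0.0, 0.0)
    n = len(block_vectors)
    return (sum(dx for dx, _ in block_vectors) / n,
            sum(dy for _, dy in block_vectors) / n)

def motion_signature(frames):
    """One average vector per frame, in presentation order."""
    return [frame_motion(f) for f in frames]
```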
  • a variety of possible comparison algorithms may be used in the means for comparing timings against a reference data set.
  • An example is shown in Figure 6.
  • the algorithm in this example may take the decoded subtitle durations from the recorded data and compare them to reference durations. The algorithm attempts to match each feature sequentially, finding an overall minimum error between the recorded and reference sequences, possibly skipping a feature in either set if such a gap results in a better overall match.
  • the first subtitle is not part of the desired programme and so the alignment algorithm will use the second subtitle onwards for the timing reference.
  • the outcome is that an offset of 40 seconds is determined based upon the input data shown.
  • an algorithm which can take input parameters is used to allow configuration such as whether there would be a high expectation of data to be missing and to provide scoring appropriately.
  • Such customisation could be based on feedback of the accuracy of results, in accordance with standard algorithm techniques.
  • the algorithm may use feature data representing different characteristics alone or in combination, for example using subtitles, shot changes and audio information in combination. The comparison could also decide just to use one or more of these features, so that each technique is given a confidence measure and the one with the highest measure of confidence would be selected.
Timing Data
  • the MPEG2 transport stream introduced above will now be described in more detail for completeness with reference to Figures 7 and 8 to show how timing information may be analysed and used.
  • the transport stream comprises packets with a payload and packet headers, with each packet carrying a media stream (e.g. video, audio or subtitles) having a payload comprising part of a PES packet having a PES header and a PES payload.
  • the PES payload carries one or more segments each of which has a segment type which can be used to determine whether segment data contains page information (page composition segment) or region information (region composition segment) or other information. This can be used to determine the start or stop of a subtitle.
  • the PES header includes a presentation time stamp to determine time within the stream.
  • the segment data contains information about the visible regions and visible objects.
  • the process for analysis of such a transport stream is shown in Figure 8.
  • on receipt of a new segment 601, the segment is stored at a step 602 together with other segments having the same presentation time stamp.
  • at determining step 603 it is determined whether a page version or region version has changed since the last presentation time stamp. If yes, it is determined at step 604 whether a page segment contains regions and, if yes, whether there were any visible regions containing objects at step 605. If not, this denotes the end of a subtitle. If yes, a determination is made at step 606 as to whether previous subtitles are on screen. If not, this denotes the start of a subtitle. If yes, then this denotes the end of one subtitle and the start of another. In this way, a transport stream can be analysed to determine start and stop times of subtitles based on the information in the transport stream. The same technique may be used for other characteristic features.
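The decision logic of steps 603 to 606 can be sketched as a small classifier. The boolean inputs are a hypothetical simplification of what a real parser would derive from the page and region composition segments at one presentation time stamp.

```python
# Sketch of the Figure 8 decision steps: what subtitle events does one
# presentation time stamp's worth of segments imply?

def classify(version_changed, has_regions, visible_objects, on_screen):
    """Return [] (no change), ["start"], ["end"], or ["end", "start"]."""
    if not version_changed:
        return []
    if not has_regions or not visible_objects:
        # nothing left visible: any subtitle currently on screen ends
        return ["end"] if on_screen else []
    # something visible: start a subtitle, ending the previous one first
    return ["end", "start"] if on_screen else ["start"]

def subtitle_times(events):
    """events: list of (pts_seconds, classify(...) result); returns the
    (start, stop) pairs of completed subtitles."""
    spans, open_start = [], None
    for pts, evts in events:
        if "end" in evts and open_start is not None:
            spans.append((open_start, pts))
            open_start = None
        if "start" in evts:
            open_start = pts
    return spans
```

The resulting (start, stop) pairs yield the durations used in the matching step.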

Abstract

An apparatus for determining playback points such as the start of a television programme within recorded media content extracts timings of features from within the content and compares these against reference data. The reference data is retrieved using an identifier of the content so that the reference data compared is the reference data relevant to the content. The comparison of the timings of the features allows the relative offset between the recorded content and the reference data to be determined and thereby the start of a television programme within the recording to be determined so that the apparatus can automatically start playing from the beginning of the programme. The features for which timings are used may include shot changes, subtitles and other characteristic features within the recording.

Description

METHOD AND APPARATUS FOR DETERMINING PLAYBACK POINTS IN RECORDED MEDIA CONTENT
BACKGROUND OF THE INVENTION
This invention relates to determining playback points within a recording of media content.
Audio, video and audio-video are all forms of media content which may be broadcast, downloaded or supplied on a pre-recorded medium. In the case of downloaded or pre-recorded media content, the start and end of a given programme is easily defined. For example, downloaded audio-video data will usually comprise only the programme content requested without any additional data beyond the requested programme.
By contrast, traditional recording of a broadcast using a personal video recorder (PVR) generally makes use of asynchronous signalling in the broadcast stream to start and end the recording, e.g. DVB EITp/f. Whilst such signalling could be used to signal the start of a programme to within a few seconds, operational practice means that typically it will precede the start of the programme by anything from a few tens of seconds to a few minutes. The resulting recording may therefore include content such as a continuity announcement and perhaps some adverts or trailers, before the desired programme begins. So use of this signalling to drive acquisition makes the assumption that the user will not mind the recording starting earlier than the start of the programme.
Existing methods allow for additional timing information to be broadcast with the media content (e.g. DVB Synchronized Auxiliary Data). It would also be possible for broadcasters to install equipment to record the start and end times of programmes with reference to the inherent clock for the broadcast stream (e.g. MPEG-2 TS 'PCR') and provide these to the client. However, both these methods would require significant investment by broadcasters to install and integrate new equipment in mission critical broadcast encoding and distribution systems, which would be even more costly for broadcasters with many regional variants of their channels. Various techniques are known for analysing and identifying content using various characteristics. WO 01/41145, for example, describes identifying first and second index marks on a video recording and comparing to first and second index marks of another recording to identify the recording and provide information of the recording to a user. A similar technique is disclosed in WO 01/45130 which uses signatures of key frames within video content and compares these against previously stored signature and time codes to identify the recorded content. EP 1827018 describes a method for identifying identical sections of video content contained within separate recordings.
In addition to the above, techniques are known for controlling video recorders. US 2007/0058949, for example, discloses a system that attempts to start recording a broadcast programme earlier than the scheduled time if a comparison of features from a broadcast signal suggests that a programme has started early. Similarly, WO 01/60061 discloses using signatures of key frames to try and determine start and stop times of a recording device for recording a broadcast programme in spite of a change of the broadcast time. Lastly, EP 1099220 discloses a system in which playback points are created during a recording, so that content may be located within the recording.
SUMMARY OF THE INVENTION
We have appreciated the need to improve the accuracy of determining playback points within an audio and/ or video recording, such as a start or stop point within the recorded content. In particular, we have appreciated the need to determine the start of a programme with improved accuracy.
We have further appreciated that a consumer device such as a "set-top-box" can achieve better accuracy using a traditionally-recorded programme by providing access to additional information regarding the programme timing.
We have further appreciated that none of the known systems or methods take an appropriate approach to recording and subsequently playing back at the start of a programme within a recording. Using the invention, a set-top-box recording device can record a broadcast channel and start and stop a recording at rough points either side of a given programme, but then automatically start play at the beginning of the programme accurately when play of the programme is selected by a user. The invention is defined in the independent claims to which reference is now directed with preferred features set out in the dependent claims. In broad terms, the invention provides a method and apparatus for accurately determining a point within a recording by extracting and matching features in the recording with timed reference data. In an embodiment of the invention, at a consumer device features are extracted from a recorded broadcast of audio-video content and compared against reference data.
The reference data comprises a collection of distinctive pieces of data each representing a feature which can be found within the recorded content. The reference data includes timing information for the features and can be created separately from the content or extracted from the content as broadcast. Each item of reference data has a corresponding time relative to the start of the programme. The recorded content is analysed and data representing features of one or more types are extracted from the recording. A subset or all of the data representing the features is then matched with the reference data. The timing information carried by the reference can then be used to determine a position within the recorded content, referred to as a playback point, such as the start of the programme within the recording, the beginning of a section, an interactive point or other such position.
BRIEF DESCRIPTION OF THE DRAWINGS
Embodiments of the invention will now be described in more detail by way of example with reference to the drawings, in which:
Figure 1 is a schematic diagram showing broadcast media content, a recording and the desired portion for playback;
Figure 2 is a flow diagram of a process embodying the invention;
Figure 3 shows the matching process in a set-top-box;
Figure 4 shows the matching process at a separate web service;
Figure 5 shows an extraction and matching process for subtitles;
Figure 6 shows a matching correlation process;
Figure 7 shows how timing data may be extracted from a transport stream;
and
Figure 8 shows the process for identifying start and stop points of subtitles within a transport stream.
DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS
Embodiments of the present invention provide an apparatus and method for accurately determining a playback point within recorded media content. The embodiments will be primarily described with reference to broadcast, audio-video television programmes, but could apply equally to audio only, video only or other media content having a time line. The main application of the embodiments of the invention is to identify the start point of a television programme, or parts of a programme, within content recorded from a broadcast stream.
In the embodiment of the invention, no new equipment is required to be installed in the broadcast encoding and distribution system. Instead, a consumer device is arranged to extract existing latent features from the recorded media content itself for comparison with an "original" or "reference". In this way, the solution is independent of the entire broadcast encoding and distribution system and requires no modifications to it. The use of an "original" or "reference" of the same programme against which a recording is compared allows the start of a programme within a recording to be accurately determined using the timings of features within the recorded content. The comparison process may be very efficient as the comparison of timing information needs only be made against the corresponding reference data for the given programme. The identity of the programme itself is already known and recorded in the broadcast signal such as using EITp/f signalling. The system may then compare timings of features in the recorded content against timings in the corresponding reference data only, rather than attempting to search in data relating to many possible programmes.
In order to determine the relative time position of the recorded and reference content, there is no need to compare the actual content of characteristics, simply the timing of the characteristics. The comparison of timings of only two characteristics, however, is unlikely to provide sufficient accuracy as the timing of two characteristics is unlikely to be unique within the reference content. The accuracy of determination will increase with the number of timings compared and, for this reason, a correlation is the preferred comparison. Figure 1 is a schematic diagram showing a broadcast stream of a television channel including three programmes, as well as the existing timing signalling within the available data, the portion recorded as a result and the actual desired start and finish points of the recorded content (being the television programme Y actually required). A broadcast stream of audio-video data 10 comprises television programmes broadcast one after another with additional material broadcast in between, such as adverts, trailers and the like. As shown by the EITp/f signalling 20, the start and stop points of television programmes can be broadly identified, but only to an accuracy of tens of seconds and possibly even minutes. The stream of data 30 actually acquired and recorded therefore includes a desired programme "programme Y" but also trailers in advance of the programme and adverts after the programme. The start and stop points are not sufficiently accurate to identify just the programme Y 40 being the desired playback of the recorded material. As shown in Figure 1, the acquired asset 30 being the recorded media content includes additional material beyond the programme actually desired by the user. This is because recordings are typically created by a user selecting a programme to be recorded from an electronic programme guide and the recording is then started using the signalling as described above.
Alternatively, a user may have simply manually recorded a desired programme, but typically will press the start button in advance of the broadcast of the desired programme to ensure that the entire programme is captured. This inaccuracy is a result of the lack of accuracy of the signalling or manual control, rather than inaccuracy in the recording device itself. Recording devices such as personal video recorders, set-top-boxes, hard disk storage devices, personal computers and the like, have the inherent ability to provide frame accuracy to start playback if the desired start point is known.
Embodying Technique
The embodying technique makes use of the fact that a recording device can extract features from recorded content either by use of functions implemented in software or dedicated hardware modules within the recording device. The features may be any identifiable aspects of recorded content which have timings within the content. The timings may be start times, stop times, durations or the like, so that the temporal position of such features within recorded content can be identified. Such features may be features that are explicitly included within a content stream, such as subtitles and shot changes, or could be features that are extracted from the content, such as video frame complexity, video motion or audio pitch or amplitude. Common to all such features is that they can be identified at particular temporal positions within a content stream and compared to reference data. The choice of features to be extracted depends upon the capabilities of the recording device. For a less powerful set-top-box device, the features may be ones that are easily identified such as subtitles, for which it is easy to measure start and stop points and duration. This information is available from the typical way in which a broadcast stream is encoded, and so little processing power is required to extract this information. A recording device having greater processing power can extract features that are not explicitly stated in the encoded content stream. The features so derived may also be an indirect measure of some visible or audible characteristic, for example a probability of the existence of an aspect within the recording, such as a probability of a shot change. Such more complex features may be compared against a reference in exactly the same way as features that are explicitly encoded in the recorded content.
The main use of the embodying technique is to find the start point of a given programme. The user's set-top-box or similar recording device may then automatically play from the start of a programme when the programme is selected from a menu of recorded programmes, instead of playing from the start of the recording which may include extraneous material.
Recorded Content
In the embodiments of the invention, the recorded content may be in a variety of formats, but the main example is MPEG2. This is a container which allows the combination of audio, video and subtitles in a manner which is easy for the client to decode when streamed. A Transport Stream consists of a series of fixed size packets. Each of these Transport Stream (TS) packets belongs to one of the components (such as audio or video) of the stream. A header indicates which stream it belongs to and may be followed by an adaptation field optionally carrying a clock (the Programme Clock Reference - PCR).
When creating a transport stream the source media (audio or video) is encoded and packaged into PES packets. Each PES packet contains a unit of media: in the case of video, one frame; for audio, a number of samples. Each PES packet has a timestamp which indicates to the client the time at which it is to be presented to the viewer.
These PES packets are split up into smaller pieces and placed within the payload of TS packets. The start of a PES packet always gets a new TS packet. To accommodate this the TS packet containing the end of a PES packet has padding bytes added to the header as necessary.
Finally the TS packets from the different streams (audio and video) are multiplexed together, creating one stream containing all the packets, with the timing between the streams aligned so that players do not have to buffer too much information.
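The packaging of PES packets into TS packets described above can be sketched as follows. The 188-byte TS packet size is standard MPEG-2, but the 4-byte header and the placement of stuffing bytes are deliberate simplifications: a real multiplexer carries a sync byte, PID, continuity counter and adaptation field in the header, and pads via the adaptation field.

```python
# Simplified sketch: split one PES packet across fixed-size TS packets.
# Each PES packet starts a fresh TS packet; the last one is padded.

TS_SIZE = 188       # standard MPEG-2 TS packet size
HEADER = 4          # simplified fixed header length

def packetize(pes: bytes):
    payload = TS_SIZE - HEADER
    packets = []
    for i in range(0, len(pes), payload):
        chunk = pes[i:i + payload]
        stuffing = b"\xff" * (payload - len(chunk))   # pad short chunk
        packets.append(b"\x47" + b"\x00" * (HEADER - 1) + stuffing + chunk)
    return packets
```

A 400-byte PES packet therefore occupies three TS packets, the last mostly stuffing.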
Whilst a multiplex stream such as MPEG2 is preferred, the embodiments may apply equally to recordings which comprise separate streams of data which are then presented to a user based on timing information within each of the separate streams.
In either approach, features may be extracted from one component, for example the video component, and used in the comparison to establish playback points for all components. The embodiments also allow for the features to be extracted from one of the components and used to establish the playback points for a different one of the components. The components used for finding the timing information need not even be presented during playback of the recording - this is the case if a subtitle stream is used to identify playback points, but the subtitle stream itself is not necessarily presented to a user on playback of the recording. The embodying technique provides an offset for the recording as a whole and so can be used to identify the offset for a single multiplex stream of components or for separate streams of components. In this sense, the recorded media content comprises the components that make up a given audio, video or audio video data stream whether multiplexed or not.
Reference Data
The reference data may itself be a recording of the broadcast content. The features used for comparison may be extracted from the reference content in the same or similar manner to the extraction from the recorded content. However, many features found within a broadcast stream are accurately separately identified prior to assembly of a broadcast stream and so it is preferred that the reference data comprises timing information for features that will be assembled into a broadcast stream. For example, subtitle information inherently has timing information when created and prior to insertion into a broadcast stream. This timing information may be used as the reference data.
Process and Components
The two main approaches to assigning responsibilities to different components will now be described as two separate embodiments. For the avoidance of doubt, both embodiments may use all the techniques described. One approach is to have the broadcaster provide the reference information through an online service, and the recording device can choose which information it would like to correlate with the recording based on what is available. It can then make use of its own algorithms to try and match the features and calculate a start offset. A more flexible approach is once again to have the broadcaster provide the reference information through an online service. Once the reference has been obtained by the device, it submits this along with all the key data it extracted from the recording on to another online service. This service may be hosted by the broadcaster or a separate service provider. It will choose features to compare based on which are available in both sets and will then use its own algorithm to correlate the data sets. Using the media time information of one of the features, the service can return an offset to the device allowing the device to calculate a start point. The advantage of this method is that the subset selection and matching algorithms can be altered and tweaked later without any change required on the device. Therefore if a channel changes the type of data it broadcasts (adding new subtitles, etc.) the system can be updated easily.
The overall process of the two main embodiments of the invention will now be described with reference to Figure 2. A media content 101 is broadcast at step 102 as part of a broadcast channel and includes one or more programmes for which the start and stop points are desired. The broadcast includes now/next triggers 103 (as already described in relation to Figure 1) and is then recorded at step 104 to a store at a user's recording device. At this stage, an acquired asset 30 has been recorded, as shown in Figure 1. At an extraction step 105 features within the recording are extracted. Separately from the broadcast and recording of the programme, the media content 101 is held in a store 106 as a set of records, including things such as subtitles, bit rate and so on. Features may be extracted from these records at step 107, the extraction process being as simple as taking data that is already in an appropriate form, as in the case of subtitles, or performing more complex calculations to derive information from the content. The timing reference data is then provided at a step 108 and matched at a comparison step 109 with features extracted from the recording. Lastly, an offset of the recorded broadcast programme can be calculated in relation to the timing reference data at a step 110 which also determines one or more playback points as a result of the calculation of the offset.
As previously described, an advantage of comparing features extracted from the recording against timing of features from reference data from which the original broadcast programme was derived is that the timing offset may be rapidly and accurately determined without the need for multiple comparisons against non-relevant data. At step 109, the comparison is directly between an identified recorded programme and the corresponding reference data.
An embodiment in which a set-top-box 200 comprises the means for comparing timings will now be described with reference to Figure 3. A recording device, such as the set-top-box 200, comprises functional blocks implemented as software or hardware, including means 203 for detecting now and next triggers and means 204 for recording the broadcast media to a store. A means 205 for extracting features from the recording is provided, along with means 208 for receiving timing information from a reference source. In this arrangement, the set-top-box also includes means 209 for matching the features extracted from the recording with timing information from the reference source and means 210 for determining one or more playback points by calculating an offset of the recording from the reference timing data.
At the head end, broadly identified as broadcaster 300, a broadcast media 301 is packaged for delivery as a stream 302, including a programme and is also provided as a recording of reference data 306 from which features may be extracted by an extraction block 307. The extraction block may be provided by the broadcaster, but may also be provided by a separate entity such as a web service. In this embodiment, the set-top-box may store timing information for multiple programmes at block 208 and, in the matching process, may select the timing information for the corresponding reference source from a store. Alternatively, at the point of requesting the comparison, the means for providing the timing information for the corresponding reference source may download the information at that time. Either way, the programme for which the start time is requested is identified so that the appropriate corresponding reference timing information is selected. The identification of the programme is by extracting a programme identifier from the programme recorded, either as part of the extraction or by a separate means for providing the data identifying the content.
A second embodiment in which the comparison is provided outside of the set-top-box is shown in Figure 4. Like components are labelled with like numbers for convenience. As previously described in relation to Figure 3, the set-top-box 200 has a now and next trigger module 203, means 204 for recording media and means 205 for extracting features from the completed recording. Also as previously described, a broadcaster 300 has a recording of the media 301 which is assembled as a broadcast programme 302, as well as being provided as a set of records 306 which is supplied to the module 307 for extracting features from the records.
In this embodiment, the timing information from the reference data is not provided to the set-top-box, but instead is provided to a means 408 for providing the reference timing information, such as a web service. This is provided to a matching block 409 which provides the means for comparing timings. The set-top-box means 205 for extracting features includes means for delivering the timings of the features to the web service feature matching block 409, which then provides the comparison to a calculation block 410 which provides means for determining an offset and the playback points. The means 409 for comparing and the means 410 for calculating/determining the offset together may form a web service 411 separate from the web service 408, which provides back to the set-top-box timing information sufficient for the set-top-box to then determine the start point of the required television programme. The web service 411 can either provide an offset from which the set-top-box 200 can then determine the start point, or could provide an explicit start point to the set-top-box.
As already described, the features that can be matched should be easily extractable from the recorded media and easily compared to the original media. Several types of data meet this requirement: subtitle frame durations, shot changes, and video motion at a given time. The same principle could also be applied to other features that can be measured both on the receiver and from the original media.
Suitable features could be:
• Subtitle intervals or durations
• Video frame complexity
• Shot changes
• Video motion at a given time
• Audio pitch or amplitude
• Any other feature that can be measured both on the receiver and from the original media.
Features could be used either individually or in combination.
In the embodiments described the consumer submits the programme ID and the extracted feature(s) to a broadcaster-specific web service. The web service looks up the predetermined records of features (subtitles, shot changes, histograms, etc.) for that programme.
The programme ID is one example of data identifying the content for which the comparison is required. Other identifiers are possible, but the programme ID is preferred as it explicitly indicates the programme for which the start time is desired.
Using correlation techniques, the web service finds the set of reference features that most likely correspond to the set of features extracted, and returns the best fit time offset between the timings extracted from the recording and the reference timings.
As the features present in the recording may have been altered slightly from the features in the original programme as a result of the broadcast encoding process or as a result of reception errors present when the recording was made, the correlation technique should preferably be selected to tolerate small differences between the feature sequences. By taking the best fit time offset, the set-top-box can determine accurately the programme's start time within the recording. Knowing the duration of the programme, an accurate end time can also be calculated.
Subtitle Data
The preferred features used in either of the embodiments of the invention described are subtitle start, stop and duration timing information. This will now be described in relation to the second approach as shown in Figure 5. The general process is as follows. The set-top-box extracts the start, change or stop signals of all or a subset of the subtitles. The subset could be, for example, the first 30 or so subtitles, or at least the first 5 minutes' worth, to ensure that the start of the desired programme is included. The set-top-box then submits these times to a web service operated by the content provider.
The web service downloads a representation of the subtitle reference file for that programme, and parses the file to determine the timing of the subtitles. The web service then attempts to match the two sets of subtitles in a way that minimizes the error between them, by correlating their durations. An alternative is to correlate the times between one subtitle and the next. The web service returns the best estimate of the time offset between the sets of timings, which gives the time offset within the recording for the start of the programme. The set-top-box plays back the file from the resulting time offset. If the calculated time offset is negative, this indicates that the first part of the programme is missing from the recording, e.g. because the recording started late. In this case, the set-top-box plays back the file from the beginning.
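Purely by way of illustration, the duration-correlation matching and the negative-offset rule described above can be sketched as follows. The function names and the 0.2-second duration tolerance are illustrative assumptions, not part of the described system:

```python
def best_offset(recorded, reference, tolerance=0.2):
    """Estimate the time offset between a recording and its reference.

    recorded, reference: lists of (start_time_s, duration_s) per subtitle.
    Subtitles are correlated by their durations; the offset returned is
    the mean start-time difference over the candidate alignment that
    matches the most subtitles (the tolerance is an assumed value).
    """
    best = (0, 0.0)  # (number of matched subtitles, mean offset)
    for i in range(len(recorded)):
        for j in range(len(reference)):
            # Pair the two sequences off from this anchor point onwards,
            # keeping only pairs whose durations agree within tolerance.
            offsets = [r[0] - f[0]
                       for r, f in zip(recorded[i:], reference[j:])
                       if abs(r[1] - f[1]) <= tolerance]
            if len(offsets) > best[0]:
                best = (len(offsets), sum(offsets) / len(offsets))
    return best[1]


def playback_point(offset_s):
    # A negative offset means the start of the programme is missing from
    # the recording, so playback begins at the start of the file.
    return max(0.0, offset_s)
```

For example, a recording whose subtitles all start 40 seconds later than in the reference yields an offset of 40 seconds, and playback begins 40 seconds into the file.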
The specific operation in relation to MPEG2 is as follows. The recording device parses 505 the Program Map Table data in the recording to identify on what stream the subtitle data can be found. The device parses the recording byte-by-byte, processing packets of the subtitle stream only. When it finds a payload unit start indicator, signifying the start of a PES packet, it begins to extract the PES packet. The device parses the PES packet, taking note of the presentation timestamp for that packet. The device then examines the contents of the PES packet to extract the subtitle segments contained within. The device then determines whether the PES packet contains the beginning of a new subtitle, an update of subtitles on screen, or the end of a subtitle by identifying how many regions are listed in the page composition segment, which of these contain subtitle objects, and whether there is a version change for the page or for any region. The device continues parsing the subtitle stream until it has extracted the required number of subtitles for comparison.
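The presentation-timestamp step of this parsing can be sketched as follows. This is a simplified illustration that assumes 188-byte packets with no adaptation fields and a PES header carrying a PTS; a full parser must also honour the adaptation_field_control bits and go on to parse the subtitle segments:

```python
def subtitle_pts_values(ts_bytes, subtitle_pid):
    """Scan MPEG-2 transport stream bytes and return the presentation
    timestamps (90 kHz ticks) of PES packets on the given subtitle PID.

    Simplified sketch: assumes no adaptation fields are present.
    """
    pts_values = []
    for pos in range(0, len(ts_bytes) - 187, 188):
        pkt = ts_bytes[pos:pos + 188]
        if pkt[0] != 0x47:                       # transport sync byte
            continue
        pusi = bool(pkt[1] & 0x40)               # payload unit start indicator
        pid = ((pkt[1] & 0x1F) << 8) | pkt[2]
        if pid != subtitle_pid or not pusi:
            continue
        payload = pkt[4:]                        # assumes no adaptation field
        # PES start code prefix 00 00 01; second flags byte signals a PTS.
        if payload[:3] == b"\x00\x00\x01" and payload[7] & 0x80:
            p = payload[9:14]                    # 5-byte encoded PTS
            pts = (((p[0] >> 1) & 0x07) << 30 | p[1] << 22 |
                   (p[2] >> 1) << 15 | p[3] << 7 | p[4] >> 1)
            pts_values.append(pts)
    return pts_values
```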
The equivalent reference is found by parsing 507 a subtitle file 506. This can be in an EBU STL format or a plain-text format such as DFXP. The web service providing the reference can parse the file and extract the on-screen duration of each subtitle frame.
A subset of subtitles in each data set is extracted 513, 514 from the larger set. Each is provided as a sequence of durations 515, 516. The two subsets of data are matched 509 using a global alignment algorithm on the two sequences, such as the Needleman/Wunsch algorithm. The best fit time offset between the two sequences is then determined by calculating the mean of the time offsets between the recording and the reference for all matching subtitles.
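The global alignment step can be sketched with a small Needleman/Wunsch implementation over the duration sequences. The scoring values (match +1, mismatch -1, gap -1) and the 0.2-second duration tolerance are illustrative assumptions:

```python
def align_durations(rec, ref, gap=-1.0, tol=0.2):
    """Needleman/Wunsch global alignment of two subtitle-duration
    sequences; returns (i, j) index pairs of aligned subtitles.
    Durations agreeing within `tol` seconds score as matches."""
    n, m = len(rec), len(ref)
    S = [[0.0] * (m + 1) for _ in range(n + 1)]     # score table
    for i in range(1, n + 1):
        S[i][0] = i * gap
    for j in range(1, m + 1):
        S[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            match = 1.0 if abs(rec[i-1] - ref[j-1]) <= tol else -1.0
            S[i][j] = max(S[i-1][j-1] + match, S[i-1][j] + gap, S[i][j-1] + gap)
    pairs, i, j = [], n, m                          # trace back best path
    while i > 0 and j > 0:
        match = 1.0 if abs(rec[i-1] - ref[j-1]) <= tol else -1.0
        if S[i][j] == S[i-1][j-1] + match:
            pairs.append((i - 1, j - 1)); i -= 1; j -= 1
        elif S[i][j] == S[i-1][j] + gap:
            i -= 1
        else:
            j -= 1
    return list(reversed(pairs))


def mean_offset(rec_starts, ref_starts, pairs):
    """Best fit time offset: mean start-time difference over aligned pairs."""
    diffs = [rec_starts[i] - ref_starts[j] for i, j in pairs]
    return sum(diffs) / len(diffs)
```

An extra subtitle at the start of the recording (e.g. from the end of the previous programme) is skipped as a gap, and the offset is then the mean start-time difference over the aligned pairs.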
Another solution could be to interpret the broadcast subtitles with optical character recognition and include the recognised characters in the data used for the correlation. Using this data, alignment can be done with a much smaller subset, comparing the text within the subtitles and using edit distance as a measure of similarity.

Shot changes
An alternative technique which may be used instead of subtitle detection or in combination with subtitle detection is to use shot changes as identifiable features within the recorded content. A video shot change is when a continuous sequence of frames ends and a new scene is abruptly presented in the next frame. Shot changes can be identified within a broadcast recording by a variety of techniques. In MPEG2, continuous sequences of frames are compressed using inter-frame algorithms. When changing from a continuous sequence of frames to a different scene, inter-frame coding cannot be used and so the presence of a shot change is inherently shown by a difference in the coding technique. The set-top-box may analyse the video stream and determine either the definitive existence of a shot change or a measure of the likelihood of a shot change along with timing information for the shot changes. This can be provided to a web service or reference data received from a web service in the same manner as described above, to establish a timing offset and consequently playback points within the video stream.
One specific approach is as follows. The device parses the Program Map Table data in the recording to identify on what stream the video data can be found. The device parses through the file, byte-by-byte, identifying intra-coded macro-blocks in B-frames. It takes note of the presentation timestamp values of any B-frames which have a high number of intra-coded macro-blocks (corresponding to a likely shot change).
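This detection heuristic amounts to thresholding the fraction of intra-coded macro-blocks in each B-frame. A sketch, in which the 50% threshold is an assumed value:

```python
def likely_shot_changes(b_frames, threshold=0.5):
    """Return presentation timestamps of B-frames that are likely shot
    changes, i.e. whose fraction of intra-coded macro-blocks exceeds
    `threshold` (an assumed value).

    b_frames: list of (pts, intra_mb_count, total_mb_count) tuples, as
    would be gathered while parsing the video stream.
    """
    return [pts for pts, intra, total in b_frames
            if total and intra / total > threshold]
```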
A corresponding reference data set is created either by processing the original media using a standard shot change detection algorithm, or by extracting shot change information from the edit decision list used to create the programme. A subset of each of the two data sets is produced, with a series of timestamp values presented as a sequence.
An alignment algorithm is used, similar to that used for the subtitles (Needleman/Wunsch or other). The start time is provided as an offset calculated as the presentation timestamp of a shot change, minus the media time of the equivalent reference shot change.

Motion vectors
A further approach, for use in any of the previous embodiments, is to use motion vectors as a characteristic. The presence of motion vectors can be used as a characteristic in its own right, irrespective of whether or not shot changes are identified. This is an example of features that are not explicitly stated within a transport stream but are inherent in the encoding and not visibly presented to a viewer, yet nonetheless have accompanying timing data to allow comparison with a reference source.
A specific approach is as follows. The device parses the Program Map Table data in the recording to identify on what stream the video data can be found. The device parses through the video stream within the file, taking note of the motion vectors it finds in each predicted frame. The motion vectors for each block where motion exists are then averaged into a single motion vector for each frame.
This same process is performed by the broadcaster on a similar encoding of the original media, providing a reference data set.
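The per-frame averaging performed on both sides can be sketched as follows (the representation of a motion vector as an (x, y) tuple is an assumption of this sketch):

```python
def frame_motion(block_vectors):
    """Average the per-block motion vectors of one predicted frame into a
    single (x, y) vector, considering only blocks where motion exists."""
    moving = [(x, y) for x, y in block_vectors if (x, y) != (0, 0)]
    if not moving:
        return (0.0, 0.0)   # a static frame contributes a zero vector
    n = len(moving)
    return (sum(x for x, _ in moving) / n, sum(y for _, y in moving) / n)
```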
A subset of each of the two data sets is produced, with a series of vectors presented as a sequence. An alignment algorithm is used, similar to that used for the subtitles (Needleman/Wunsch or other). The start time is provided as an offset calculated as the presentation timestamp of the start of the matched sequence, minus the media time of the equivalent reference sequence.
Comparison Algorithms
A variety of possible comparison algorithms may be used in the means for comparing timings against a reference data set. An example is shown in Figure 6. The algorithm in this example may take the decoded durations of subtitles from the recorded data and compare them to reference durations. The algorithm attempts to match each feature sequentially, finding an overall minimum error between the recorded and reference sequences, possibly skipping a feature in either set if such a gap results in a better overall match. In the example shown, the first subtitle is not part of the desired programme and so the alignment algorithm will use the second subtitle onwards for the timing reference. In this example, the outcome is that an offset of 40 seconds is determined based upon the input data shown.
A variety of algorithms may be used. Preferably, an algorithm which can take input parameters is used, to allow configuration such as whether there is a high expectation of data being missing, and to score accordingly. In the example of subtitles, we would expect very few missing subtitles and so the absence of data may be penalised heavily, whereas we would expect many minor timing differences and so these should be tolerated by the algorithm. Such customisation could be based on feedback of the accuracy of results, in accordance with standard algorithm techniques. The algorithm may use feature data representing different characteristics alone or in combination, for example using subtitles, shot changes and audio information in combination. The comparison could also decide to use just one or more of these features, such that each technique is given a confidence measure and the one with the highest measure of confidence is selected.

Timing Data
The MPEG2 transport stream already described will now be described for completeness, with reference to Figures 7 and 8, to show how timing information may be analysed and used. The transport stream comprises packets with a payload and packet headers, each packet carrying a media stream (e.g. video, audio or subtitles) having a payload comprising part of a PES packet, which has a PES header and a PES payload. In the case of a PES packet carrying subtitle data, the PES payload carries one or more segments, each of which has a segment type which can be used to determine whether the segment data contains page information (page composition segment), region information (region composition segment) or other information. This can be used to determine the start or stop of a subtitle. The PES header includes a presentation time stamp to determine time within the stream. The segment data contains information about the visible regions and visible objects. The process for analysis of such a transport stream is shown in Figure 8. On receipt of a new segment 601, the segment is stored at step 602 with segments having the same presentation time stamp. At determining step 603, it is determined whether a page version or region version has changed since the last presentation time stamp. If yes, it is determined at step 604 whether the page segment contains regions and, if so, whether there were any visible regions containing objects at step 605. If not, this denotes the end of a subtitle. If yes, a determination is made at step 606 as to whether previous subtitles are on screen. If not, this denotes the start of a subtitle. If yes, then this denotes both the end of one subtitle and the start of another. In this way, a transport stream can be analysed to determine start and stop times of subtitles based on the information in the transport stream. The same technique may be used for other characteristic features.
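The decision logic of Figure 8 (steps 603 to 606) can be sketched as follows, with boolean inputs standing for the results of parsing the page and region composition segments. Treating a page with no regions the same as a page with no visible objects (i.e. as the end of a subtitle) is an assumption of this sketch:

```python
def classify_subtitle_event(version_changed, page_has_regions,
                            visible_objects, previously_on_screen):
    """Classify a group of segments sharing one presentation time stamp.

    Returns 'none', 'start', 'end', or 'end+start'.
    """
    if not version_changed:                          # step 603: no change
        return 'none'
    if not (page_has_regions and visible_objects):   # steps 604-605
        return 'end'                                 # nothing visible now
    if not previously_on_screen:                     # step 606
        return 'start'                               # new subtitle appears
    return 'end+start'                               # one replaces another
```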

Claims

1. Apparatus for determining playback points within recorded media content, comprising:
means for extracting from the recorded media content timings of features within the content;
means for comparing the timings of the features against reference data having timing information retrieved using an identifier of the media content; and means for determining one or more playback points within the recorded media content using the results of the comparison.
2. Apparatus according to claim 1, wherein the recorded media content comprises one or more of video, audio and subtitle streams.
3. Apparatus according to claim 1 or 2, wherein the timings comprise one or more of start times, stop times or durations of the features.
4. Apparatus according to any preceding claim, wherein the features comprise subtitles.
5. Apparatus according to any preceding claim, wherein the features comprise shot changes.
6. Apparatus according to any preceding claim, wherein the means for comparing comprises means for correlating.
7. Apparatus according to any preceding claim, wherein the means for comparing comprises means for matching.
8. Apparatus according to any preceding claim, wherein the means for comparing is arranged to compare timings of multiple types of features from the recorded media content against data for respective types of features represented in the reference data.
9. Apparatus according to any preceding claim, wherein the means for detecting is arranged to analyse media encapsulated in an MPEG-2 transport stream.
10. Apparatus according to any preceding claim, wherein the means for extracting comprises means for detecting features using information placed in the recorded media during the encoding process.
11. Apparatus according to claim 10, wherein the means for detecting is arranged to analyse encoded DVB subtitle data.
12. Apparatus according to any preceding claim, wherein the means for comparing comprises an algorithm arranged to compare all or a subset of the timings.
13. Apparatus according to claim 12, wherein the algorithm is arranged to selectively exclude a feature in either the recorded media content or the reference data if this results in a better overall match.
14. Apparatus according to any preceding claim, wherein the means for determining playback points includes means for determining an offset of at least one of the features from a start point.
15. Apparatus according to claim 14, wherein the start point is the start of a programme.
16. Apparatus according to claim 14 or 15, wherein the means for determining playback points comprises means for determining the start of a programme.
17. Apparatus according to any preceding claim, wherein the apparatus is a recording device and includes means for obtaining the reference data for the media content from a remote source.
18. Apparatus according to claim 17, wherein the apparatus is a set-top-box.
19. Apparatus according to any of claims 1 to 16, wherein the apparatus includes a recording device and a remote service, wherein the recording device includes the means for extracting and the remote service includes the means for comparing and means for determining.
20. Apparatus for determining playback points within recorded media content, comprising:
means for extracting from the recorded media content timings of features within the content;
means for transmitting the timings of features within the content and an identifier of the media content to a remote service for comparing the timings of the features against reference data having timing information retrieved using the identifier; and
means for receiving the results of the comparison from the remote service and for determining one or more playback points within the recorded media content using the results of the comparison.
21. A method for determining playback points within recorded media content, comprising:
extracting from the recorded media content timings of features within the content;
comparing the timings of the features against reference data having timing information retrieved using an identifier of the media content; and
determining one or more playback points within the recorded media content using the results of the comparison.
22. A method according to claim 21, wherein the recorded media content comprises one or more of video, audio and subtitle streams.
23. A method according to claim 21 or 22, wherein the timings comprise one or more of start times, stop times or durations of the features.
24. A method according to any of claims 21 to 23, wherein the features comprise subtitles.
25. A method according to any of claims 21 to 24, wherein the features comprise shot changes.
26. A method according to any of claims 21 to 25, wherein the comparing comprises correlating timings of the features against the reference data.
27. A method according to any of claims 21 to 26, wherein the comparing comprises matching timings of the features against the reference data.
28. A method according to any of claims 21 to 27, wherein the comparing compares timings of multiple types of features from the recorded media content against data for respective types of features represented in the reference data.
29. A method according to any of claims 21 to 28, wherein the extracting comprises detecting features using information placed in the recorded media during the encoding process.
30. A method according to claim 29, wherein the detecting comprises analysing encoded DVB subtitle data.
31. A method according to claim 29, wherein the detecting comprises analysing media encapsulated in an MPEG-2 transport stream.
32. A method according to any of claims 21 to 31, wherein the comparing comprises executing an algorithm to compare all or a subset of the timings.
33. A method according to claim 32, wherein the algorithm is arranged to selectively exclude a feature in either the recorded media content or the reference data if this results in a better overall match.
34. A method according to any of claims 21 to 33, wherein determining playback points includes determining an offset of at least one of the features from a start point.
35. A method according to claim 34, wherein the start point is the start of a programme.
36. A method according to claim 34 or 35, wherein determining playback points comprises determining the start of a programme.
37. A method according to any of claims 21 to 36, including obtaining the reference timing data for the media content from a remote source.
38. A method according to any of claims 21 to 36, including communicating with a remote service for the comparing and the determining.
39. A method according to any of claims 21 to 36, wherein the extracting is performed by a recording device and the comparing and determining is performed by a remote service.
40. A method for determining playback points within recorded media content, comprising:
extracting, in a recording device, from the recorded media content timings of features within the content;
transmitting the timings of features within the content and an identifier of the media content from the recording device to a remote service;
comparing, at the remote service, the timings of the features against reference data having timing information retrieved using the identifier; and
receiving, at the recording device, the results of the comparison from the remote service and determining one or more playback points within the recorded media content using the results of the comparison.
41. Apparatus according to any of claims 1 to 20, wherein the means for comparing is arranged to compare the timings and characteristics of the features.
42. A method according to any of claims 21 to 40, wherein the comparing includes comparing the timings and characteristics of the features.
43. Apparatus according to any of claims 1 to 20, further comprising means for automatically playing the recorded content from one of the one or more playback points.
44. A method according to any of claims 21 to 42, further comprising automatically playing the recorded content from one of the one or more playback points.
PCT/GB2011/000514 2010-04-01 2011-04-01 Method and apparatus for determining playback points in recorded media content WO2011121318A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
GB1005635.6 2010-04-01
GB201005635A GB2479711A (en) 2010-04-01 2010-04-01 Determining playback points in recorded media content

Publications (1)

Publication Number Publication Date
WO2011121318A1 true WO2011121318A1 (en) 2011-10-06

Family

ID=42228841

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/GB2011/000514 WO2011121318A1 (en) 2010-04-01 2011-04-01 Method and apparatus for determining playback points in recorded media content

Country Status (2)

Country Link
GB (1) GB2479711A (en)
WO (1) WO2011121318A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2017047540A1 (en) * 2015-09-16 2017-03-23 Sony Corporation Transmission device, transmission method, reproduction device, and reproduction method

Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP1099220A1 (en) 1998-07-20 2001-05-16 4TV House Identification of video storage media
WO2001041145A1 (en) 1999-11-30 2001-06-07 Koninklijke Philips Electronics N.V. Method and apparatus to identify sequential content stored on a storage medium
WO2001045130A2 (en) 1999-12-16 2001-06-21 Trusi Technologies, Llc Plasma generator ignition circuit
WO2001060061A2 (en) 2000-02-07 2001-08-16 Koninklijke Philips Electronics N.V. Methods and apparatus for recording programs prior to or beyond a preset recording time period
US20020120925A1 (en) * 2000-03-28 2002-08-29 Logan James D. Audio and video program recording, editing and playback systems using metadata
WO2006055971A2 (en) * 2004-11-22 2006-05-26 Nielsen Media Research, Inc Methods and apparatus for media source identification and time shifted media consumption measurements
US20070058949A1 (en) 2005-09-15 2007-03-15 Hamzy Mark J Synching a recording time of a program to the actual program broadcast time for the program
EP1827018A1 (en) 2004-12-03 2007-08-29 NEC Corporation Video content reproduction supporting method, video content reproduction supporting system, and information delivery program
EP1975938A1 (en) * 2007-03-31 2008-10-01 Sony Deutschland Gmbh Method for determining a point in time within an audio signal
US20080256115A1 (en) * 2007-04-11 2008-10-16 Oleg Beletski Systems, apparatuses and methods for identifying transitions of content
EP2061239A2 (en) * 2007-11-19 2009-05-20 Echostar Technologies Corporation Methods and apparatus for identifying video locations in a video stream using text data
US7555196B1 (en) * 2002-09-19 2009-06-30 Microsoft Corporation Methods and systems for synchronizing timecodes when sending indices to client devices

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2001045103A1 (en) * 1999-12-14 2001-06-21 Koninklijke Philips Electronics N.V. Method and apparatus to identify content stored on a storage medium


Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2017047540A1 (en) * 2015-09-16 2017-03-23 Sony Corporation Transmission device, transmission method, reproduction device, and reproduction method
CN108028949A (en) * 2015-09-16 2018-05-11 索尼公司 Transmitting device, transmission method, transcriber and reproducting method
JPWO2017047540A1 (ja) * 2015-09-16 2018-07-05 Sony Corporation Transmission device, transmission method, reproduction device, and reproduction method
US10511802B2 (en) 2015-09-16 2019-12-17 Sony Corporation Transmission device, transmission method, reproducing device and reproducing method
AU2016323754B2 (en) * 2015-09-16 2021-01-14 Sony Corporation Transmission device, transmission method, reproduction device, and reproduction method

Also Published As

Publication number Publication date
GB201005635D0 (en) 2010-05-19
GB2479711A (en) 2011-10-26

Similar Documents

Publication Publication Date Title
EP2506595B1 (en) A method for creating event identification data comprising a hash value sequence data and information specifying one or more actions related to a multimedia program content
US20210195280A1 (en) Apparatus, systems and methods for control of media content event recording
EP2549771B1 (en) Method and apparatus for viewing customized multimedia segments
US20070136782A1 (en) Methods and apparatus for identifying media content
US8620466B2 (en) Method for determining a point in time within an audio signal
US7738767B2 (en) Method, apparatus and program for recording and playing back content data, method, apparatus and program for playing back content data, and method, apparatus and program for recording content data
US20030095790A1 (en) Methods and apparatus for generating navigation information on the fly
US20100169911A1 (en) System for Automatically Monitoring Viewing Activities of Television Signals
EP2773108B1 (en) Reception device, reception method, program, and information processing system
WO2005041455A1 (en) Video content detection
US20100122279A1 (en) Method for Automatically Monitoring Viewing Activities of Television Signals
US20030048843A1 (en) Image information summary apparatus, image information summary method and image information summary processing program
WO2005057931A2 (en) Method and system for generating highlights
US20050060757A1 (en) Apparatus and method of broadcast service for transporting scene information
WO2011121318A1 (en) Method and apparatus for determining playback points in recorded media content
KR100626645B1 (en) Apparatus and method for detecting the boundary between different programs using NPT Reference Descriptor, and DTV receiving apparatus and method for recording the predicted program using its
JP2006080589A (en) Edit information sharing system

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 11716288

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 11716288

Country of ref document: EP

Kind code of ref document: A1