CN103098485A - Method and apparatus for encapsulating coded multi-component video - Google Patents

Method and apparatus for encapsulating coded multi-component video

Info

Publication number
CN103098485A
CN103098485A (application numbers CN2011800396198A, CN201180039619A)
Authority
CN
China
Prior art keywords
media data
document
media
extracts
data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN2011800396198A
Other languages
Chinese (zh)
Inventor
吴振宇
朱立华
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Thomson Licensing SAS
Original Assignee
Thomson Licensing SAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Thomson Licensing SAS filed Critical Thomson Licensing SAS
Priority claimed from PCT/US2011/040164 external-priority patent/WO2011159602A1/en
Publication of CN103098485A publication Critical patent/CN103098485A/en
Pending legal-status Critical Current


Abstract

A method and a device for encapsulating a media entity containing more than one layer into multiple component files, one per layer, are described, along with corresponding methods and devices for reading the component files. Extensions to the Extractor data structure of the SVC/MVC file formats are proposed. The extractor extensions of the invention enable NAL unit referencing across different component files. The present invention enables adaptive HTTP streaming of media entities.

Description

Method and apparatus for encapsulating coded multi-component video
Cross-reference to related applications
This application claims the benefit of priority of U.S. Provisional Patent Application Serial No. 61/354,422, entitled "Extension to the Extractor data structure of SVC/MVC file formats", filed June 14, 2010. The teachings of the above provisional patent application are expressly incorporated herein by reference.
This application is related to the commonly owned, co-pending U.S. Patent Application Serial No. ___, entitled "Method and Apparatus for Encapsulating Coded Multi-component Video" (Attorney Docket No. PU100141), filed concurrently with this application. The teachings of the above non-provisional patent application are expressly incorporated herein by reference.
Technical field
The present invention relates generally to HTTP streaming. More specifically, the present invention relates to encapsulating media entities of coded multi-component video streams, such as scalable video coding (SVC) streams and multi-view coding (MVC) streams, for HTTP streaming.
Background
In HTTP streaming applications, the coded video is often encapsulated at the server side and stored as files conforming to the ISO Base Media File Format (BMFF), such as MP4 files. Furthermore, to enable adaptive HTTP streaming, a file is usually divided into multiple movie fragments, and these fragments are further grouped into segments addressable by client URL requests. In practice, different coded representations of the video content are stored in these segments, so that a client can dynamically choose the desired representation to download and play during a session.
A coded layered video bitstream, such as an SVC or MVC bitstream, naturally supports such bitrate adaptation by enabling different operating points, i.e., representations, in terms of temporal/spatial resolution, quality, view, etc., through decoding different subsets of the bitstream. However, the existing ISO Base Media File Format (BMFF) standards, such as the MP4 file format, cannot provide separate access to each layer or representation and therefore cannot be applied to HTTP streaming applications. As shown in Fig. 1, in the MP4 file format the metadata of all layers or representations of a media file is stored in the moov movie box, and the media content data of all layers or representations is stored in the mdat box. In HTTP streaming, when a client requests one layer, the whole file has to be sent, because all layers or representations are multiplexed together and the client does not know where to find the needed layer or representation.
As will be seen later, in adaptive HTTP streaming applications it is desirable to be able to reference media data samples, such as network abstraction layer (NAL) units, across the boundaries of movie fragments or component files. In the SVC/MVC context, such referencing can be established by using a mechanism called an "extractor". An extractor is an in-file data structure defined in the SVC/MVC amendment to the AVC file format extension of the BMFF: Information technology - Coding of audio-visual objects - Part 15: Advanced Video Coding (AVC) file format, Amendment 2: File format support for Scalable Video Coding, 2008 (pages 15-17). Extractors are designed to enable the extraction of NAL units from other tracks by reference, without copying. Here, a track is a timed sequence of related samples in an ISO base media file; for media data, a track corresponds to a sequence of images or of sampled audio. The syntax of the extractor is shown below:
class aligned(8) Extractor () {
    NALUnitHeader();
    unsigned int(8) track_ref_index;
    signed int(8) sample_offset;
    unsigned int((lengthSizeMinusOne+1)*8) data_offset;
    unsigned int((lengthSizeMinusOne+1)*8) data_length;
}
The semantics of the extractor data structure are:
NALUnitHeader: the NAL unit structure as specified in ISO/IEC 14496-10 Annex G for NAL units of type 20:
nal_unit_type shall be set to the extractor NAL unit type (type 31).
forbidden_zero_bit, reserved_one_bit, and reserved_three_2bits shall be set as specified in ISO/IEC 14496-10 Annex G.
The other fields (nal_ref_idc, idr_flag, priority_id, no_inter_layer_pred_flag, dependency_id, quality_id, temporal_id, use_ref_base_pic_flag, discardable_flag, and output_flag) shall be set as specified in section B.4 of Information technology - Coding of audio-visual objects - Part 15: Advanced Video Coding (AVC) file format, Amendment 2: File format support for Scalable Video Coding, ISO/IEC 14496-15:2004/Amd.2:2008 (page 17).
track_ref_index specifies the index of the track reference of type "scal" used to find the track from which data is extracted. The sample in that track from which data is extracted is exactly aligned in time, or the closest preceding, on the media decoding timeline, i.e., using the time-to-sample table only, adjusted by the offset specified by the sample_offset of the sample containing the extractor. The first track reference has the index value 1; the value 0 is reserved.
sample_offset gives the relative index of the sample in the linked track that shall be used as the source of information. Sample 0 (zero) is the sample whose decoding time is the same as, or the closest preceding, the decoding time of the sample containing the extractor; sample 1 (one) is the next sample, sample -1 (minus 1) is the previous sample, and so on.
data_offset: the offset of the first byte within the referenced sample to copy. If the extraction starts with the first byte of data in that sample, the offset takes the value 0. The offset shall reference the beginning of a NAL unit length field.
data_length: the number of bytes to copy. If this field takes the value 0, the entire single referenced NAL unit is copied (i.e., the length to copy is taken from the length field referenced by the data offset, augmented by the additional_bytes field in the case of Aggregators).
Further details can be found in Information technology - Coding of audio-visual objects - Part 15: Advanced Video Coding (AVC) file format, Amendment 2: File format support for Scalable Video Coding, ISO/IEC 14496-15:2004/Amd.2:2008.
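As a concrete illustration of these semantics, the following Python sketch decodes the body of an extractor NAL unit. The byte layout assumed here (a 1-byte track_ref_index, a signed 1-byte sample_offset, then data_offset and data_length of lengthSizeMinusOne+1 bytes each) follows the field descriptions above; it is an illustrative sketch, not a normative ISO/IEC 14496-15 implementation.

```python
import struct

def parse_extractor_body(body: bytes, length_size: int = 4):
    """Decode the fields of an SVC extractor NAL unit body (the bytes
    following the NALUnitHeader), per the field descriptions above.
    Purely illustrative; not a normative implementation."""
    track_ref_index = body[0]  # 'scal' track reference index; 0 is reserved
    (sample_offset,) = struct.unpack_from(">b", body, 1)  # signed relative sample index
    data_offset = int.from_bytes(body[2:2 + length_size], "big")
    data_length = int.from_bytes(body[2 + length_size:2 + 2 * length_size], "big")
    return {"track_ref_index": track_ref_index,
            "sample_offset": sample_offset,
            "data_offset": data_offset,
            "data_length": data_length}
```

For example, a body beginning with the bytes 0x01 0xFF followed by two zero-valued 4-byte fields decodes to track reference 1, sample offset -1, and data_length 0, i.e., "copy the entire referenced NAL unit".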
Current extractors can only extract NAL units by reference from other tracks within the same movie box/fragment. In other words, extractors cannot be used to extract NAL units from a different segment or file. This constrains the use of extractors in the use case described above.
The prior art has not yet provided a complete solution to the above problems. It would be desirable to provide the ability to parse and encapsulate layers without sacrificing transmission speed and efficiency. No such result has heretofore been achieved in the prior art.
Summary of the invention
The present invention is directed to methods and apparatus for encapsulating component files from a media entity comprising more than one layer, and for reading such component files.
According to one aspect of the present invention, a method is provided for encapsulating and creating component files from a media entity comprising more than one layer. The method extracts, for each layer of the media entity, the metadata and the media data corresponding to the extracted metadata, and identifies references to additional media data related to the media data extracted for each layer. The references are embedded into the media data extracted for each layer. The extracted media data and metadata are associated so that, for each layer, a component file comprising the extracted metadata and the extracted media data can be created.
According to another aspect of the present invention, a file wrapper is provided. The file wrapper comprises: an extractor, which extracts, for each layer of the media entity, the metadata and the media data corresponding to the extracted metadata; a reference identifier, which identifies references to additional media data of the media entity related to the media data extracted for each layer; and a correlator, which embeds the references into the media data extracted for each layer and associates the extracted media data with the extracted metadata so that a component file can be created for each layer.
Brief description of the drawings
The above features of the present invention will become clearer from the following detailed description of exemplary embodiments of the present invention, given with reference to the accompanying drawings, in which:
Fig. 1 shows an example MP4 file format.
Fig. 2 shows an embodiment of an encapsulated media entity of the present invention.
Fig. 3 shows the structure of a wrapper for encapsulating or creating component files from a media entity comprising multiple layers/representations.
Fig. 4 shows an example of associating additional media data with component files based on dependencies.
Fig. 5 shows an example of extracting NAL units by reference from a movie box/fragment different from the one in which the extractor resides.
Fig. 6 shows the encapsulation operations of encapsulating an SVC/MVC-type video stream into multiple movie fragments using the inventive new extractor data structure.
Fig. 7 shows the structure of a file reader for reading component files.
Fig. 8 shows the process, involving the present invention, by which a video decoder reads an SVC/MVC-type video stream.
Detailed description of the embodiments
In the present invention, a media entity, such as a media file, a group of media files, or a media stream, is divided or encapsulated into multiple movie component files addressable by client URL requests. Here, "component file" is used in a broad sense, covering fragments, segments, files, and other terms equivalent to them.
In one embodiment of the invention, a media entity comprising multiple representations or components is parsed to extract the metadata and the media data of each representation/component. Examples of representations/components include layers, such as the layers of various temporal/spatial resolutions and qualities in SVC and the views in MVC. Hereinafter, "layer" also refers to representations/components, and these terms are used interchangeably. The metadata describes, for example, what the media entity contains for each representation and how to use the media data contained therein. The media data comprises the media data samples required to serve the purpose of the media data (for example, decoding the content), or any necessary information about how to obtain the required data samples. The metadata and the media data extracted for each representation or layer are associated/correlated and stored together for user access. Physically, the storage operation can be performed on a hard disk drive or another storage medium, or it can be performed virtually through a relationship management mechanism such that, even when the metadata and media data are actually located at different places on the medium, they appear to other applications or modules to be stored together. Fig. 2 illustrates an example of this embodiment. In Fig. 2, the media entity comprises three layers: a base layer, enhancement layer 1, and enhancement layer 2. The media entity is parsed to extract the metadata and media data of each of the three layers, and these data are stored separately as component files, in which the metadata and the corresponding media data are associated together.
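A toy sketch of the Fig. 2 arrangement follows, assuming the parsed media entity is a mapping from layer name to its (metadata, media data) pair; the dict shapes are hypothetical and chosen only for illustration, not the patent's file layout.

```python
def split_into_component_files(media_entity):
    """Store each layer's metadata and media data together as its own
    component file, as in Fig. 2.  `media_entity` maps layer name to a
    (metadata, media_data) pair (an assumed shape for illustration)."""
    return {layer: {"metadata": meta, "media_data": media}
            for layer, (meta, media) in media_entity.items()}
```

With a three-layer entity (base layer plus two enhancement layers), this yields one component file per layer, each keeping its metadata associated with its corresponding media data.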
Fig. 3 shows the structure of a preferred wrapper 300 for encapsulating and creating component files from a media entity (such as SVC coded video) comprising multiple layers. The input media entity 310 passes through a metadata extractor 320 and a media data extractor 340. The metadata extractor 320 extracts the metadata 330 of each layer. The media data extractor 340 takes the metadata 330 and extracts the corresponding media data 350. Note that in a different embodiment, the metadata extractor 320 and the media data extractor 340 are implemented as a single extractor. Both types of data, the metadata 330 and the media data 350, are fed into a correlator 380, which correlates the two types of data and creates the output component files 390, one component file per layer.
A layered video, such as a video coded with the SVC or MVC extension of AVC, comprises multiple media components (scalable layers or views). By decoding different subsets of the bitstream, such a coded bitstream can provide different operating points in terms of temporal/spatial resolution, quality, view, etc., that is, representations or layers. In addition, there are coding dependencies between the layers of the bitstream; that is, the decoding of one layer may depend on other layers. Therefore, a request for one representation of such a bitstream requires obtaining and decoding one or more components or media data from the encapsulated video file. To facilitate the extraction process for different representations, the coded layered video is often encapsulated as MP4 files in such a manner that each layer is stored separately in a different segment or component file. In this case, due to the decoding dependencies mentioned above, or other application-based dependencies, some media data samples of the bitstream, such as NAL units, that are needed by or related to multiple segments or component files must be taken into account.
In another embodiment of the present invention, the additional media data needed by a segment or component file is extracted and associated with that segment or component file. Fig. 4 shows an example of this embodiment. In the figure, an SVC bitstream has three spatial layers: HD1080p, SD, and QVGA. Three movie fragments or component files, corresponding to these three operating points, are formed, each addressable by a different URL. Inside each movie fragment or component file, all media data samples needed for decoding (NAL units in this example) are copied and stored as media samples contained in the "mdat" box. In this way, when a client requests a particular operating point or representation with the appropriate URL, the server can retrieve the corresponding movie fragment or component file and transmit it to the client. In this example, the media data extractor 340 in Fig. 3 also extracts, for each layer, the additional media data related to the media data extracted for that layer from the input media entity 310. The correlator 380 also associates the extracted additional media data of each layer to create the corresponding component files.
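The Fig. 4 approach of copying every needed sample into each operating point's file can be sketched as follows; the layer names and the dependency map are illustrative assumptions, not structures defined by the patent.

```python
def build_self_contained_fragments(samples_by_layer, decode_order):
    """Fig. 4 style encapsulation: each operating point's fragment
    copies the NAL unit samples of every layer it depends on, so each
    fragment is independently decodable at the cost of duplication.
    `decode_order` maps operating point -> layers whose samples must be
    copied into its mdat (an assumed structure, for illustration)."""
    fragments = {}
    for point, layers in decode_order.items():
        mdat = []
        for layer in layers:
            mdat.extend(samples_by_layer[layer])  # copy, not reference
        fragments[point] = mdat
    return fragments
```

With the three spatial layers of the example, the HD1080p fragment copies the QVGA and SD samples as well as its own, which is exactly the duplication that the extended extractor of the later embodiments avoids.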
For the purpose of saving storage space, it is desirable to be able to reference media data samples, such as NAL units, across movie fragment or component file boundaries, without actually copying the same data into each component file. However, the ISO Base Media File Format (BMFF) and its extensions currently do not support this feature. To address this problem, in another embodiment of the present invention, the additional media data related to, or needed by, the media data of a movie fragment or component file is identified, and references to it are established. These references, rather than the additional media data itself, are associated with the component file together with its metadata and media data. The references can be embedded into the media data extracted for each layer, and the metadata and media data extracted for each layer are then associated to create the corresponding component files.
In this embodiment, a reference identifier 360 is added to the structure of the wrapper 300. The reference identifier 360 identifies, from the input media entity 310, references 370 to the additional media data related to the extracted media data 350 of each layer. The references 370 are then associated, via the correlator 380, with the metadata 330 and the media data 350 extracted for each layer, for example by embedding the references into the extracted media data, in order to create the corresponding component files 390.
As discussed previously, in the SVC/MVC context such references can be established using a mechanism similar to the "extractor". Current extractors can only extract NAL units by reference from other tracks within the same movie box/fragment; they cannot be used to extract NAL units from a different segment or file. This restriction limits the use of extractors in other scenarios. In the following, an extension to the extractor data structure is disclosed, the purpose of which is to support the efficient encapsulation of SVC/MVC-type layered video content into multiple component files, as described above.
The extension is added to the extractor data structure in order to provide the additional capability of referencing NAL units residing in a movie box/fragment or component file different from the one in which the extractor resides.
The extended extractor is defined as follows:
(Syntax figure: the extended extractor comprises the original extractor fields together with an additional data_entry field, a URL or URN entry identifying the file that contains the referenced track, as defined below.)
Semantics:
data_entry is a uniform resource locator (URL) or uniform resource name (URN) entry. name is a URN and is required in a URN entry. location is a URL, is required in a URL entry, and is optional in a URN entry, where it gives a location at which to find the resource with the given name. Each is a null-terminated string of UTF-8 characters. If the self-contained flag is set, the URL form is used and no string is present; the box terminates with the entry-flags field. The URL type should be a service that delivers files. Relative URLs are permitted and are relative to the file containing the movie box/fragment that contains the track in which the extractor resides.
The other fields have the same semantics as in the original extractor described above.
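Under the assumption that the extended extractor body carries the null-terminated UTF-8 location string followed by the original extractor fields (one plausible serialization; the patent's syntax figure defines the normative layout), it could be decoded as follows:

```python
def parse_extended_extractor_body(body: bytes, length_size: int = 4):
    """Split off the null-terminated UTF-8 data_entry location, then
    read the original extractor fields.  The byte layout is an assumed
    serialization, for illustration only."""
    nul = body.index(0)
    location = body[:nul].decode("utf-8")  # URL of the remote fragment/file
    rest = body[nul + 1:]
    track_ref_index = rest[0]
    sample_offset = int.from_bytes(rest[1:2], "big", signed=True)
    data_offset = int.from_bytes(rest[2:2 + length_size], "big")
    data_length = int.from_bytes(rest[2 + length_size:2 + 2 * length_size], "big")
    return location, track_ref_index, sample_offset, data_offset, data_length
```

A relative URL such as "qvga/frag1.m4s" (a hypothetical name) would then be resolved against the file containing the referencing movie fragment, as the semantics above require.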
With the extended extractor, NAL units can now be extracted by reference from a movie box/fragment different from the one in which the extractor resides. Fig. 5 shows such an example, with the same SVC bitstream as Fig. 4 but using the newly extended extractor data structure. As can be seen from the figure, the SD movie fragment can now reference NAL units from the QVGA movie fragment. Likewise, the HD1080p movie fragment can use extractors to reference NAL units from both the QVGA and the SD movie fragments. Compared with Fig. 4, no NAL units are copied across these movie fragments, thereby saving storage space.
Fig. 6 shows the encapsulation operations involved when an SVC/MVC-type video bitstream is encapsulated into multiple movie fragments or component files using the inventive new extractor data structure. The process starts at step 601. At step 610, the NAL units are read one by one. If the end of the bitstream is reached at step 620, the process stops at 690; otherwise the process advances to the next step, 630. Decision step 630 determines whether the current NAL unit depends on NAL units from other tracks for decoding. If it is determined that the current NAL unit does not depend on NAL units from other tracks, control passes to step 640, where a sample is formed with the current NAL unit and placed in the current track. If step 630 determines that dependencies exist between the current NAL unit and NAL units from other tracks, the process continues to step 650. Decision step 650 further determines whether the tracks from which the NAL units needed by the current NAL unit originate reside in the same movie fragment. If those tracks reside in the same movie fragment, step 670 is taken to fill an extended extractor in order to reference the NAL units from those other tracks. If those tracks reside in different movie fragments, the URLs or URNs of those movie fragments are identified at step 660, and the process advances to step 670, where the identified URLs or URNs are filled into the extended extractor. After the extended extractor has been filled, it is embedded into the current track at step 680. The process then restarts at step 610 with the next NAL unit.
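The decision flow of Fig. 6 can be condensed into the following loop. Each NAL unit is modeled as a dict whose "dep" key is None (no cross-track dependency), "same-fragment", or the URL of the fragment holding the source track; these shapes are hypothetical stand-ins for the real bitstream analysis.

```python
def encapsulate(nal_units):
    """One pass over the bitstream (steps 610-680 of Fig. 6), emitting
    a plain sample, an in-fragment extractor, or an extended extractor
    carrying a URL/URN, for each NAL unit."""
    track = []
    for nal in nal_units:                                    # step 610
        dep = nal["dep"]
        if dep is None:                                      # step 630: independent
            track.append(("sample", nal["data"]))            # step 640
        elif dep == "same-fragment":                         # step 650: source track nearby
            track.append(("extractor", nal["data"]))         # step 670
        else:                                                # step 660: remote fragment URL/URN
            track.append(("extended_extractor", dep, nal["data"]))  # steps 670-680
    return track
```

The three branches correspond directly to the three outcomes of decision steps 630 and 650; the loop terminates when the input is exhausted, corresponding to step 620.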
To read a component file, the file reader 700 shown in Fig. 7 is employed. A parser 710 first parses the component file to obtain the metadata and media data, and the references, if available. According to the references, if the media data is related to the media data of other component files, for example through decoding dependencies, a retriever 720 retrieves the related media data from the other component files indicated by the references. A processor 730 then processes the metadata and media data obtained from the component file, as well as the additional media data, if available. The parsing operation of the parser 710 comprises obtaining the metadata, readying the media data for the processor 730, and the various operations necessary to ready the references for the retriever 720. It also includes parsing the metadata and/or media data where necessary. In one embodiment, the references are embedded in the metadata, so the references are obtained by parsing the metadata. If references are available, the parsing step also comprises parsing the syntax of the references and decoding them. If the component file contains video content, the processor 730 can comprise a video decoder. In a different embodiment, the parser and the retriever can be incorporated into the processor.
Fig. 8 shows the process, according to the present invention, by which a video decoder reads an SVC/MVC-type video bitstream. Step 801 accesses the component video file, and step 805 identifies the metadata and media data of each layer of the component video file. The identified metadata and media data are parsed at step 810, and at step 815 the NAL units of the media data are read one by one. For the current NAL unit, a first determination is made at step 820 as to whether the end of the bitstream has been reached; if the answer is yes, the process ends at step 825. Otherwise, the process advances to decision step 830 to determine whether the current NAL unit is an extractor. If it is not an extractor, meaning it is an ordinary NAL unit containing coded data, the NAL unit is sent to the decoder at step 835. If the current NAL unit is an extractor, it is determined at step 840 whether the current NAL unit depends on NAL units outside the same component file. If the required NAL unit is in the same component file, it is retrieved from the current file at step 845 and sent to the decoder at step 835. If the required NAL unit comes from another component file, the NAL unit is located at step 850 using the data_entry information in the extractor, retrieved from the remote file at step 855, and then sent to the decoder at step 835.
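The reading flow of Fig. 8 reduces to a similar loop; the tuple shapes and the two fetch callbacks (standing in for steps 845 and 850-855) are illustrative assumptions, not the patent's interfaces.

```python
def read_and_decode(units, fetch_local, fetch_remote, decode):
    """Walk the parsed media data (steps 815-855 of Fig. 8): ordinary
    NAL units go straight to the decoder; extractors are resolved
    locally or, via their data_entry URL, from a remote component file."""
    out = []
    for unit in units:
        kind = unit[0]
        if kind == "nal":                          # step 830: ordinary NAL unit
            out.append(decode(unit[1]))            # step 835
        elif kind == "extractor":                  # step 845: same component file
            out.append(decode(fetch_local(unit[1])))
        else:                                      # steps 850-855: remote component file
            _, url, ref = unit
            out.append(decode(fetch_remote(url, ref)))
    return out
```

In a real reader, fetch_remote would issue an HTTP request for the component file named by the data_entry URL; here it is a plain callback so the control flow stays visible.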
Although preferred embodiments of the present invention have been described in detail herein, it should be understood that the invention is not limited to these embodiments, and that other modifications and variations may be effected by those skilled in the art without departing from the scope of the invention as defined by the appended claims.

Claims (29)

1. A method for creating component files from a media entity comprising more than one layer, the method comprising the steps of:
extracting the metadata of each layer from the media entity;
extracting, from the media entity, the media data corresponding to the metadata extracted for each layer of the media entity;
identifying references to additional media data related to the media data extracted for each layer;
embedding the references into the media data extracted for each layer; and
associating the metadata extracted for each layer with the extracted media data, for creating corresponding component files.
2. The method according to claim 1, wherein the component file is at least one of a movie box, a movie fragment, a segment, and a file.
3. The method according to claim 2, wherein the media data and the additional media data comprise data samples.
4. The method according to claim 3, wherein the data samples comprise network abstraction layer units.
5. The method according to claim 4, wherein the additional media data related to the media data extracted for each layer comprises the network abstraction layer units on which the network abstraction layer units in the extracted media data depend.
6. The method according to claim 5, wherein the references comprise location information of the network abstraction layer units in the additional media data.
7. The method according to claim 6, wherein the location information comprises at least one of a uniform resource locator and a uniform resource name.
8. The method according to claim 7, wherein the embedding step further comprises:
filling an extractor with the references to the network abstraction layer units in the additional media data; and
embedding the extractor into a track of the extracted media data.
9. A file wrapper for creating component files from a media entity comprising more than one layer, the wrapper comprising:
an extractor, which extracts the metadata of each layer from the media entity and extracts, from the media entity, the media data corresponding to the metadata extracted for each layer;
a reference identifier, which identifies references to additional media data of the media entity related to the media data extracted for each layer; and
a correlator, which embeds the references into the media data extracted for each layer and associates the extracted media data with the extracted metadata so that a component file comprising the extracted metadata and the extracted media data can be created for each layer.
10. The file wrapper according to claim 9, wherein the component file is at least one of a movie box, a movie fragment, a segment, and a file.
11. The file wrapper according to claim 9, wherein the media data and the additional media data comprise data samples.
12. The file wrapper according to claim 11, wherein the data samples comprise network abstraction layer units.
13. The file wrapper according to claim 12, wherein the additional media data related to the media data extracted for each layer comprises the network abstraction layer units on which the network abstraction layer units in the extracted media data depend.
14. The file wrapper according to claim 13, wherein the references comprise location information of the network abstraction layer units in the additional media data.
15. The file wrapper according to claim 14, wherein the location information comprises at least one of a uniform resource locator and a uniform resource name.
16. The file wrapper according to claim 15, wherein the correlator further fills an extractor with the references to the network abstraction layer units in the additional media data, and embeds the extractor into a track of the extracted media data.
17. A method of reading a component file, comprising:
parsing the component file to obtain the media data therein and references; and
according to the references, if the media data of the component file is related to the media data of other component files, using the references to retrieve the related media data from the other component files.
18. The method according to claim 17, wherein the media data of the component file is related to the media data of the other component files through coding dependencies.
19. The method according to claim 17, wherein the component file is at least one of a movie box, a movie fragment, a segment, and a file.
20. The method according to claim 17, wherein the media data and the related media data comprise data samples.
21. The method according to claim 20, wherein the data samples comprise network abstraction layer units.
22. The method according to claim 21, wherein the references comprise extractors.
23. A file reader, comprising:
a parser, which parses a component file to obtain the media data therein and references;
a retriever, which retrieves, according to the references, media data related to the media data from other component files; and
a processor, which processes the metadata, the media data, and the media data retrieved from the other component files.
24. The file reader according to claim 23, wherein the media data of the component file is related to the media data of the other component files through coding dependencies.
25. The file reader according to claim 23, wherein the component file is at least one of a movie box, a movie fragment, a segment, and a file.
26. The file reader according to claim 23, wherein the media data and the related media data comprise data samples.
27. The file reader according to claim 26, wherein the data samples comprise network abstraction layer units.
28. The file reader according to claim 27, wherein the references comprise extractors.
29. The file reader according to claim 23, wherein the processor comprises a video decoder.
CN2011800396198A 2010-06-14 2011-06-13 Method and apparatus for encapsulating coded multi-component video Pending CN103098485A (en)

Applications Claiming Priority (3)

Application Number Priority Date Filing Date Title
US35442210P 2010-06-14 2010-06-14
US61/354,422 2010-06-14
PCT/US2011/040164 WO2011159602A1 (en) 2010-06-14 2011-06-13 Method and apparatus for encapsulating coded multi-component video

Publications (1)

Publication Number Publication Date
CN103098485A true CN103098485A (en) 2013-05-08

Family

ID=48208643

Family Applications (1)

Application Number Title Priority Date Filing Date
CN2011800396198A Pending CN103098485A (en) 2010-06-14 2011-06-13 Method and apparatus for encapsulating coded multi-component video

Country Status (2)

Country Link
JP (1) JP2013534101A (en)
CN (1) CN103098485A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103098484A (en) * 2010-06-14 2013-05-08 汤姆森许可贸易公司 Method and apparatus for encapsulating coded multi-component video

Citations (5)

Publication number Priority date Publication date Assignee Title
US20040006575A1 (en) * 2002-04-29 2004-01-08 Visharam Mohammed Zubair Method and apparatus for supporting advanced coding formats in media files
CN1559119A (en) * 2001-09-24 2004-12-29 ��˹��ŵ�� Streaming of multimedia files comprising meta-data and media-data
CN1666195A (en) * 2002-04-29 2005-09-07 索尼电子有限公司 Supporting advanced coding formats in media files
US20080095230A1 (en) * 2006-10-20 2008-04-24 Nokia Corporation Generic indication of adaptation paths for scalable multimedia
CN101595475A (en) * 2005-07-15 2009-12-02 索尼株式会社 Scalable video (SVC) file layout

Family Cites Families (4)

Publication number Priority date Publication date Assignee Title
JP4299836B2 (en) * 2002-07-12 2009-07-22 パナソニック株式会社 Data processing device
US20070022215A1 (en) * 2005-07-19 2007-01-25 Singer David W Method and apparatus for media data transmission
KR20050092688A (en) * 2005-08-31 2005-09-22 한국정보통신대학교 산학협력단 Integrated multimedia file format structure, its based multimedia service offer system and method
JP4818373B2 (en) * 2006-01-09 2011-11-16 韓國電子通信研究院 SVC file data sharing method and file

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
GRUNEBERG, K. ET AL.: "Deliverable D3.2 MVC/SVC storage format", Information and Communication Technologies (ICT) Programme No. FP7-ICT-214063 *

Also Published As

Publication number Publication date
JP2013534101A (en) 2013-08-29

Similar Documents

Publication Publication Date Title
EP3703384B1 (en) Media encapsulating and decapsulating
US9922680B2 (en) Method, an apparatus and a computer program product for processing image sequence tracks
US11638066B2 (en) Method, device and computer program for encapsulating media data into a media file
CN103098484A (en) Method and apparatus for encapsulating coded multi-component video
CN110870282B (en) Processing media data using file tracks of web content
WO2012032502A1 (en) A method and apparatus for adaptive streaming
CN107634930B (en) Method and device for acquiring media data
TW201909007A (en) Processing media data using a common descriptor for one of the file format logic boxes
US20130097334A1 (en) Method and apparatus for encapsulating coded multi-component video
US20130091154A1 (en) Method And Apparatus For Encapsulating Coded Multi-Component Video
CN110545469B (en) Webpage playing method, device and storage medium of non-streaming media file
CN110996160A (en) Video processing method and device, electronic equipment and computer readable storage medium
CN113170236A (en) Apparatus and method for signaling information in container file format
CN114697631B (en) Immersion medium processing method, device, equipment and storage medium
CN103098485A (en) Method and apparatus for encapsulating coded multi-component video
EP3821614B1 (en) An apparatus, a method and a computer program for video coding and decoding
CN115004716A (en) Volumetric media processing method and device
EP4068781A1 (en) File format with identified media data box mapping with track fragment box
WO2020240089A1 (en) An apparatus, a method and a computer program for video coding and decoding

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C02 Deemed withdrawal of patent application after publication (patent law 2001)
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20130508