Data File Splitting
The present invention relates to the distribution of data files, and more specifically, to methods and systems for the distribution of data files which involve splitting of the files.
Background
Data such as media content is normally transmitted over telecommunications networks as single data files. Various considerations apply when transmitting data. Firstly, in order to provide security for data files such as media files, such data files are often encrypted. Further, due to their size and to bandwidth limitations, as well as for convenience, large data files may need to be split into smaller component files in order to be transmitted and/or stored, or may be interleaved in order to allow rapid access, powerful compression, or to simplify or increase the speed of decoding.
Security
For reasons of security, encryption algorithms are often used to encrypt file data to prevent access to their content by unwanted users, whereby a key is required in order to decrypt the file. Access to the decryption key thus becomes a crucial factor in determining who is able to decrypt the file.
Entitled users will generally receive a key through a registration and authentication process, which will then enable the receiver to decode the data stream. Standards used in such systems include the Internet Protocol (IP) Encapsulating Security Payload (RFC 1827), which provides authentication, integrity and encryption, making any embedded information impossible to decode without access to the key. The IP Authentication Header (RFC 1826) can be used in a trusted network (e.g. local Ethernet) to provide authentication and integrity without the overhead of encryption.
The use of these standards means that efforts have been focussed on the secure distribution and access of the decrypting keys for this process. The simplest case will be when a single entity will distribute the key depending on security requirements.
Encryption is often used in one-to-many situations such as over the air broadcasts, and multicast over IP networks to protect premium content. In these situations additional problems can arise when malicious receivers share their decryption key with other un-entitled receivers. This problem can be reduced through a regular "re-keying" of the stream so that authentication and access control have to take place more regularly. In such systems it is often true that a trade-off needs to be made between the security, in particular of the multicast stream, and the high cost of creating the keys required to transmit the data. In general, it will be noted however that encryption, by whatever method is chosen, is performed on the whole content of the data file, and the key is a separate item which is not related to or based on data making up part of the content of the data file. Encryption effectively "modifies data using a mathematical function" and it is never the intention to remove or "lose" data by a process of encryption.
File Splitting and Interleaving
On account of their size, large data files may need to be split into smaller component files in order to be transmitted and/or stored. There are a large number of computer software utilities which split data files into smaller component files - typically these are intended for converting a large file so that it fits on several floppy disks. The default component file size is thus usually set to 1.4 Mbytes. The split is done purely on the file length, so the first fragment of 1.4 Mbytes goes on the first floppy, the next 1.4 Mbytes on the second floppy, etc.
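By way of illustration only, the length-based splitting performed by such utilities can be sketched as follows. The function names `split_by_length` and `join_fragments` are hypothetical and are not taken from any particular utility; the fragment size is a parameter rather than the 1.4 Mbyte default mentioned above.

```python
def split_by_length(data: bytes, chunk_size: int) -> list[bytes]:
    """Cut data into consecutive fragments of at most chunk_size bytes."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]


def join_fragments(fragments: list[bytes]) -> bytes:
    """Concatenate the fragments in order to recover the original data."""
    return b"".join(fragments)


if __name__ == "__main__":
    original = bytes(range(10))
    parts = split_by_length(original, 4)
    print([len(p) for p in parts])  # fragment sizes
    print(join_fragments(parts) == original)
```

Each fragment here is a contiguous, full-quality excerpt of the original; none of them serves as a preview of the whole content.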
Pixel, raster and frame interleaving are used in various image storage formats, but these are used solely for the purpose of storing image data in ways that facilitate rapid access, powerful compression, or simple and fast decoding. They are never used to separate a single content file into component files.
Storage of information in databases is often interleaved in order to improve the processing: by placing related information in the same place, then it is quicker and easier to process. Kent Seamons' thesis details the process:
http://drl.cs.uiuc.edu/people/Seamons/thesis/node8.html
In Apple's QuickTime™ format, video files are typically interleaved with audio files in order to minimise the storage required to keep the audio and video in synchronisation:
http://developer.apple.com/techpubs/quicktime/qtdevdocs/QTFF/qtff-204.html
Similar interleaving of bytes or words from audio and video sources is used in MPEG (Moving Picture Experts Group) encoding. But the final storage format of all these interleaved movie files is always a single file.
Some hard disk storage in RAID (Redundant Array of Independent Disks) arrays uses striping, where the data is distributed across several different hard disks. But the logical data file remains a single entity despite the fact that the physical file has been distributed across several hard disks. The interleaving algorithms used are very simple: 1:1 for two hard disks, 1:1:1 for three, etc.
Some computer networks use interleaving of data packets in order to maximise the performance when the network is near capacity and lots of access 'collisions' occur. As with the QuickTime and MPEG examples, the intention here is to use time-sequencing to improve performance, not to provide multiple files for security purposes, and what is initially provided to a recipient relates to the whole content of the original data.
Layered Coding
Layered coding is a known method of delivering audio and/or video over a network. International Application WO 02/51149 describes a method of delivering video over a network using a multi-layered video coding system. The method of this document comprises separating a digitally compressed video signal into multiple sub-signals, coding each of the sub-signals and transmitting each of them over asynchronous transfer mode (ATM) paths, receiving each sub-signal, and selecting certain of the sub-signals according to a bandwidth suitable for subsequent reception over a digital subscriber line path. The step of combining selected sub-signals is based on the data rate capacity of the digital subscriber line path.
In another disclosure, the article "Low-Complexity Video Coding for Receiver-Driven Layered Multicast" by Steven McCanne, Martin Vetterli, Van Jacobson (IEEE Journal of Selected Areas in Communications, Vol. 15, No. 6, August 1997, pages 982-1001)
http://lcavwww.epfl.ch/publications/97/postscripts/MccanneVetterliJacobson.pdf
describes a system in which the burden of rate adaptation during real-time media transmission may be placed on the receiver rather than the source.
It is clear from both of the above disclosures that the source distributes EACH of the multiple layers simultaneously across network channels such that they may all be received by a receiver, after which the receiver adapts its reception rate by adjusting the number of layers that it receives.
In UK patent application GB 2,379,295, there is disclosed a system for distributing audio or video material to a potential buyer. The system generates an impaired version of the material to be distributed, thus allowing a facility for the buyer to sample the material before acquisition, but the impaired version is generated by adding data to the original material, leading to effects such as added visible or invisible "watermarks".
A further prior art patent application EP 1,260,898 relates to methods for the authentication of files. Two separate concepts are disclosed, the first of which is said to involve making minor amendments to an output file which are designed not to attract attention (i.e. designed not to degrade the quality of the file), but to be detectable for the sake of proving the file's origin in case of infringement of the file's copyright. The second concept relates to the adding of data to the original material, in much the same way as is disclosed in GB 2,379,295, discussed above.
Summary of the Invention
An aim of the present invention is to provide methods and systems for distributing data files which allow a data distributing party to derive a degraded file from the original data file and distribute the degraded file to one or more potential recipients thereof, while at least temporarily withholding a supplementary file, also derived from the original data file, which contains data from the original data file that is not included in the degraded file. Such a degraded file may contain a sufficient proportion of the content of the original data file, as determined by the party distributing the data, to allow potential recipients to have access to sufficient data to "preview" the whole content of the original data file without having access to a fully usable copy of the original data file, and without the need for the content of the data file to be encrypted. Potential recipients who decide that they do wish to receive the whole content of the original data file may become recipients of the full "undegraded" content after agreement with, payment to, or permission from the distributing party, for example, following receipt of the initially withheld supplementary file which may be used, together with the degraded file, to reconstitute the whole content of the original data file.
The present invention is defined in the claims appended hereto, with advantages, preferred features and embodiments which will be apparent from the description and claims.
Prior art techniques for providing a preview of content (video, music etc) in such a way that the recipient can sample the content before buying the full version include pay-per-view movies where a short excerpt of a film is made available to a potential viewer who may view it before being given the chance to buy the whole version. The techniques disclosed differ from this since instead of the recipient receiving a full quality version of an excerpt of the film (content), the recipient may receive a degraded quality version of the whole content of the film for preview and the recipient can pay to receive the full quality version if they wish.
Embodiments of the invention differ from encryption techniques because while encryption effectively "modifies data using a mathematical function", it is never the intention to remove or "lose" data by a process of encryption, nor is it the intention to provide a preview of the content.
Various methods are later disclosed for degrading the quality of a data file. An advantage of many of the disclosed techniques is that the degraded part may be the majority of the content and can be distributed using better and quicker but less secure methods (websites, commercial CD distributions etc) while the supplementary or missing part required to make up the quality can be sent securely, but at low cost, because it is small.
The derivation of the degraded file may be achieved by any of a variety of degradation functions. The degradation function may for example be a function of one or more characteristics of the data from the main data file. Such a degradation function could allow for higher security even before taking into account any encryption since a recipient with knowledge of functions used for previous data files would not necessarily be able to ascertain the degradation function used for further data files. Alternatively, the degradation function could be a function of one or more characteristics of data relating to the recipients, such as their IP addresses. Such degradation functions could allow a means of tracking the source of unauthorised copies should one of many recipients redistribute data files without permission, since respective recipients would be provided with different degraded files, the differences allowing for individual "water-marking" before and even after reconstitution of the original data file.
The degraded files and the supplementary files could be distributed by any of a variety of means. They could be transmitted individually to each recipient by e-mail, or sent individually to each recipient in the form of a floppy disc, CD-ROM or otherwise, or broadcast such as to be available to an open or closed group of potential recipients. Alternatively, they could be made available on-line for download via the internet. The degraded files and the supplementary files need not be distributed by the same means; the degraded file could be made available on-line, with the supplementary file being sent by e-mail to recipients following payment to the provider, for example. The supplementary file, and in certain cases the degraded file, could be distributed in encrypted form, in order to provide a further level of security. The decryption key or keys could then be provided together with or separate from the respective files.
It will be noted however that embodiments of the present invention facilitate the distribution of files to a new location or to several different locations (geographically or virtually or media) at levels of security which may be chosen according to requirements, while avoiding the need for the full content of the files to be subjected to encryption.
Brief Description of the Drawings
Embodiments of the invention will now be described with reference to the accompanying figures in which:
Figure 1 is a schematic diagram representing a standard communications network allowing exchange of data between terminals via the internet;
Figure 2 is a block diagram illustrating the treatment of data according to embodiments of the invention.
Detailed Description of the Invention
With reference to Figure 1, a conventional personal computer 101 is connected to a network 103 such as a wide area network (WAN) or, more specifically, the Internet. Another computer 105 connected to the WAN 103 acts as a server computer for a data distributing entity, such as a multimedia distributing organisation, for example, offering audio, video, image, text, software or other data in the form of media files for sale to the public. The computers 101, 105 may be connected to the WAN 103 via Local Area Networks (LANs) 107 coupled with access to a gateway server computer (not shown) that enables the computers 101, 105 to access the WAN 103. Alternatively, the connection 107 may be provided via home Internet access such as broadband and telephone line based access. It will also be evident that the computers 101, 105 need not have different status within the hierarchy of the communications system, since certain embodiments of the invention apply equally to systems for "file-sharing" between individual home PC-users, for example. For ease of explanation however, the computer 101 will generally be referred to as the "recipient" computer, arranged to access the server computer 105 which will generally be referred to as the "distributing" computer, but this should not be taken as implying any limitation to the scope of the invention.
The recipient and distributing computers have software and hardware to be able to access the WAN 103, an operating system (e.g. Microsoft Windows™) and a web browser (e.g. Microsoft Internet Explorer™, or Netscape Navigator™).
The distributing computer has access to a database of files, which for the purposes of this example may be audio files. These may be stored in any of a variety of known types of memory. Individually, in groups, or en masse, these files are offered in "preview" form to the users of computers such as the PC 101 according to the following procedure. First each file is split according to any of a variety of techniques into a degraded "preview" file and a supplementary "key" file. Techniques for splitting the data will be described in the following section of the description, which includes a variety of techniques for splitting data files into component files (and subsequently reconstituting them) which suit differing requirements for security, diversity of location or transmission method. Next, the degraded preview file is made available to potential recipients in any of a variety of ways, in this example, by including it in an e-mail to the recipients. The preview file is degraded in order to allow the recipients to decide, without having access to a full-quality copy of the original audio file, if they wish to purchase a copy of the original file. Recipients who decide not to proceed further need take no further action - they will have a degraded copy only, of insufficient quality to encourage further listening or re-distribution. Recipients who decide that they would like to receive a full-quality copy of the original audio file may contact the distributor by e-mail or otherwise and may be asked to pay a fee, in return for which the distributing computer will make the supplementary file available to them together with instructions for reconstituting the degraded file and the supplementary file into a copy of the original file.
With reference to Figure 2, an original or "main" data file from a database or "library" of files under the control of a distributor may be subjected to encoding by an optional encoder 201, which may compress or otherwise encode the data. The data is then split by a file splitter 202 which splits the data according to a degradation function into:
- a degraded "preview" file, which relates to, or comprises (if in un-encoded form, or once decoded, if necessary) a proportion of the data (ninety per cent, for example) sufficient to provide a low-quality version of the whole of the original data file; and
- a supplementary "key" or "completion" file which relates to or comprises the remainder of the data from the original data that was removed during the derivation of the degraded file.
While the file splitter is represented in Figure 2 by a single box element 202 representing a data processing unit which derives the degraded file and the supplementary file from the original data file, it will be apparent that the role of the file splitter could equally well be performed by two separate data processing units 202 each of which receives the original data (encoded or un-encoded), one of which derives the degraded file and the other of which derives the supplementary file from the original data. In such an embodiment the functions performed by the respective file splitters need not correspond exactly. If they do correspond, and split the file according to corresponding functions, it can be ensured that exactly all of the data not contained in the degraded file will be contained in the supplementary file, however.
The degraded file may then be distributed as a "majority data stream" to potential recipients using e-mail, the internet, any type of broadcast, multicast or unicast system, in memory such as CD-ROM, or by any other suitable means. Additionally, a "metadata file" containing all of the required information to allow the later recombination of the majority data stream with a "minority data stream" containing the supplementary file data may be distributed with the preview file, as a header, for example. The metadata file may include a reconstitution algorithm and instructions as to where to get the minority stream from, for example. This is an optional component, due to the fact that if the splitter and combiner have an agreed splitting strategy and shared knowledge of file locations, this file is not required.
The degraded file data is received by recipients who, in the absence of supplementary file data are able to preview the content of the original data file at low quality. Such recipients may be regarded as potential recipients of the original data file. Only following a request from and/or the meeting of conditions by a recipient will the recipient also receive the supplementary file data, which may then be processed by file combiner 203 in conjunction if necessary with the metadata file, and passed if necessary to decoder 204. It will be noted however that with certain embodiments of the invention the recipient is not required to have any additional specialised software other than a standard decoder where necessary (Windows Media™, MPEG etc) in order to be able to receive and take advantage of this preview. Only if they wish to complete the stream may the user require combiner software to combine the data streams.
As will be apparent from Figure 2, embodiments of the present invention may be used to distribute data files by splitting data files in real time on a video stream over an IP network. The majority stream may be sent through more efficient low overhead multicast, with lower security. The minority stream may be sent over the higher overhead but higher security unicast stream. This stream is the one which provides access control whilst the multicast stream enables previewing for all recipients.
It should be noted that while the specific embodiments described above involve splitting of the original data file into one degraded file and one supplementary file, it is foreseeable that according to some embodiments there may be two or more supplementary files each containing data from the original data file which does not form part of the degraded file. Such embodiments may allow different quality levels to be available, or may allow for variable security levels or other optional features.
File Splitting Techniques
File splitting techniques based on two main concepts will be described in detail; however, it will be apparent to the skilled addressee that other types of file splitting are also applicable. The two main concepts are algorithmic splitting (where a mathematical process is used to derive the split files) and content-based splitting (where the splitting function is dependent on the data itself). Either of these splitting techniques can be 'personalised' to a specific end-user or recipient using a number of methods, so that the files which are transmitted cannot be reconstructed if intercepted.
These processes lead to the formation of a file splitting technique: application based splitting, which uses uneven file splitting to produce a secure degraded file allowing the ability to preview at low quality, the supplementary file serving as a full quality unlocking "key" file.
According to embodiments of the present invention, single data files which contain content of some form (for example audio or video) may be split into two or more 'component' files which can then be transmitted by any means, either physical (a CD-ROM, DVD, Smart-Card etc) or virtual (Network, Hard Disk Download, FTP, etc). The transmission does not need to be simultaneous or synchronised in any way, unless real-time transmission is required.
The content is thus not necessarily ever transmitted in a complete form over a single transmission medium, nor is it ever transmitted as a single file (encrypted or otherwise - it is debatable if certain types of conventional encryption truly provide any additional protection of the content). The component files could be different for each intended recipient by using a unique or personalised aspect of the recipient's location, URL, email address, Ethernet port number etc.
Where application based splitting is implemented, we can go further than this and say that the splitting process produces a number of files. One of these files will act as a small sized "key" file to the other files. The solution proposed offers an alternative to encryption. It can be implemented to allow a compromise between low bandwidth consuming, but less secure access mediums (multicast, CD distribution) and higher security but lower efficiency mediums (unicast). This solution offers a trade-off that gives both good security and acceptable bandwidth requirements.
Algorithm-based splitting
The simplest algorithm for splitting a data file into two would be to de-interleave successive bytes or words into two separate files. For the case of 16 bit words, then this would turn the sequence of words:
00 FF 01 FE 02 FD 03 FC 04 FB 05 FA 06 F9 07 F8 08 F7 09 F6 0F F5 0A F4 0B F3
into two files containing:
File 1: 00 01 02 03 04 05 06 07 08 09 0F 0A 0B
File 2: FF FE FD FC FB FA F9 F8 F7 F6 F5 F4 F3
There are a large number of potential methods based around splitting using other de-interleaving techniques. The technique shown above splits the file with each successive byte being sent to the output files alternately. This can be expressed as 1:1, where the two numbers indicate the number of bytes which are sent to each file in turn.
An algorithm such as 1 :2 would take this sequence of words:
01 FF FE 02 FD FC 03 FB FA 04 F9 F8 05 F7 F6 06 F5 F4 07 F3 F2 08 F1 F0
and produce this output:
File 1: 01 02 03 04 05 06 07 08
File 2: FF FE FD FC FB FA F9 F8 F7 F6 F5 F4 F3 F2 F1 F0
At the receiver, the algorithm is reversed so that the two split files are interleaved again in the correct sequence, and the original file is thereby reconstructed.
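The fixed-ratio splitting and its reversal at the receiver can be sketched as follows. This is a minimal illustration only: the function names `split_fixed` and `join_fixed` are hypothetical, and words are represented as a list of integers for simplicity.

```python
def split_fixed(words: list[int], a: int, b: int) -> tuple[list[int], list[int]]:
    """De-interleave: a words go to file 1, then b words to file 2, repeating."""
    if a < 1 or b < 1:
        raise ValueError("a and b must be at least 1")
    file1, file2 = [], []
    i = 0
    while i < len(words):
        file1.extend(words[i:i + a])
        i += a
        file2.extend(words[i:i + b])
        i += b
    return file1, file2


def join_fixed(file1: list[int], file2: list[int], a: int, b: int) -> list[int]:
    """Reverse split_fixed by re-interleaving a words, then b words, in turn."""
    if a < 1 or b < 1:
        raise ValueError("a and b must be at least 1")
    out = []
    i = j = 0
    while i < len(file1) or j < len(file2):
        out.extend(file1[i:i + a])
        i += a
        out.extend(file2[j:j + b])
        j += b
    return out


if __name__ == "__main__":
    # The 1:2 sequence given above.
    seq = list(bytes.fromhex(
        "01 FF FE 02 FD FC 03 FB FA 04 F9 F8 05 F7 F6 06 F5 F4 07 F3 F2 08 F1 F0"))
    f1, f2 = split_fixed(seq, 1, 2)
    print(bytes(f1).hex(" "))
    print(bytes(f2).hex(" "))
    print(join_fixed(f1, f2, 1, 2) == seq)
```

The receiver needs only the two ratio values in addition to the two files, which is why the same pair (a, b) is passed to both functions.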
The splitting algorithm need not be restricted to simple fixed interleaving values. By using a scheme expressed as n:n (n=1 at start), the split ratios will be 1:1 for the first two words, then 2:2 for the next four words, then 3:3 for the next six words, etc. By setting the start value of n to other values, additional splitting algorithms are available. Algorithms like n:n+1 (n=1 at start) and other extensions are also possible.
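A sketch of the n:n scheme is given below, on the assumption that n is incremented by one after each pair of runs. The names `split_growing` and `join_growing` and the `start` parameter are hypothetical.

```python
def split_growing(words: list[int], start: int = 1) -> tuple[list[int], list[int]]:
    """n words to file 1, n words to file 2, then n increases by one."""
    if start < 1:
        raise ValueError("start must be at least 1")
    file1, file2 = [], []
    i, n = 0, start
    while i < len(words):
        file1.extend(words[i:i + n])
        i += n
        file2.extend(words[i:i + n])
        i += n
        n += 1
    return file1, file2


def join_growing(file1: list[int], file2: list[int], start: int = 1) -> list[int]:
    """Reverse split_growing using the same start value of n."""
    if start < 1:
        raise ValueError("start must be at least 1")
    out = []
    i = j = 0
    n = start
    while i < len(file1) or j < len(file2):
        out.extend(file1[i:i + n])
        i += n
        out.extend(file2[j:j + n])
        j += n
        n += 1
    return out


if __name__ == "__main__":
    seq = list(range(12))
    f1, f2 = split_growing(seq)
    print(f1)
    print(f2)
    print(join_growing(f1, f2) == seq)
```

Changing `start` yields a different split of the same data, so the start value acts as a parameter that the splitter and combiner must share.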
Extending this formula-based technique offers splits based on other mathematical series constructs like the Fibonacci series (0, 1, 1, 2, 3, 5, 8, 13, ... the rule is 'add the last two to get the next'), giving algorithms of the form F+1:F+1 or F+1:F+n (n=2 at start).
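One possible reading of the F+1:F+1 form is sketched below, on the assumption that F steps through the Fibonacci series and that each step sends F+1 words to each file in turn. This interpretation, and the names `fib_schedule`, `split_by_schedule` and `join_by_schedule`, are illustrative assumptions rather than a definitive statement of the algorithm.

```python
from typing import Iterable, Iterator


def fibonacci() -> Iterator[int]:
    """Yield 0, 1, 1, 2, 3, 5, 8, ... indefinitely."""
    a, b = 0, 1
    while True:
        yield a
        a, b = b, a + b


def fib_schedule() -> Iterator[tuple[int, int]]:
    """Assumed F+1:F+1 schedule: run lengths (1,1), (2,2), (2,2), (3,3), ..."""
    for f in fibonacci():
        yield f + 1, f + 1


def split_by_schedule(words: list[int],
                      schedule: Iterable[tuple[int, int]]) -> tuple[list[int], list[int]]:
    """Send a words to file 1 and b words to file 2 for each (a, b) in schedule."""
    file1, file2 = [], []
    i = 0
    for a, b in schedule:
        if i >= len(words):
            break
        file1.extend(words[i:i + a])
        i += a
        file2.extend(words[i:i + b])
        i += b
    return file1, file2


def join_by_schedule(file1: list[int], file2: list[int],
                     schedule: Iterable[tuple[int, int]]) -> list[int]:
    """Reverse split_by_schedule given a fresh copy of the same schedule."""
    out = []
    i = j = 0
    for a, b in schedule:
        if i >= len(file1) and j >= len(file2):
            break
        out.extend(file1[i:i + a])
        i += a
        out.extend(file2[j:j + b])
        j += b
    return out


if __name__ == "__main__":
    seq = list(range(20))
    f1, f2 = split_by_schedule(seq, fib_schedule())
    print(f1)
    print(f2)
    print(join_by_schedule(f1, f2, fib_schedule()) == seq)
```

Because the schedule is passed in as a separate generator, other series could be substituted without changing the splitting or joining functions.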
Content-based splitting.
By making the split algorithm dependent on the content itself, then the splitting becomes dynamic. For example, taking the sequence:
01 FF 02 FE FD 01 FC 03 FB FA F9 00 02 F8 F7 01 F6 06 F5 F4 F3 F2 F1 F0
and using the first number to determine the number of words which are sent to the second file, then the split becomes:
File 1: 01 02 01 03 00 02 01 06
File 2: FF FE FD FC FB FA F9 F8 F7 F6 F5 F4 F3 F2 F1 F0
This can be expressed as 1:=, where the equals sign refers back to the first file number value.
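The 1:= scheme can be sketched as follows, with each word sent to the first file giving the number of following words sent to the second file. The function names `split_content` and `join_content` are hypothetical.

```python
def split_content(words: list[int]) -> tuple[list[int], list[int]]:
    """1:= split: one word to file 1, then that many words to file 2."""
    file1, file2 = [], []
    i = 0
    while i < len(words):
        count = words[i]          # the value itself sets the next run length
        file1.append(count)
        i += 1
        file2.extend(words[i:i + count])
        i += count
    return file1, file2


def join_content(file1: list[int], file2: list[int]) -> list[int]:
    """Reverse split_content: each file 1 word says how many file 2 words follow."""
    out = []
    j = 0
    for count in file1:
        out.append(count)
        out.extend(file2[j:j + count])
        j += count
    return out


if __name__ == "__main__":
    # The sequence given above.
    seq = list(bytes.fromhex(
        "01 FF 02 FE FD 01 FC 03 FB FA F9 00 02 F8 F7 01 F6 06 F5 F4 F3 F2 F1 F0"))
    f1, f2 = split_content(seq)
    print(bytes(f1).hex(" "))
    print(bytes(f2).hex(" "))
    print(join_content(f1, f2) == seq)
```

No separate ratio needs to be communicated in this case, since the first file carries the run lengths needed to reconstruct the original sequence.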
The algorithm =:= uses the first value as the splitting value for the next set of words, and then the next value as the next splitting value. So the sequence of words:
01 02 03 04 05 06 07 08 09 0A 0B 0C 0D 0E 0F 10 11 12 13 14 15 16 17 18 19 1A ...
would result in a split into the following two files:
File 1: 01 04 05 06 07 0F 10 11 12 13 14 15 16 17 18 19 1A 1B 1C 1D
File 2: 02 03 08 09 0A 0B 0C 0D 0E 1E 1F 20 21 22 23 24 25 26 27 28 29 2A
In order to encode the technique being used by the splitting algorithm, then one or more words could be pre-pended to the start of one of the split files to indicate the technique being used. Alternatively these words could be sent as a separate file, or even included as part of the physical storage on a CD-ROM, or Smart Card. The splitting technique could even be derived from the location information for the recipient, by mapping the location information to the splitting algorithm:
For example, if the IP address of the recipient is 147.132.39.57, then the final number, 57, could be used to determine the splitting algorithm used. Modulo arithmetic could be used to convert the 0-255 range of each number in the address to any smaller range of potential splitting algorithms. Alternatively, the entire address (or a sub-set of it) could be used to produce a number which determines the splitting algorithm: either by adding the numbers together, or by any other arithmetic or Boolean process to produce a result number which is of the correct range to select a splitting algorithm.
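The mapping from an address to a splitting algorithm can be sketched as follows, using the dotted address 147.132.39.57 from the example above. The table `ALGORITHMS` and the function names are hypothetical; any ordered list of splitting algorithms agreed between splitter and combiner could take its place.

```python
# Hypothetical table of splitting algorithms, identified by their notation.
ALGORITHMS = ["1:1", "1:2", "n:n", "n:n+1", "1:=", "=:="]


def select_by_last_number(address: str, table: list[str] = ALGORITHMS) -> str:
    """Pick an algorithm from the final number of a dotted address, modulo table size."""
    last = int(address.split(".")[-1])
    return table[last % len(table)]


def select_by_sum(address: str, table: list[str] = ALGORITHMS) -> str:
    """Pick an algorithm from the sum of all numbers in the address, modulo table size."""
    total = sum(int(part) for part in address.split("."))
    return table[total % len(table)]


if __name__ == "__main__":
    addr = "147.132.39.57"
    print(select_by_last_number(addr))
    print(select_by_sum(addr))
```

The modulo operation reduces any input number to a valid index into the table, whatever the number of algorithms available.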
In this way, the selection of the splitting algorithm can be made location-specific. Additional protection could be provided by use of public and private key encryption techniques which are widely known. One approach would be to encrypt each split file individually.
Application based splitting
This technique uses the above file splitting techniques to implement access control for multimedia files.
This splitting could be based upon algorithmic streaming, for example using an interleaving scheme of 1:n, where n ≫ 1. This would result in file splitting as below:
00 01 02 03 ... n-1 FF n n+1 n+2 ... 2n-1 FE 2n ... 3n-1 FD 3n ...
into two files containing:
File 1: 00 01 02 ... n-1 n n+1 n+2 ... 2n-1
File 2: FF FE FD
The two files created from this operation are vastly different. File 1 has the majority of the bits of the original file in it as shown above. Only one in n packets is in the second file so it is far smaller.
If an appropriate value of n is chosen then File 1 will have almost all the content but the removal of every 1 in n or every 1 in (n+1) bits will mean that it has sufficient bit errors to reduce the quality of the content (e.g. clicks in audio, observable bit errors in video).
File 1 can be freely distributed without encryption because it is low quality in the sense that it appears to contain many bit errors, and this suits its distribution through less expensive forms (in terms of cost, bandwidth, latency to the end user), which offer lower security, (e.g. CD, broadcast, multicast). File 2 is effectively a "key" for file 1. Its distribution will be higher protection (authentication, encryption) and/or through a more secure transmission (unicast). It is low rate and hence any heavyweight security required for this file will not be so prohibitive in terms of bandwidth, processing cost and so on. Furthermore if the correct value for n is chosen then there will be enough apparent bit errors in File 1 for it to be low quality whilst still being a stream that can be previewed without File 2. This would be particularly useful in video distribution.
The above example illustrates the main concept of application based splitting. That is, trying to take very small amounts of data out of the original file to sufficiently reduce its quality for there to be "degradation".
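The uneven split described above can be sketched as follows, on the assumption that n bytes are sent to the preview file (File 1) for every one byte sent to the key file (File 2). The function names `split_preview_key` and `combine_preview_key` are hypothetical, and bytes are used in place of packets for simplicity.

```python
def split_preview_key(data: bytes, n: int) -> tuple[bytes, bytes]:
    """Send n bytes to the preview file, then 1 byte to the key file, repeating."""
    if n < 1:
        raise ValueError("n must be at least 1")
    preview, key = bytearray(), bytearray()
    step = n + 1
    for i in range(0, len(data), step):
        block = data[i:i + step]
        preview += block[:n]   # the majority of each block
        key += block[n:]       # the single withheld byte, if present
    return bytes(preview), bytes(key)


def combine_preview_key(preview: bytes, key: bytes, n: int) -> bytes:
    """Reinsert one key byte after every n preview bytes."""
    if n < 1:
        raise ValueError("n must be at least 1")
    out = bytearray()
    blocks = -(-len(preview) // n)  # ceiling division
    for b in range(blocks):
        out += preview[b * n:(b + 1) * n]
        out += key[b:b + 1]
    return bytes(out)


if __name__ == "__main__":
    original = bytes(range(100))
    preview, key = split_preview_key(original, 9)
    print(len(preview), len(key))
    print(combine_preview_key(preview, key, 9) == original)
```

With n = 9 the key file holds one tenth of the data; larger values of n make the key file smaller still, at the cost of removing less from the preview.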
An alternative to using algorithmic based splitting is a content-based degradation system. In this type of system the choice of the small number of bits for File 2 is taken on the basis of significant coding frames (in video and audio this could be intra blocks or other control headers), significant streams (MPEG-4 has multiple streams; degrading a particular stream, say the foreground stream, would be sufficient) or significant content (relevant frames, area of an image).
This type of splitting enables access control of premium content. It could be used in conjunction with other technologies to completely secure the media (e.g. digital watermarking might be additionally implemented to allow further copy protection). Where this scheme excels is in access of media such as video or audio over networks. An unauthorised passive viewer trying to watch a complete file will be put off by what appears to be excessive bit erroring in the stream.
An example implementation of application based file splitting carries out splitting in real time on a video stream over an IP network. Using interleaving techniques, the high rate stream (File 1 above) is sent through more efficient multicast, which offers lower security. The lower rate stream is sent over the high overhead but higher security unicast stream (File 2 above). This stream is the one which will provide access control whilst the multicast stream will enable previewing for all viewers.
It will be understood by those skilled in the art that the apparatus that embodies the invention could be a general purpose device having software arranged to provide an embodiment of the invention. It could be a single device or a group of devices and the software could be a single program or a set of programs. Furthermore, any or all of the software used to implement the invention can be contained on various transmission and/or storage mediums such as a floppy disc, CD-ROM, or magnetic tape so that the program can be loaded onto one or more general purpose devices or could be downloaded over a network using a suitable transmission medium.
Unless the context clearly requires otherwise, throughout the description and the claims, the words "comprise", "comprising" and the like are to be construed in an inclusive as opposed to an exclusive or exhaustive sense; that is to say, in the sense of "including, but not limited to".