WO2011007141A1 - Distributed storage - Google Patents

Distributed storage

Info

Publication number
WO2011007141A1
WO2011007141A1
Authority
WO
WIPO (PCT)
Prior art keywords
data
separate storage
storage locations
subsets
available
Prior art date
Application number
PCT/GB2010/001346
Other languages
French (fr)
Inventor
Yerkin Zadauly
Chokan Laumulin
Iskender Syrgabekov
Original Assignee
Extas Global Ltd.
Priority date
Filing date
Publication date
Application filed by Extas Global Ltd.
Publication of WO2011007141A1

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00Network arrangements or protocols for supporting network services or applications
    • H04L67/01Protocols
    • H04L67/06Protocols specially adapted for file transfer, e.g. file transfer protocol [FTP]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5011Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resources being hardware resources other than CPUs, Servers and Terminals
    • G06F9/5016Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resources being hardware resources other than CPUs, Servers and Terminals the resource being the memory
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00Network arrangements or protocols for supporting network services or applications
    • H04L67/01Protocols
    • H04L67/10Protocols in which an application is distributed across nodes in the network
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00Network arrangements or protocols for supporting network services or applications
    • H04L67/50Network services
    • H04L67/52Network services specially adapted for the location of the user terminal

Abstract

Method and apparatus for securely storing data comprising the steps of receiving one or more communications identifying a plurality of tested and available remote separate storage locations. Dividing the data into a plurality of data subsets. Allocating each one of the plurality of data subsets to a different one of the plurality of available remote separate storage locations. Transmitting and storing each one of the data subsets at the allocated available remote separate storage location. Transmitting to one or more of the available remote separate storage locations information specifying the allocation of each of the data subsets. A method and apparatus for retrieving data comprising the steps of receiving one or more communications identifying a plurality of tested and available remote separate storage locations. Retrieving from one or more of the available remote separate storage locations information specifying the allocation on the remote separate storage locations of a plurality of data subsets forming the data. Retrieving the data subsets from allocated remote separate storage locations. Combining the retrieved data subsets to form the data. A method and apparatus for allocating separate storage locations comprising the steps of testing the availability of each of a plurality of separate storage locations. Recording the results of the tests together with identifiers of each separate storage location. Transmitting one or more communications identifying the available separate storage locations.

Description

DISTRIBUTED STORAGE
Field of the Invention
The present invention relates to a method and system for storing and retrieving data.
Background of the Invention
Organisations require reliable and robust data storage facilities within which to store large quantities of data. These data must also be stored securely, especially transactional data that may be common in financial institutions.
As data requirements increase, dedicated data centres may meet these requirements. Very large organisations may build their own data centres and smaller organisations may lease portions of data centres or resources within them. In any case, the overheads involved in the upkeep of data centres may become very large. These costs may include power, heating, maintenance and hardware costs. Although data centres can be designed to provide secure and reliable facilities for organisations, their cost may become
prohibitive. Furthermore, as data requirements increase, so too must the size and complexity of data centres to support this growth.
Smaller organisations that cannot afford their own dedicated data centres must place their trust in third-party suppliers of the data resource. This trust ultimately depends on the integrity of individuals, who may be custodians over critical data. This situation may not be optimal for certain organisations or particular data types. To some extent these requirements may be met by online storage facilities.
The Wuala system (http://www.wuala.com/en) provides online storage facilities relying on encryption to provide secure storage. The Wuala software encrypts data files on a user's machine and then sends the encrypted files across the Internet. Wuala's storage includes portions of users' unused hard disks.
WO 2007/120437 describes an algorithm for dividing data into data slices. The original data may be reconstructed from a subset of these slices provided that the subset comprises more than a required number of slices. The slices are stored on storage nodes. A metadata management system stores metadata (including a description of the way in which data are dispersed amongst different storage nodes). Therefore, access to the metadata management system enables access to the original data, which may not be appropriate for sensitive data.
Therefore, there is required a data storage system that overcomes these problems.
Summary of the invention
Against this background, and from a first aspect, the present invention resides in a method of securely storing data comprising the steps of:
receiving one or more communications identifying a plurality of tested and available remote separate storage locations;
dividing the data file into a plurality of data
subsets; allocating each one of the plurality of data subsets to a different one of the plurality of available remote
separate storage locations;
transmitting and storing each one of the data subsets at the allocated available remote separate storage location; and
transmitting to one or more of the available remote separate storage locations information specifying the allocation of each of the data subsets. In other words, the users can retain information regarding where each data subset is stored and such information or metadata may be necessary to retrieve and recombine the data subsets. This separation of information can enhance security, whilst benefiting from distributed and remote storage. The method may be carried out by a client of a distributed storage system.
The data may be a data file, digital data, content, document or documents or any other computer readable or transmittable material.
Optionally, the method steps may be carried out on a single machine, computer or group of computers connected to a network such as an intranet or the Internet. The
information specifying the allocation of each of the data subsets or metadata may be stored locally, i.e. on the same machine that carries out each method step or within a group of machines or computers tasked to execute the method. The information specifying the allocation of each of the data subsets may be stored in a number of ways. For example, this information may be stored as a simple file or as an entry in a database. Alternatively, the information may be divided into metadata subsets such that they may be
regenerated from an incomplete set of the metadata subsets by using parity data and a suitable logical function, e.g. XOR. These metadata subsets and parity data may be further divided into further subsets and further parity data to increase redundancy and safety in case of data corruption.
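By way of illustration only, with a single parity block computed as the XOR of the metadata subsets, any one missing subset can be regenerated from the remaining subsets and the parity data. The Python sketch below assumes equal-length subsets and illustrative contents; it is not a prescribed format.

    def xor_bytes(a: bytes, b: bytes) -> bytes:
        """Byte-wise XOR of two equal-length byte strings."""
        return bytes(x ^ y for x, y in zip(a, b))

    # Three metadata subsets of equal length (contents are illustrative).
    subsets = [b"meta-000", b"meta-001", b"meta-002"]

    # The parity block is the XOR of all subsets.
    parity = subsets[0]
    for s in subsets[1:]:
        parity = xor_bytes(parity, s)

    # Suppose subset 1 is lost; XORing the parity with the surviving subsets
    # regenerates it, because XOR is its own inverse.
    recovered = parity
    for i, s in enumerate(subsets):
        if i != 1:
            recovered = xor_bytes(recovered, s)

    assert recovered == subsets[1]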
As a further alternative, the metadata (or first metadata) may be divided and stored in a similar way to that of the data, i.e. allocated as subsets to remote separate storage locations. The received communication may further specify which remote separate storage locations are to be used for data and which are to be used for storing first metadata. Additional, further or second metadata or
information may be generated to specify how the first metadata is allocated amongst the remote separate storage locations. This second metadata may also be stored locally on the machine, computer or group of computers that carries out the method or stored in another way or may act as a key especially when encrypted.
Optionally, the available remote storage locations may be identified by Internet protocol, IP, addresses. This may be in the form of a list of IP addresses.
Optionally, the plurality of data subsets may be allocated to the remote separate storage locations randomly. This may improve utilisation of the storage locations and may further enhance security by making it more difficult to predict or guess where data are stored.
Optionally, the plurality of data subsets may be allocated to a sub-group of the plurality of available remote separate storage locations. Therefore, any central server or administrator may not be able to determine where specific data subsets are stored, further enhancing
security.
Preferably, the remote separate storage locations are physically separate storage locations. This improves reliability by reducing the risk of multiple storage locations becoming unavailable at the same time due to failure of a particular facility.
Preferably, the receiving and transmitting steps occur over a network.
Preferably, the network may be selected from the group consisting of: an intranet, the Internet, a wide area network and a wireless network. These network types allow different types of storage locations to be utilised.
Optionally, the remote separate storage locations may be selected from the group consisting of personal computer, hard disk drive, optical disk, FLASH RAM, web server, FTP server and network file server. Other storage types may be used.
Optionally, the method may further comprise the step of testing a connection to each remote separate storage
location. This improves reliability by monitoring
availability, especially over a network. The testing of a connection may be carried out by a device or server remote (or on another part of a network or the Internet) from the client or user storing the data. In other words, the storage method and testing steps may be carried out
separately and independently so that the device or server that carries out the testing steps may not or cannot access the stored data and does not have enough information to retrieve or read the stored data.
Optionally, the method further comprises the step of maintaining a list of storage identifiers corresponding to remote separate storage locations that pass the connection test. This may also be carried out by a separate device or server from that of the client or user device that stored data in the remote separate storage locations.
Preferably, the received one or more communications may be generated from the maintained list. Optionally, the list may be stored in a database or in other form.
Optionally, the method may further comprise encrypting each data subset and/or the information specifying the allocation of each of the data subsets prior to
transmission. This further improves security as the data may only be decrypted with a key stored by the user. The data may also be encrypted before division.
Optionally, the information specifying the allocation of each of the data subsets may be transmitted in the form of a file allocation table, FAT, file. Therefore, this information may be in a standardised format.
Optionally, the data or data file may be recoverable from a subset of the allocated remote separate storage locations and the method further comprises the steps of: receiving a further communication identifying a
plurality of available remote separate storage locations; and
if more than a predetermined number of the allocated remote separate storage locations are missing from the further communication or indicated as being no longer available then retrieving the data subsets from available allocated remote separate storage locations identified in the further communication, recreating the data from the retrieved data subsets, and repeating the dividing,
allocating and both transmitting steps on the recreated data using the available remote separate storage locations identified in the further communication. Therefore, the recoverability of the data may be increased as remote separate storage locations become unavailable or as further locations become available.
Optionally, dividing the data or data file into a plurality of data subsets further comprises the steps of: a) separating the data into a plurality of separated subsets;
b) generating parity data from the plurality of separated subsets such that any one or more of the plurality of separated subsets may be recreated from the remaining separated subsets and the parity data; and
c) repeating steps a and b on each of the plurality of separated subsets and parity data providing the data subsets consisting of further separated subsets and further parity data. This is one way of dividing the data and improves recoverability, redundancy and security.
Optionally, step c) may be repeated for each of the plurality of data subsets and parity data. Therefore, the data may be cascaded further increasing reliability.
Optionally, the data are separated bit-wise.
Alternatively, the data may be separated byte-wise.
Optionally, the parity data are generated by performing a logical function on the plurality of data subsets.
Optionally, the logical function is an exclusive OR. The XOR function is particularly computationally efficient.
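A minimal sketch of one such dispersal step in Python, assuming a byte-wise separation into a fixed number of subsets and a single XOR parity block per level; the subset count, padding scheme and recursion depth are illustrative assumptions rather than details taken from the description.

    def disperse(data: bytes, n: int = 3) -> list[bytes]:
        """Separate data byte-wise into n subsets and append one XOR parity
        block, so that any single missing subset can later be rebuilt from
        the remaining subsets and the parity data (steps a and b)."""
        pad = (-len(data)) % n
        data += b"\x00" * pad                      # illustrative padding scheme
        subsets = [data[i::n] for i in range(n)]   # byte-wise separation
        parity = bytes(len(data) // n)
        for s in subsets:
            parity = bytes(x ^ y for x, y in zip(parity, s))
        return subsets + [parity]

    def cascade(data: bytes, depth: int = 2, n: int = 3) -> list[bytes]:
        """Repeat the separation on each subset and parity block (step c)."""
        pieces = disperse(data, n)
        if depth <= 1:
            return pieces
        out: list[bytes] = []
        for piece in pieces:
            out.extend(cascade(piece, depth - 1, n))
        return out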
Optionally, the method may further comprise the step of recording where on the available remote separate storage locations the information specifying the allocation of each of the data subsets is stored. This information may be stored with the user and may also be used as the starting point for retrieving data for the remote separate storage locations. In other words, the user may store information that records where a FAT file or equivalent information is stored on the separate remote storage locations and may further record how this FAT file or equivalent is itself divided into data subsets and where and how it may be retrieved and recovered. This information may be fairly brief and therefore small and more portable but may act as a key for recovering much larger data.
According to a second aspect there is provided a method of retrieving data comprising the steps of:
receiving one or more communications identifying a plurality of tested and available remote separate storage locations;
retrieving from one or more of the available remote separate storage locations information specifying the allocation on the remote separate storage locations of a plurality of data subsets forming the data;
retrieving the data subsets from allocated remote separate storage locations; and
combining the retrieved data subsets to form the data. This method may be used to recover or retrieve stored data according to the first aspect.
Optionally, the information specifying the allocation on the remote separate storage locations may be in the form of a file allocation table, FAT, file. Other formats may be used.
Optionally, the retrieving the data subsets step may further comprise the steps of:
a) retrieving parity data from the separate storage locations;
b) recreating any missing data from the retrieved data subsets and parity data to form the recreated data; c) combining the retrieved data subsets and any recreated data to form a plurality of consolidated data sets, wherein the plurality of consolidated data sets include further data and further parity data; and
d) recreating any missing further data from the further data and further parity data to form recreated further data; and e) combining the further data and any recreated further data to form the original data. This improves resilience to loss of data that may be due to missing, failed or otherwise unavailable separate remote storage locations used to store the data subsets.
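Purely for illustration, recombination of one level of the cascade produced by the earlier sketch might look as follows; the single-parity assumption and the byte-wise interleaving mirror that sketch, and removal of any padding is omitted for brevity. At the outer level the same routine would be applied again to the consolidated data sets.

    def rebuild_level(pieces: list) -> bytes:
        """Recombine one level of the cascade.  `pieces` holds the n data
        subsets followed by the parity block, in order; at most one entry
        may be None (missing or unavailable)."""
        *subsets, parity = pieces
        if any(s is None for s in subsets):
            # Regenerate the single missing subset from the parity block
            # and the surviving subsets.
            missing = subsets.index(None)
            rebuilt = parity
            for i, s in enumerate(subsets):
                if i != missing:
                    rebuilt = bytes(x ^ y for x, y in zip(rebuilt, s))
            subsets[missing] = rebuilt
        # Undo the byte-wise interleaving used when the data were separated.
        n, size = len(subsets), len(subsets[0])
        out = bytearray(n * size)
        for i, s in enumerate(subsets):
            out[i::n] = s
        return bytes(out)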
In accordance with a third aspect there is provided a method of allocating separate storage locations comprising the steps of:
testing the availability of each of a plurality of separate storage locations;
recording the results of the tests together with identifiers of each separate storage location; and
transmitting one or more communications identifying the available separate storage locations. This allows storage locations to be allocated based on reliability and may also improve the recoverability of data and storage by
continually or repeatedly testing availability. Preferably, the testing is carried out by a device, server or
Consolidator or group of such devices that is remote, separate and independent from devices or users that store data on the separate storage locations. This separation, especially over a network or the Internet, provides
additional security for the users' data, limiting its retrieval to the device that stored it or to a device with access to information describing how the data were stored.
Preferably, the testing the availability step may be repeated at intervals. The intervals may be for instance, hourly or daily.
Optionally, the method may further comprise the step of receiving an identifier of a further available separate storage location to include in the testing of availability. Therefore, newly available storage locations may be added or re-added as they appear or their owners provide new storage locations to the system.
In accordance with a fourth aspect there is provided a system for securely storing data comprising:
a plurality of separate storage locations;
an availability tester arranged to test the
availability of each of the plurality of separate storage locations, record the results of the tests together with identifiers of each separate storage location and transmit one or more communications identifying the tested and available separate storage locations; and
a data manager arranged to receive the one or more communications identifying the tested and available separate storage locations, divide the data into a plurality of data subsets, allocate each one of the plurality of data subsets to a different one of the available separate storage locations, transmit and store each one of the data subsets at the allocated available separate storage location, and transmit to one or more of the available separate storage locations information specifying the allocation of each of the data subsets.
Optionally, the data manager may be further arranged to retrieve the information specifying the allocation of each of the data subsets, retrieve the data subsets from the allocated separate storage locations and combine the retrieved data subsets to form the data.
The methods may be embodied as computer programs or programmed computers or stored on a computer-readable medium or as a transmitted signal.
Brief description of the Figures
The present invention may be put into practice in a number of ways and embodiments will now be described by way of example only and with reference to the accompanying drawings, in which:
FIG. 1 shows a schematic diagram of a system for storing data, given by way of example only;
FIG. 2 shows a schematic diagram of the system of Fig. 1 in greater detail;
FIG. 3 shows a portion of the system of Fig. 2 in further detail;
FIG. 4 shows a schematic diagram of a portion of the system shown in Fig. 2 in further detail;
FIG. 5 shows a portion of a database schema used in the system of Fig. 1;
FIG. 6 shows a schematic diagram illustrating a method for storing data on a system of Fig. 1;
FIG. 7 shows a schematic diagram of the method steps used to distribute data on the system of Fig. 1;
FIG. 8 shows a schematic diagram of a portion of the system of Fig. 1 in further detail;
FIG. 9 shows a schematic diagram of a portion of a system of Fig. 1 including components used to test storage locations;
FIG. 10 shows a flow diagram of a method used to retrieve and maintain data within the system of Fig. 1;
FIG. 11 shows a partial database schema used by the system of Fig. 1; and
FIG. 12 shows a further partial database schema used by the system of Fig. 1.
It should be noted that the figures are illustrated for simplicity and are not necessarily drawn to scale.
Detailed description of the preferred embodiments
Fig. 1 is a schematic diagram of a storage system for storing and retrieving data. The storage system 10 allows a user to send and retrieve data from a user computer 20 to one or more remote separate storage locations 50.
A central server 30 monitors and tests the availability and connectivity of the separate storage locations 50 and records this information in a database 40. The separate storage locations 50 are identified by their internet protocol (IP) addresses. These IP addresses may also be stored in the database 40. For instance, the central server 30 may monitor the period that a storage location 50 is available and the type and speed of connection that may be made to each storage location 50 at any particular time. The testing may be carried out at intervals.
The separate storage locations 50 may be individual and separate personal computers each having a hard drive 60 or other storage device. Within each hard drive 60 there may be an allocated portion 70 made available to the storage system 10. The allocated portion 70 may be partitioned or otherwise made separate from the remaining portion of the hard drive 60 and so unavailable for use by the particular personal computer except for the purpose of remote storage.
The central server 30 sends a message or communication across a network link 90 to the user computer 20. The message contains a list of the IP addresses of each
available separate storage location 50. Therefore, the user computer 20 receives a message containing the IP addresses or other identifier of each available remote separate storage location 50 to which it may send and retrieve data. A software application 99 running on the user computer 20 divides any data to be stored into separate data sets. Each of these subsets of data may be transmitted across a data link 100 to the allocated portion 70 of the hard drive 60 of a particular personal computer (or other computer or server) making up the separate storage locations 50.
The software application running on the user computer 20 may determine which particular data subsets to store on each available separate storage location 50. It may not be necessary to use all of the available separate storage location 50 detailed in the communication sent from the central server 30, especially if the data to be stored is smaller than the available storage space from the total number of the separate storage locations 50. The data subsets may be allocated to the separate storage locations 50 randomly, for instance. The software application may also record information identifying which data subsets were stored on which particular available remote storage
locations 50. This information may be stored in the form of a File Allocation Table, FAT, file or in another suitable format. The software application may provide the central server 30 with an indication of its data requirements including size and reliability or availability level.
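As a sketch of this client-side step only (the FAT-style record, the address values and the random placement policy are illustrative assumptions), the allocation and its record might be built as follows:

    import random

    def allocate(subsets: list[bytes], locations: list[str]) -> dict[str, list[int]]:
        """Randomly allocate each data subset to one of the available storage
        locations (identified here by IP address) and return a mapping from
        location to the indices of the subsets stored there."""
        allocation: dict[str, list[int]] = {}
        for index, _ in enumerate(subsets):
            ip = random.choice(locations)          # random placement
            allocation.setdefault(ip, []).append(index)
        return allocation

    # Example: three subsets spread over locations named in the communication
    # from the central server (addresses are illustrative).
    fat_record = allocate([b"pkg1", b"pkg2", b"pkg3"],
                          ["192.0.2.10", "192.0.2.11", "192.0.2.12"])

Such a record, or an equivalent FAT-style file, is what the client would later consult to retrieve the subsets.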
The data to be stored may be divided in such a way that each individual data subset cannot be used to recreate the original data without a minimum number of data subsets.
Furthermore, the data may be divided into data subsets such that the loss of any particular data subset may be tolerated without resulting in a total loss of original data.
The network and data connections may be made across an extended network 110 such as an intranet or the Internet, for instance. The availability and reliability of each separate storage location 50 may be tested at regular intervals and as further separate storage locations 50 become available, their details and IP addresses may be added to the database 40 and updated messages may be sent to the user computer 20 as this information changes.
Furthermore, the integrity of the stored data does not require all separate storage locations 50 to be available at the same time. Therefore, the reliability of the entire data storage system 10 does not require highly available storage locations, such as those provided by a data centre.
The owners of individual personal computers or perhaps computers on a corporate network may make the allocated portion 70 of their own internal hard drives 60 available to the system 10 in return for benefits, payments or the use of other services. These owners may sign up through Internet sites, social networking sites or by running downloaded software, for instance. To enhance security further, the data subsets may be encrypted before being sent over the data network 100 to the separate storage locations 50.
Retrieval of the original data from the separate storage locations 50 may be carried out by retrieving each data subset (or at least the minimum number required to
regenerate the data). Once retrieved, the data subsets may be recombined within the user computer 20. Any missing data subsets, perhaps due to the unavailability of
particular personal computers, may be recreated from the remaining data subsets, if necessary using a suitable algorithm. Therefore, reliability and security may be maintained within the data storage system 10 without the need for dedicated data centres. The FAT file may be used to retrieve and regenerate the data. Furthermore, the FAT file may also be stored on one or more of the remote
separate storage locations 50. The user computer 20 may act as an administrator for a group of computers on a network so that the stored data may be gathered from more than one computer.
The personal computers making up the remote separate storage locations 50 may be left on and connected to the Internet 110 for extended periods, resulting in increased availability as monitored by the central server 30. Should a separate storage location 50 fail an availability test a
predetermined number of times or not be available for a predetermined period then testing of that particular
separate storage location 50 may cease and its IP address may be removed from the database 40 or marked as
unavailable. The owner of that particular removed separate storage location 50 may be notified of the situation
allowing them to improve the availability of this particular location so that the particular remote separate storage location 50 may be returned to the list of available
locations at a future time.
Although Fig. 1 shows a single user computer 20 and three separate storage locations 50 there may be many more of each. Furthermore, there may be multiple central servers 30 each allocated to a group of user computers 20 and/or separate storage locations 50.
The central server 30 may allocate to each user computer 20 a subset of IP addresses relating to a subset of the available separate storage locations 50. The number of allocated IP addresses may be proportional to the storage requirements of the user computer 20 with more IP addresses allocated for higher usage user computers 20.
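One plausible reading of this proportional allocation, sketched in Python with an assumed headroom factor and an assumed per-location capacity (neither value is specified in the description):

    import math

    def addresses_for(storage_needed_gb: float, per_location_gb: float,
                      available_ips: list[str], headroom: float = 1.5) -> list[str]:
        """Pick a number of storage-location addresses roughly proportional to
        a client's storage requirement, with some headroom for redundancy."""
        needed = max(1, math.ceil(storage_needed_gb * headroom / per_location_gb))
        return available_ips[:needed]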
Each allocated portion 70 may be of a different size and information regarding the total capacity, the used capacity and/or the remaining capacity may be determined and stored in the database 40 of the central server 30.
Therefore, allocation of storage may also be made based on the available storage space for each remote separate storage location 50. The owners of the remote separate storage locations 50 may change the amount of storage space
allocated and these changes may be communicated to the central server 30.
Users of the service may include home users and
corporate users. Home users may require simple backup of files or convenient and secure file sharing. Home users may both share space and may use the distributed storage space. The central server 30 may administer different security or reliability ratings required by each user. This may result in home users, for instance, being provided with lower security and reliability levels than those offered to corporate users. This may allow different types of
dispersal and retrieval algorithms to be executed by
different types of users.
Corporate Users may be companies needing secure
storage. This type of user may require storage with higher rating and reliability to ensure security of data retrieval and a higher quality network. Furthermore, corporate users may also have the option to use an entirely isolated and dedicated service, i.e. a full network of dedicated remote separate storage locations 50 and one or more dedicated central servers 30 also known as Consolidators. Each
Consolidator may be a single central server 30 or group of servers. The system may comprise one or more Consolidators 30 for load balancing purposes. Dedicated Consolidators 30 may be provided for a particular purpose or group of users. A complete system of Consolidators 30, remote storage and clients may be replicated for this type of user and they may also have a dedicated intranet to link these components. Fig. 2 shows a schematic diagram of a simplified but extended system to that shown in Fig. 1. Fig. 2 shows three user computers 20 of different types. User computer or domain 120 represents a home user; domains 130 and 140 represent company users. Both domain types may operate on servers 98 or other computers. The requirements for the home user 120 may be relatively small and those of the company users 130, 140 may be higher. In each case, the users or clients may use storage space from other users and may optionally provide space to others.
The operation of the system may be explained by
reference to an example scenario. A user of the system may wish to save a particular file. In order to do this, the user may use the software application 99 to upload the file over the Internet 110. The first step is authentication (via a connection indicated by a dotted line between the user computer 20 and the Consolidator or central server 30) and this is shown schematically in further detail in Fig. 3. Data are transmitted as data subsets through the Internet 110 as indicated by the solid lines. The Consolidator 30 tests connections and availability of the home user and company domains 130, 140 used to provide the separate storage locations 50 using further Internet transmissions as shown by dashed lines.
The client domain 120 sends a username (User), password (PWD) and GUID to the central server or Consolidator 30 for authentication.
Actors in the system may refer to each other with a GUID that is a unique ID assigned to each installation of the software application. This may be independent of user and IP address. The IP address of the actor may be used during a procedure, when an IP Address is needed in a communication. A remote storage location 50 can change its IP address, for instance, especially where dynamically allocated IP addresses change whenever a home user
disconnects and re-connects to the Internet. However, the GUID will remain constant.
Following successful authentication the Consolidator 30 may send the user a communication or storage list containing a list of available remote storage locations 50. This may be in the form of a list of IP addresses and may be
subdivided into two groups: the first group may be the Data IP Collection list (DAT), i.e. a list of storage locations where users' data may be stored. The second list may be the FAT IP Collection (FAT), i.e. where the FAT files may be stored.
The FAT IP Collection may contain a list of all IP Addresses of remote storage locations 50 where the client can find a list of uploaded files and other information to recreate its virtual directory tree. Therefore, the user may use the system from different PCs and can retrieve all its uploaded files with the directory structure created. With this arrangement, no sensitive information needs to be stored at the Consolidator 30. The FAT file may be saved on the system in a similar way to other data files. Therefore, the FAT file can also be stored securely. Furthermore, the FAT file may be written or encoded so that only the client can read it.
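The description does not fix a wire format for this storage list; purely as an illustration, it might be rendered as a structure like the following (the field names, GUID and addresses are assumptions):

    # Hypothetical rendering of the storage list a Consolidator might send
    # after authentication: one group of location identifiers for data
    # subsets (DAT) and one for FAT-file portions (FAT).
    storage_list = {
        "guid": "3f2504e0-4f89-11d3-9a0c-0305e82c3301",     # client installation GUID
        "dat": ["192.0.2.21", "192.0.2.22", "192.0.2.23"],  # Data IP Collection
        "fat": ["192.0.2.31", "192.0.2.32"],                # FAT IP Collection
    }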
Fig. 4 illustrates schematically how the client may retrieve and reconstruct information forming the FAT file. In summary, the client domain 120 is provided with a list (storage list Uid/IP) of available remote storage locations 50 in the form of IP addresses 1. The client domain 120 retrieves from the available remote storage locations portions of data that may be combined to form a recreated FAT file. The retrieval of data subsets is indicated schematically by arrows 2. The data subsets are combined or consolidated using an algorithm 3 to form a complete data set, regenerating, if required, any missing or damaged data subsets from retrieved parity data and available data subsets. The dispersal algorithm is described in further detail with reference to Fig. 7. These portions may be encrypted and so decryption 4 may be required before the FAT file may be used. The user may store information specifying how and where the FAT file is stored.
Fig. 5 shows a schematic diagram of various tables of a database schema used in the FAT file reconstruction process. tb_Storages contains the IP addresses of all installations or domains providing allocated portions 70 of available storage space. tb_Users contains all registered users.
tb_LNK_Users_Storages contains information about which user is using a storage and the scope of that storage (i.e. for files and/or FAT). Following a successful authentication, a Consolidator 30 generates a list of IP addresses to send to a client domain 120 allowing retrieval of the portions of that user's particular FAT file. Other information may be stored in these database tables.
Should a hacker impersonate a user in order to obtain the FAT file information they will be prevented from doing so, as at least parts of the FAT file may be encrypted. To decrypt the FAT file (and any or all other files that may be retrieved) a decryption key is required. Alternatively, the information specifying how and where on the separate storage locations the FAT file is stored may be required before the FAT file can be reconstructed.
The decryption key may be a passkey string, a biometric signature or other suitable data. The decryption passkey should preferably be strong and not saved on any part of the system. Therefore, a user may need to remember the passkey, which itself may be used to retrieve an encrypted file with a further hardcoded key. Under one example implementation ten hardcoded keys may be saved as one or more temporary files (which may be viewed as cookies). The file may be encrypted using one of these ten hardcoded keys chosen at random. To decrypt the file, the client may try all ten hardcoded keys. Nine out of ten hardcoded keys will fail but the user will not be aware of this. However, such a procedure makes it more difficult for a hacker to obtain or generate the hardcoded keys. In other words, the FAT file may be encrypted using a different hardcoded key each time it is generated or changed.
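A sketch of this trial-decryption idea using the Fernet recipe from the Python cryptography package; the choice of cipher, the key count of ten and the key generation shown here are illustrative assumptions, not details given in the description.

    import random
    from cryptography.fernet import Fernet, InvalidToken

    # Ten hardcoded keys held by the client application (generated here only
    # for the sake of the example; a real client would ship with fixed,
    # obfuscated keys).
    HARDCODED_KEYS = [Fernet.generate_key() for _ in range(10)]

    def encrypt_fat(fat_bytes: bytes) -> bytes:
        """Encrypt the FAT file with one of the ten keys chosen at random."""
        key = random.choice(HARDCODED_KEYS)
        return Fernet(key).encrypt(fat_bytes)

    def decrypt_fat(token: bytes) -> bytes:
        """Try every hardcoded key; nine fail silently, one succeeds."""
        for key in HARDCODED_KEYS:
            try:
                return Fernet(key).decrypt(token)
            except InvalidToken:
                continue
        raise ValueError("no hardcoded key matched")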
Security may be further enhanced by the use of
obfuscation of code. The hardcoded keys may be generated using a hardware-based procedure. Therefore, the client may recreate a key every time without hardcoded (and so readable) keys. Obfuscation provides additional security against hackers. The FAT file may be a flat text file containing information in XML format, for instance.
Fig. 6 shows schematically the procedure for uploading a file or data. Uploading a file may require successful login into the system. When a user attempts to upload a file, he has to indicate which file he wants to upload from his machine (or from any reachable location excluding perhaps Internet locations). The client application then starts a series of operations. The first operation may be to encrypt the file. Encryption may or may not be used.
After file selection, the client application may start to encrypt the file with a given key from the user (for instance using a string or a biometric reading). Then it starts the dispersal algorithm in order to divide the data into data subsets, PKG1, PKG2, PKG3, etc. The dispersal algorithm provides a method to split the file into several slices.
Fig. 7 shows schematically the operation of the example dispersal algorithm. Each file may be split into three data subsets and each data subset may then be split a further three times, and so on until the number of required data subsets has been reached or each data subset has reached a particular size. Greater division provides additional security and reliability for recreating the original file.
After receiving a communication containing the list of available storage locations 50 the client application may decide where to store each data subset. If the storage locations are not sufficient to store each data subset (or perhaps if a remote separate storage location is no longer available) then a new IP address representing a further storage location 50 may be required. Further separate remote storage locations may be included in a further communication received from a Consolidator 30. In this case, the client application may inform the Consolidator 30 that another IP address will be used.
This operation may be considered as a single logical transaction. Therefore, when a storage location 50 (through its client application) receives a data subset or slice it may need to wait for a signal that all other slices are stored and secured. Failure to receive such a signal after a time out or predetermined period may then indicate that the data subset should be deleted so that a further attempt may be made. The Consolidators 30 are otherwise not used during this procedure.
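Seen from the storage location's side, one way such provisional storage might behave is sketched below; the time-out value, the class shape and the name of the commit signal are assumptions.

    import threading

    class ProvisionalSlice:
        """Hold a received data subset until the sender signals that every
        slice of the logical transaction has been stored and secured, or a
        time-out expires."""

        def __init__(self, slice_id: str, payload: bytes, timeout_s: float = 300.0):
            self.slice_id = slice_id
            self.payload = payload
            self.committed = False
            # If no commit arrives within timeout_s, delete the slice so that
            # a further storage attempt may be made.
            self._timer = threading.Timer(timeout_s, self.rollback)
            self._timer.start()

        def commit(self) -> None:
            """Called when the 'all slices stored and secured' signal arrives."""
            self.committed = True
            self._timer.cancel()

        def rollback(self) -> None:
            if not self.committed:
                self.payload = b""   # discard the provisionally stored data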
Successful completion of the logical transaction may further require generation of the FAT file and completion of the FAT file transmission. Otherwise, a complete rollback may be required, such that all data subsets may require deletion if a traceable and retrievable FAT file cannot be confirmed.
Fig. 8 shows a schematic diagram of a data retrieval process. This is similar to the retrieval of the FAT file described with reference to Fig. 4. However, instead of portions of the FAT file, data subsets may be retrieved from the separate storage locations 50.
Fig. 9 shows schematically how the list of available storage locations 50 is compiled and maintained. Operations are carried out to update data monitoring the reliability of client applications associated with separate storage
locations 50. Alive signals or requests may be propagated to each actor, e.g. each client providing a storage location 50. At a scheduled time, each of these clients may send a signal or communication to a Consolidator 30 informing that they are still alive and online. The Consolidator 30 may receive this signal from a group of clients and send
acknowledgments, accordingly.
The clients may receive these communications or
messages and respond to Consolidators 30, accordingly. This procedure allows Consolidators 30 to test the speed and reliability of the remote storage locations 50. An
historical log of these performances may be preserved in the database 40 of a Consolidator or central server 30. Clients that fail such a test, e.g. by not responding within a predetermined period may be deleted from the database 40 or marked as unavailable. Such clients may also be excluded from the list of available storage locations in any
communications transmitted from a Consolidator 30. Clients may also be added or re-added should they become available and pass the alive test.
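A minimal sketch of the Consolidator-side bookkeeping this implies, assuming a simple failure threshold and a record keyed by installation GUID (both assumptions; the description does not fix these details):

    import time

    class AvailabilityTester:
        """Track alive signals from storage locations and produce the list of
        available locations sent to client applications."""

        def __init__(self, failure_threshold: int = 3):
            self.failure_threshold = failure_threshold
            self.locations: dict[str, dict] = {}   # keyed by installation GUID

        def record_alive(self, guid: str, ip: str) -> None:
            """Called when a storage location reports that it is still online."""
            self.locations[guid] = {"ip": ip, "last_seen": time.time(), "failures": 0}

        def record_missed(self, guid: str) -> None:
            """Called when a scheduled alive signal does not arrive in time."""
            if guid in self.locations:
                self.locations[guid]["failures"] += 1

        def available_storage_list(self) -> list[str]:
            """IP addresses of locations that have not exceeded the failure
            threshold; this feeds the communications sent to clients."""
            return [rec["ip"] for rec in self.locations.values()
                    if rec["failures"] < self.failure_threshold]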
As data subsets are stored at remote separate storage locations 50 that may each be controlled by a separate user, particular storage locations 50 may temporarily or
permanently become unavailable. This situation may be mitigated by redundancy incorporated into the system by the data separation algorithm allowing data to be recreated from incomplete sets of data subsets. However, the system requires a minimum number of available data subsets.
As described above with reference to Fig. 9, the availability of each separate storage location 50 is monitored, including those used to store existing data subsets. The client application is kept informed of the availability of these storage locations 50 by receiving communications from the central server or Consolidator 30. When a particular client application determines that a certain proportion of its utilised separate storage
locations 50 are no longer available, action may be required to prevent data loss of any stored data. Under these circumstances the data may be recovered from the remaining available storage locations 50 and redeployed to further available storage locations 50 maintaining redundancy.
Fig. 10 shows a flow chart 200 describing such a procedure. The client receives the communication containing the list of available separate storage locations 50. This is received from a Consolidator 30 and may comprise IP addresses of available remote storage locations for both data and FAT files.
At step 210 the client may recover a particular FAT file from the available separate storage locations 50. The FAT file contains information regarding where data subsets accessible to the particular user are stored. At step 220 the software application determines if any used storage locations 50 are no longer available. If all are available then there is nothing to do and the procedure ends.
However, if one or more storage locations 50 are no longer available then a determination is made at step 230 as to whether or not a safe level of redundancy remains, i.e. if there is a minimum percentage or number of data subsets available. This minimum may be predetermined and may depend on the required level of security or safety.
The data in question may still reside on the client machine in the form of a cache file. Therefore, step 240 determines if such a cache exists. If no cache exists then the file may be retrieved from the available separate storage locations and recreated using the recovered FAT file at step 250. If the file exists in a cache then it may instead be retrieved from this cache at step 260.
Once the data or file is recovered from either the cache or from the available storage locations 50 (i.e.
according to the data download procedure described with
reference to Fig. 8) then at step 270 the data is divided into new data subsets and uploaded to the available separate storage locations according to the method described with reference to Fig. 7, therefore restoring any lost
redundancy. The space allocated to store the original data subsets may then be recovered by deleting these data subsets from the remote separate storage locations 50 at step 290.
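The whole maintenance flow of Fig. 10 might be summarised, under the caveat that the threshold and the helper operations below are illustrative placeholders rather than specified details, as follows:

    from typing import Callable

    def maintain_redundancy(
        used_ips: set[str],
        available_ips: set[str],
        cached_data: bytes | None,
        download: Callable[[set[str]], bytes],          # Fig. 8 retrieval
        redisperse: Callable[[bytes, set[str]], None],  # Fig. 7 upload
        delete_old: Callable[[set[str]], None],
        min_available_fraction: float = 0.8,
    ) -> None:
        """Sketch of the Fig. 10 procedure; the callables stand in for the
        download, upload and deletion operations described elsewhere."""
        still_available = used_ips & available_ips
        if still_available == used_ips:
            return                                   # step 220: all locations available
        if len(still_available) / len(used_ips) >= min_available_fraction:
            return                                   # step 230: redundancy still safe
        # Steps 240-260: prefer a local cache of the data if one exists.
        data = cached_data if cached_data is not None else download(still_available)
        redisperse(data, available_ips)              # step 270: new subsets uploaded
        delete_old(still_available)                  # step 290: reclaim old space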
Fig. 11 shows a partial schema of the database tables maintained at the central server or Consolidators 30 used to determine how each separate remote storage location 50 is used or what type of data it may store. This schema includes the database tables described with reference to Fig. 5 but adds table tb_StorageTypes that stores the type of data stored at each separate remote storage location 50. For instance, the type of storage may be FAT file, data subset or describe disk space offered by each user to store other users' data subsets and FAT files.
Storage Types can include, for example:
• Storage used by User to store parts of FAT file;
• Storage used by User to store parts of upload file; and
• Storage offered by User to host parts of file of other Users.
Fig. 12 shows a further partial database schema similar to that of Fig. 11 but including table
tb_StoragesAliveSignals, which stores data recording which separate remote storage locations 50 are available and may include additional reliability data such as the time that each particular location was tested. Therefore, more recently tested locations may be considered more reliable than those that passed an Alive test less recently.
The quality of each location may also be recorded and statistics of reliability may be maintained. For example, the system may retain only the last "N" alive signals, and tests or responses may be weighted such that more recent tests of availability have a greater weighting in the reliability score. These data may be calculated and updated in real time.
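One purely illustrative interpretation of such weighting is an exponentially decaying average over the last N alive tests; the decay factor and window size below are assumptions.

    def reliability(results: list[bool], n: int = 10, decay: float = 0.8) -> float:
        """Weighted reliability over the last n alive tests (most recent last);
        each older result counts `decay` times as much as the one after it."""
        recent = results[-n:]
        weights = [decay ** (len(recent) - 1 - i) for i in range(len(recent))]
        score = sum(w for w, ok in zip(weights, recent) if ok)
        return score / sum(weights) if weights else 0.0

    # A location that failed some older tests but has passed recently scores
    # about 0.73, dominated by the recent passes.
    print(reliability([False, False, True, True, True]))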
As will be appreciated by the skilled person, details of the above embodiment may be varied without departing from the scope of the present invention, as defined by the appended claims.
For example, the remote separate storage locations may include home user or corporate user computers or servers. The central servers 30 or Consolidators may run Microsoft (RTM) Windows Server 2008, SQL Server 2008 and .NET
Framework 3.5 Service Pack 1, for example. The user
computers 20 may run Windows, Linux or Mac OS X operating systems with .NET Framework 3.5 or a version of Mono or other suitable software. The method may be implemented as software formed from a suitable programming language including C++ and any other object oriented programming language. Non-object oriented languages may also be used.
Many combinations, modifications, or alterations to the features of the above embodiments will be readily apparent to the skilled person and are intended to form part of the invention.

Claims

CLAIMS :
1. A method of securely storing data comprising the steps of:
receiving one or more communications identifying a plurality of tested and available remote separate storage locations;
dividing the data into a plurality of data subsets; allocating each one of the plurality of data subsets to a different one of the plurality of available remote
separate storage locations;
transmitting and storing each one of the data subsets at the allocated available remote separate storage location; and
transmitting to one or more of the available remote separate storage locations information specifying the allocation of each of the data subsets.
2. The method of claim 1, wherein the available remote storage locations are identified by Internet protocol, IP, addresses.
3. The method of claim 1 or claim 2, wherein the plurality of data subsets are allocated to the remote separate storage locations randomly.
4. The method according to any previous claim, wherein the plurality of data subsets are allocated to a sub-group of the plurality of available remote separate storage
locations.
5. The method according to any previous claim, wherein the remote separate storage locations are physically separate storage locations.
6. The method according to any previous claim, wherein the receiving and transmitting steps occur over a network.
7. The method of claim 6, wherein the network is selected from the group consisting of: an intranet, the Internet, a wide area network and a wireless network.
8. The method according to any previous claim, wherein the remote separate storage locations are selected from the group consisting of personal computer, hard disk drive, optical disk, FLASH RAM, web server, FTP server and network file server.
9. The method according to any previous claim further comprising the step of testing a connection to each remote separate storage location.
10. The method of claim 9 further comprising the step of maintaining a list of storage identifiers corresponding to remote separate storage locations that pass the connection test.
11. The method of claim 10, wherein the received one or more communications are generated from the maintained list.
12. The method of claim 10 or claim 11, wherein the list is stored in a database.
13. The method according to any previous claim, further comprising encrypting each data subset and/or the information specifying the allocation of each of the data subsets prior to transmission.
14. The method according to any previous claim, wherein the information specifying the allocation of each of the data subsets is transmitted in the form of a file allocation table, FAT, file.
15. The method according to any previous claim, wherein the data is recoverable from a subset of the allocated remote separate storage locations and the method further comprises the steps of:
receiving a further communication identifying a
plurality of available remote separate storage locations; and
if more than a predetermined number of the allocated remote separate storage locations are missing from the further communication or indicated as being no longer available then retrieving the data subsets from available allocated remote separate storage locations identified in the further communication, recreating the data from the retrieved data subsets, and repeating the dividing,
allocating and both transmitting steps on the recreated data using the available remote separate storage locations identified in the further communication.
16. The method according to any previous claim, wherein dividing data into a plurality of data subsets further comprises the steps of:
a) separating the data into a plurality of separated subsets; b) generating parity data from the plurality of separated subsets such that any one or more of the plurality of separated subsets may be recreated from the remaining separated subsets and the parity data; and
c) repeating steps a and b on each of the plurality of separated subsets and parity data providing the data subsets consisting of further separated subsets and further parity data.
17. The method of claim 16, wherein step c) is repeated for each of the plurality of data subsets and parity data.
18. The method according to claim 16 or claim 17, wherein the data are separated bit-wise.
19. The method according to any of claims 16 to 18, wherein the parity data are generated by performing a logical function on the plurality of data subsets.
20. The method of claim 19, wherein the logical function is an exclusive OR.
21. The method according to any previous claim further comprising the step of recording where on the available remote separate storage locations the information specifying the allocation of each of the data subsets is stored.
22. A method of retrieving data comprising the steps of: receiving one or more communications identifying a plurality of tested and available remote separate storage locations;
retrieving from one or more of the available remote separate storage locations information specifying the allocation on the remote separate storage locations of a plurality of data subsets forming the data;
retrieving the data subsets from allocated remote separate storage locations; and
combining the retrieved data subsets to form the data.
23. The method of claim 22, wherein the information
specifying the allocation on the remote separate storage locations is in the form of a file allocation table, FAT, file.
24. The method of claim 22 or claim 23, wherein the
retrieving the data subsets step further comprises the steps of:
a) retrieving parity data from the separate storage locations;
b) recreating any missing data from the retrieved data subsets and parity data to form the recreated data; c) combining the retrieved data subsets and any recreated data to form a plurality of consolidated data sets, wherein the plurality of consolidated data sets include further data and further parity data; and
d) recreating any missing further data from the further data and further parity data to form recreated further data; and
e) combining the further data and any recreated further data to form the original data.
25. A method of allocating separate storage locations comprising the steps of:
testing the availability of each of a plurality of separate storage locations;
recording the results of the tests together with identifiers of each separate storage location; and
transmitting one or more communications identifying the available separate storage locations.
26. The method of claim 25, wherein the testing the
availability step is repeated at intervals.
27. The method of claim 25 or claim 26 further comprising the step of receiving an identifier of a further available separate storage location to include in the testing of availability.
28. A system for securely storing data comprising:
a plurality of separate storage locations;
an availability tester arranged to test the
availability of each of the plurality of separate storage locations, record the results of the tests together with identifiers of each separate storage location and transmit one or more communications identifying the available
separate storage locations; and
a data manager arranged to receive the one or more communications identifying the tested and available separate storage locations, divide the data into a plurality of data subsets, allocate each one of the plurality of data subsets to a different one of the available separate storage locations, transmit and store each one of the data subsets at the allocated available separate storage location, and transmit to one or more of the available separate storage locations information specifying the allocation of each of the data subsets.
29. The system of claim 28, wherein the data manager is further arranged to retrieve the information specifying the allocation of each of the data subsets, retrieve the data subsets from the allocated separate storage locations and combine the retrieved data subsets to form the data.
30. The system of claim 28 or claim 29, wherein the availability tester is further arranged to maintain a list of identifiers corresponding to the separate storage locations that pass the availability test.
31. The system of claim 30, wherein the transmitted one or more communications are generated from the maintained list.
32. A computer program comprising program instructions that, when executed on a computer, cause the computer to perform the method of any of claims 1 to 27.
33. A computer-readable medium carrying a computer program according to claim 32.
34. A computer programmed to perform the method of any of claims 1 to 27.
PCT/GB2010/001346 2009-07-17 2010-07-14 Distributed storage WO2011007141A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
GB0912508.9 2009-07-17
GB0912508A GB2467989B (en) 2009-07-17 2009-07-17 Distributed storage

Publications (1)

Publication Number Publication Date
WO2011007141A1 true WO2011007141A1 (en) 2011-01-20

Family

ID=41058169

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/GB2010/001346 WO2011007141A1 (en) 2009-07-17 2010-07-14 Distributed storage

Country Status (2)

Country Link
GB (1) GB2467989B (en)
WO (1) WO2011007141A1 (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
DE102011010613A1 (en) * 2011-02-08 2012-08-09 Fujitsu Technology Solutions Intellectual Property Gmbh A method of storing and restoring data, using the methods in a storage cloud, storage server and computer program product
CN110546938A (en) * 2017-02-10 2019-12-06 施莱德有限公司 Decentralized data storage

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20050240749A1 (en) * 2004-04-01 2005-10-27 Kabushiki Kaisha Toshiba Secure storage of data in a network
US20070079083A1 (en) * 2005-09-30 2007-04-05 Gladwin S Christopher Metadata management system for an information dispersed storage system
WO2007133791A2 (en) * 2006-05-15 2007-11-22 Richard Kane Data partitioning and distributing system

Family Cites Families (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6789077B1 (en) * 2000-05-09 2004-09-07 Sun Microsystems, Inc. Mechanism and apparatus for web-based searching of URI-addressable repositories in a distributed computing environment
WO2001098952A2 (en) * 2000-06-20 2001-12-27 Orbidex System and method of storing data to a recording medium
US20020032844A1 (en) * 2000-07-26 2002-03-14 West Karlon K. Distributed shared memory management
JP3951949B2 (en) * 2003-03-31 2007-08-01 日本電気株式会社 Distributed resource management system, distributed resource management method and program
TW200527223A (en) * 2003-11-28 2005-08-16 Cpm S A Electronic computing system-on demand and method for dynamic access to digital resources
JP4485230B2 (en) * 2004-03-23 2010-06-16 株式会社日立製作所 Migration execution method
US7403945B2 (en) * 2004-11-01 2008-07-22 Sybase, Inc. Distributed database system providing data and space management methodology
US8060648B2 (en) * 2005-08-31 2011-11-15 Cable Television Laboratories, Inc. Method and system of allocating data for subsequent retrieval
US8230432B2 (en) * 2007-05-24 2012-07-24 International Business Machines Corporation Defragmenting blocks in a clustered or distributed computing system

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20050240749A1 (en) * 2004-04-01 2005-10-27 Kabushiki Kaisha Toshiba Secure storage of data in a network
US20070079083A1 (en) * 2005-09-30 2007-04-05 Gladwin S Christopher Metadata management system for an information dispersed storage system
WO2007120437A2 (en) 2006-04-13 2007-10-25 Cleversafe, Inc. Metadata management system for an information dispersed storage system
WO2007133791A2 (en) * 2006-05-15 2007-11-22 Richard Kane Data partitioning and distributing system

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
"Storage Virtualization - Definition Why, What, Where, and How?", INTERNET CITATION, 1 November 2004 (2004-11-01), XP002393991, Retrieved from the Internet <URL:http://www.snseurope.com/snslink/magazine/features-full.php?id=2236&m agazine=November%202004> [retrieved on 20060808] *

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
DE102011010613A1 (en) * 2011-02-08 2012-08-09 Fujitsu Technology Solutions Intellectual Property Gmbh A method of storing and restoring data, using the methods in a storage cloud, storage server and computer program product
US9419796B2 (en) 2011-02-08 2016-08-16 Fujitsu Limited Method for storing and recovering data, utilization of the method in a storage cloud, storage server and computer program product
DE102011010613B4 (en) * 2011-02-08 2020-09-10 Fujitsu Ltd. Method for storing and restoring data, use of the methods in a storage cloud, storage server and computer program product
CN110546938A (en) * 2017-02-10 2019-12-06 施莱德有限公司 Decentralized data storage
CN110546938B (en) * 2017-02-10 2023-06-27 施莱德有限公司 Decentralized data storage

Also Published As

Publication number Publication date
GB2467989A (en) 2010-08-25
GB0912508D0 (en) 2009-08-26
GB2467989B (en) 2010-12-22

Similar Documents

Publication Publication Date Title
US11157366B1 (en) Securing data in a dispersed storage network
US10416889B2 (en) Session execution decision
US11256558B1 (en) Prioritized data rebuilding in a dispersed storage network based on consistency requirements
US10089036B2 (en) Migrating data in a distributed storage network
US8171101B2 (en) Smart access to a dispersed data storage network
EP2755161B1 (en) Secure online distributed data storage services
CN104603740B (en) Filing data identifies
AU2011305569B2 (en) Systems and methods for secure data sharing
US20170005797A1 (en) Resilient secret sharing cloud based architecture for data vault
US20170091031A1 (en) End-to-end secure data retrieval in a dispersed storage network
AU2016203740B2 (en) Simultaneous state-based cryptographic splitting in a secure storage appliance
US9665429B2 (en) Storage of data with verification in a dispersed storage network
Kuo et al. A hybrid cloud storage architecture for service operational high availability
CN104603776A (en) Archival data storage system
AU2011289239A1 (en) Systems and methods for secure remote storage of data
US10558581B1 (en) Systems and techniques for data recovery in a keymapless data storage system
AU2015203172B2 (en) Systems and methods for secure data sharing
US20080195675A1 (en) Method for Pertorming Distributed Backup on Client Workstations in a Computer Network
Jogdand et al. CSaaS-a multi-cloud framework for secure file storage technology using open ZFS
WO2011007141A1 (en) Distributed storage
CN111565144A (en) Data layered storage management method for instant communication tool
GB2483222A (en) Accessing a website by retrieving website data stored at separate storage locations
US11782789B2 (en) Encoding data and associated metadata in a storage network
Karnakanti Reduction of spatial overhead in decentralized cloud storage using IDA
Akintoye et al. A Survey on Storage Techniques in Cloud Computing

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 10739657

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 10739657

Country of ref document: EP

Kind code of ref document: A1