WO2011007141A1 - Distributed storage - Google Patents

Distributed storage

Info

Publication number
WO2011007141A1
WO2011007141A1
Authority
WO
WIPO (PCT)
Prior art keywords
data
separate storage
storage locations
subsets
available
Prior art date
Application number
PCT/GB2010/001346
Other languages
French (fr)
Inventor
Yerkin Zadauly
Chokan Laumulin
Iskender Syrgabekov
Original Assignee
Extas Global Ltd.
Priority date
Filing date
Publication date
Application filed by Extas Global Ltd.
Publication of WO2011007141A1

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00Network arrangements or protocols for supporting network services or applications
    • H04L67/01Protocols
    • H04L67/06Protocols specially adapted for file transfer, e.g. file transfer protocol [FTP]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5011Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resources being hardware resources other than CPUs, Servers and Terminals
    • G06F9/5016Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resources being hardware resources other than CPUs, Servers and Terminals the resource being the memory
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00Network arrangements or protocols for supporting network services or applications
    • H04L67/01Protocols
    • H04L67/10Protocols in which an application is distributed across nodes in the network
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00Network arrangements or protocols for supporting network services or applications
    • H04L67/50Network services
    • H04L67/52Network services specially adapted for the location of the user terminal

Abstract

Method and apparatus for securely storing data comprising the steps of receiving one or more communications identifying a plurality of tested and available remote separate storage locations. Dividing the data into a plurality of data subsets. Allocating each one of the plurality of data subsets to a different one of the plurality of available remote separate storage locations. Transmitting and storing each one of the data subsets at the allocated available remote separate storage location. Transmitting to one or more of the available remote separate storage locations information specifying the allocation of each of the data subsets. A method and apparatus for retrieving data comprising the steps of receiving one or more communications identifying a plurality of tested and available remote separate storage locations. Retrieving from one or more of the available remote separate storage locations information specifying the allocation on the remote separate storage locations of a plurality of data subsets forming the data. Retrieving the data subsets from allocated remote separate storage locations. Combining the retrieved data subsets to form the data. A method and apparatus for allocating separate storage locations comprising the steps of testing the availability of each of a plurality of separate storage locations. Recording the results of the tests together with identifiers of each separate storage location. Transmitting one or more communications identifying the available separate storage locations.

Description

DISTRIBUTED STORAGE
Field of the Invention
The present invention relates to a method and system for storing and retrieving data.
Background of the Invention
Organisations require reliable and robust data storage facilities within which to store large quantities of data. These data must also be stored securely, especially transactional data that may be common in financial institutions.
As data requirements increase, dedicated data centres may meet these requirements. Very large organisations may build their own data centres and smaller organisations may lease portions of data centres or resources within them. In any case, the overheads involved in the upkeep of data centres may become very large. These costs may include power, heating, maintenance and hardware costs. Although data centres can be designed to provide secure and reliable facilities for organisations, their cost may become
prohibitive. Furthermore, as data requirements increase, so too must the size and complexity of data centres to support this growth.
Smaller organisations that cannot afford their own dedicated data centres must place their trust in third-party suppliers of the data resource. This trust ultimately depends on the integrity of individuals, who may be custodians over critical data. This situation may not be optimal for certain organisations or particular data types. To some extent these requirements may be met by online storage facilities.
The Wuala system (http://www.wuala.com/en) provides online storage facilities relying on encryption to provide secure storage. The Wuala software encrypts data files on a user's machine and then sends the encrypted files across the Internet. Wuala's storage includes portions of users' unused hard disks.
WO 2007/120437 describes an algorithm for dividing data into data slices. The original data may be reconstructed from a subset of these slices provided that the subset comprises more than a required number of slices. The slices are stored on storage nodes. A metadata management system stores metadata (including a description of the way in which data are dispersed amongst different storage nodes). Therefore, access to the metadata management system enables access to the original data, which may not be appropriate for sensitive data.
Therefore, there is required a data storage system that overcomes these problems.
Summary of the invention
Against this background, and from a first aspect, the present invention resides in a method of securely storing data comprising the steps of:
receiving one or more communications identifying a plurality of tested and available remote separate storage locations;
dividing the data file into a plurality of data
subsets; allocating each one of the plurality of data subsets to a different one of the plurality of available remote
separate storage locations;
transmitting and storing each one of the data subsets at the allocated available remote separate storage location; and
transmitting to one or more of the available remote separate storage locations information specifying the allocation of each of the data subsets. In other words, the users can retain information regarding where each data subset is stored and such information or metadata may be necessary to retrieve and recombine the data subsets. This separation of information can enhance security, whilst benefiting from distributed and remote storage. The method may be carried out by a client of a distributed storage system.
The data may be a data file, digital data, content, document or documents or any other computer readable or transmittable material.
Optionally, the method steps may be carried out on a single machine, computer or group of computers connected to a network such as an intranet or the Internet. The
information specifying the allocation of each of the data subsets or metadata may be stored locally, i.e. on the same machine that carries out each method step or within a group of machines or computers tasked to execute the method. The information specifying the allocation of each of the data subsets may be stored in a number of ways. For example, this information may be stored as a simple file or as an entry in a database. Alternatively, the information may be divided into metadata subsets such that they may be
regenerated from an incomplete set of the metadata subsets by using parity data and a suitable logical function, e.g. XOR. These metadata subsets and parity data may be further divided into further subsets and further parity data to increase redundancy and safety in case of data corruption.
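By way of illustration only, with a single parity block computed as the XOR of the metadata subsets, any one missing subset can be regenerated from the remaining subsets and the parity data. The Python sketch below assumes equal-length subsets and illustrative contents; it is not a prescribed format.

    def xor_bytes(a: bytes, b: bytes) -> bytes:
        """Byte-wise XOR of two equal-length byte strings."""
        return bytes(x ^ y for x, y in zip(a, b))

    # Three metadata subsets of equal length (contents are illustrative).
    subsets = [b"meta-000", b"meta-001", b"meta-002"]

    # The parity block is the XOR of all subsets.
    parity = subsets[0]
    for s in subsets[1:]:
        parity = xor_bytes(parity, s)

    # Suppose subset 1 is lost; XORing the parity with the surviving subsets
    # regenerates it, because XOR is its own inverse.
    recovered = parity
    for i, s in enumerate(subsets):
        if i != 1:
            recovered = xor_bytes(recovered, s)

    assert recovered == subsets[1]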
As a further alternative, the metadata (or first metadata) may be divided and stored in a similar way to that of the data, i.e. allocated as subsets to remote separate storage locations. The received communication may further specify which remote separate storage locations are to be used for data and which are to be used for storing first metadata. Additional, further or second metadata or
information may be generated to specify how the first metadata is allocated amongst the remote separate storage locations. This second metadata may also be stored locally on the machine, computer or group of computers that carries out the method or stored in another way or may act as a key especially when encrypted.
Optionally, the available remote storage locations may be identified by Internet protocol, IP, addresses. This may be in the form of a list of IP addresses.
Optionally, the plurality of data subsets may be allocated to the remote separate storage locations randomly. This may improve utilisation of the storage locations and may further enhance security by making it more difficult to predict or guess where data are stored.
Optionally, the plurality of data subsets may be allocated to a sub-group of the plurality of available remote separate storage locations. Therefore, any central server or administrator may not be able to determine where specific data subsets are stored, further enhancing
security.
Preferably, the remote separate storage locations are physically separate storage locations. This improves reliability by reducing the risk of multiple storage locations becoming unavailable at the same time due to failure of a particular facility.
Preferably, the receiving and transmitting steps occur over a network.
Preferably, the network may be selected from the group consisting of: an intranet, the Internet, a wide area network and a wireless network. These network types allow different types of storage locations to be utilised.
Optionally, the remote separate storage locations may be selected from the group consisting of personal computer, hard disk drive, optical disk, FLASH RAM, web server, FTP server and network file server. Other storage types may be used.
Optionally, the method may further comprise the step of testing a connection to each remote separate storage
location. This improves reliability by monitoring
availability, especially over a network. The testing of a connection may be carried out by a device or server remote (or on another part of a network or the Internet) from the client or user storing the data. In other words, the storage method and testing steps may be carried out
separately and independently so that the device or server that carries out the testing steps may not or cannot access the stored data and does not have enough information to retrieve or read the stored data.
Optionally, the method further comprises the step of maintaining a list of storage identifiers corresponding to remote separate storage locations that pass the connection test. This may also be carried out by a separate device or server from that of the client or user device that stored data in the remote separate storage locations.
Preferably, the received one or more communications may be generated from the maintained list. Optionally, the list may be stored in a database or in other form.
Optionally, the method may further comprise encrypting each data subset and/or the information specifying the allocation of each of the data subsets prior to
transmission. This further improves security as the data may only be decrypted with a key stored by the user. The data may also be encrypted before division.
Optionally, the information specifying the allocation of each of the data subsets may be transmitted in the form of a file allocation table, FAT, file. Therefore, this information may be in a standardised format.
Optionally, the data or data file may be recoverable from a subset of the allocated remote separate storage locations and the method further comprises the steps of: receiving a further communication identifying a
plurality of available remote separate storage locations; and
if more than a predetermined number of the allocated remote separate storage locations are missing from the further communication or indicated as being no longer available then retrieving the data subsets from available allocated remote separate storage locations identified in the further communication, recreating the data from the retrieved data subsets, and repeating the dividing,
allocating and both transmitting steps on the recreated data using the available remote separate storage locations identified in the further communication. Therefore, the recoverability of the data may be increased as remote separate storage locations become unavailable or as further locations become available.
Optionally, dividing the data or data file into a plurality of data subsets further comprises the steps of: a) separating the data into a plurality of separated subsets;
b) generating parity data from the plurality of separated subsets such that any one or more of the plurality of separated subsets may be recreated from the remaining separated subsets and the parity data; and
c) repeating steps a and b on each of the plurality of separated subsets and parity data providing the data subsets consisting of further separated subsets and further parity data. This is one way of dividing the data and improves recoverability, redundancy and security.
Optionally, step c) may be repeated for each of the plurality of data subsets and parity data. Therefore, the data may be cascaded further increasing reliability.
Optionally, the data are separated bit-wise.
Alternatively, the data may be separated byte-wise.
Optionally, the parity data are generated by performing a logical function on the plurality of data subsets.
Optionally, the logical function is an exclusive OR. The XOR function is particularly computationally efficient.
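A minimal sketch of one such dispersal step in Python, assuming a byte-wise separation into a fixed number of subsets and a single XOR parity block per level; the subset count, padding scheme and recursion depth are illustrative assumptions rather than details taken from the description.

    def disperse(data: bytes, n: int = 3) -> list[bytes]:
        """Separate data byte-wise into n subsets and append one XOR parity
        block, so that any single missing subset can later be rebuilt from
        the remaining subsets and the parity data (steps a and b)."""
        pad = (-len(data)) % n
        data += b"\x00" * pad                      # illustrative padding scheme
        subsets = [data[i::n] for i in range(n)]   # byte-wise separation
        parity = bytes(len(data) // n)
        for s in subsets:
            parity = bytes(x ^ y for x, y in zip(parity, s))
        return subsets + [parity]

    def cascade(data: bytes, depth: int = 2, n: int = 3) -> list[bytes]:
        """Repeat the separation on each subset and parity block (step c)."""
        pieces = disperse(data, n)
        if depth <= 1:
            return pieces
        out: list[bytes] = []
        for piece in pieces:
            out.extend(cascade(piece, depth - 1, n))
        return out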
Optionally, the method may further comprise the step of recording where on the available remote separate storage locations the information specifying the allocation of each of the data subsets is stored. This information may be stored with the user and may also be used as the starting point for retrieving data for the remote separate storage locations. In other words, the user may store information that records where a FAT file or equivalent information is stored on the separate remote storage locations and may further record how this FAT file or equivalent is itself divided into data subsets and where and how it may be retrieved and recovered. This information may be fairly brief and therefore small and more portable but may act as a key for recovering much larger data.
According to a second aspect there is provided a method of retrieving data comprising the steps of:
receiving one or more communications identifying a plurality of tested and available remote separate storage locations;
retrieving from one or more of the available remote separate storage locations information specifying the allocation on the remote separate storage locations of a plurality of data subsets forming the data;
retrieving the data subsets from allocated remote separate storage locations; and
combining the retrieved data subsets to form the data. This method may be used to recover or retrieve stored data according to the first aspect.
Optionally, the information specifying the allocation on the remote separate storage locations may be in the form of a file allocation table, FAT, file. Other formats may be used.
Optionally, the retrieving the data subsets step may further comprise the steps of:
a) retrieving parity data from the separate storage locations;
b) recreating any missing data from the retrieved data subsets and parity data to form the recreated data; c) combining the retrieved data subsets and any recreated data to form a plurality of consolidated data sets, wherein the plurality of consolidated data sets include further data and further parity data; and
d) recreating any missing further data from the further data and further parity data to form recreated further data; and e) combining the further data and any recreated further data to form the original data. This improves resilience to loss of data that may be due to missing, failed or otherwise unavailable separate remote storage locations used to store the data subsets.
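Purely for illustration, recombination of one level of the cascade produced by the earlier sketch might look as follows; the single-parity assumption and the byte-wise interleaving mirror that sketch, and removal of any padding is omitted for brevity. At the outer level the same routine would be applied again to the consolidated data sets.

    def rebuild_level(pieces: list) -> bytes:
        """Recombine one level of the cascade.  `pieces` holds the n data
        subsets followed by the parity block, in order; at most one entry
        may be None (missing or unavailable)."""
        *subsets, parity = pieces
        if any(s is None for s in subsets):
            # Regenerate the single missing subset from the parity block
            # and the surviving subsets.
            missing = subsets.index(None)
            rebuilt = parity
            for i, s in enumerate(subsets):
                if i != missing:
                    rebuilt = bytes(x ^ y for x, y in zip(rebuilt, s))
            subsets[missing] = rebuilt
        # Undo the byte-wise interleaving used when the data were separated.
        n, size = len(subsets), len(subsets[0])
        out = bytearray(n * size)
        for i, s in enumerate(subsets):
            out[i::n] = s
        return bytes(out)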
In accordance with a third aspect there is provided a method of allocating separate storage locations comprising the steps of:
testing the availability of each of a plurality of separate storage locations;
recording the results of the tests together with identifiers of each separate storage location; and
transmitting one or more communications identifying the available separate storage locations. This allows storage locations to be allocated based on reliability and may also improve the recoverability of data and storage by
continually or repeatedly testing availability. Preferably, the testing is carried out by a device, server or
Consolidator or group of such devices that is remote, separate and independent from devices or users that store data on the separate storage locations. This separation, especially over a network or the Internet, provides
additional security for the users' data, limiting its retrieval to the device that stored it or to a device with access to information describing how the data were stored.
Preferably, the testing the availability step may be repeated at intervals. The intervals may be for instance, hourly or daily.
Optionally, the method may further comprise the step of receiving an identifier of a further available separate storage location to include in the testing of availability. Therefore, newly available storage locations may be added or re-added as they appear or their owners provide new storage locations to the system.
In accordance with a fourth aspect there is provided a system for securely storing data comprising:
a plurality of separate storage locations;
an availability tester arranged to test the
availability of each of the plurality of separate storage locations, record the results of the tests together with identifiers of each separate storage location and transmit one or more communications identifying the tested and available separate storage locations; and
a data manager arranged to receive the one or more communications identifying the tested and available separate storage locations, divide the data into a plurality of data subsets, allocate each one of the plurality of data subsets to a different one of the available separate storage locations, transmit and store each one of the data subsets at the allocated available separate storage location, and transmit to one or more of the available separate storage locations information specifying the allocation of each of the data subsets.
Optionally, the data manager may be further arranged to retrieve the information specifying the allocation of each of the data subsets, retrieve the data subsets from the allocated separate storage locations and combine the retrieved data subsets to form the data.
The methods may be embodied as computer programs or programmed computers or stored on a computer-readable medium or as a transmitted signal.
Brief description of the Figures
The present invention may be put into practice in a number of ways and embodiments will now be described by way of example only and with reference to the accompanying drawings, in which:
FIG. 1 shows a schematic diagram of a system for storing data, given by way of example only;
FIG. 2 shows a schematic diagram of the system of Fig. 1 in greater detail;
FIG. 3 shows a portion of the system of Fig. 2 in further detail;
FIG. 4 shows a schematic diagram of a portion of the system shown in Fig. 2 in further detail;
FIG. 5 shows a portion of a database schema used in the system of Fig. 1;
FIG. 6 shows a schematic diagram illustrating a method for storing data on a system of Fig. 1;
FIG. 7 shows a schematic diagram of the method steps used to distribute data on the system of Fig. 1;
FIG. 8 shows a schematic diagram of a portion of the system of Fig. 1 in further detail;
FIG. 9 shows a schematic diagram of a portion of a system of Fig. 1 including components used to test storage locations;
FIG. 10 shows a flow diagram of a method used to retrieve and maintain data within the system of Fig. 1;
FIG. 11 shows a partial database schema used by the system of Fig. 1; and
FIG. 12 shows a further partial database schema used by the system of Fig. 1.
It should be noted that the figures are illustrated for simplicity and are not necessarily drawn to scale.
Detailed description of the preferred embodiments
Fig. 1 is a schematic diagram of a storage system for storing and retrieving data. The storage system 10 allows a user to send and retrieve data from a user computer 20 to one or more remote separate storage locations 50.
A central server 30 monitors and tests the availability and connectivity of the separate storage locations 50 and records this information in a database 40. The separate storage locations 50 are identified by their internet protocol (IP) addresses. These IP addresses may also be stored in the database 40. For instance, the central server 30 may monitor the period that a storage location 50 is available and the type and speed of connection that may be made to each storage location 50 at any particular time. The testing may be carried out at intervals.
The separate storage locations 50 may be individual and separate personal computers each having a hard drive 60 or other storage device. Within each hard drive 60 there may be an allocated portion 70 made available to the storage system 10. The allocated portion 70 may be partitioned or otherwise made separate from the remaining portion of the hard drive 60 and so unavailable for use by the particular personal computer except for the purpose of remote storage.
The central server 30 sends a message or communication across a network link 90 to the user computer 20. The message contains a list of the IP addresses of each
available separate storage location 50. Therefore, the user computer 20 receives a message containing the IP addresses or other identifier of each available remote separate storage location 50 to which it may send and retrieve data. A software application 99 running on the user computer 20 divides any data to be stored into separate data sets. Each of these subsets of data may be transmitted across a data link 100 to the allocated portion 70 of the hard drive 60 of a particular personal computer (or other computer or server) making up the separate storage locations 50.
The software application running on the user computer 20 may determine which particular data subsets to store on each available separate storage location 50. It may not be necessary to use all of the available separate storage location 50 detailed in the communication sent from the central server 30, especially if the data to be stored is smaller than the available storage space from the total number of the separate storage locations 50. The data subsets may be allocated to the separate storage locations 50 randomly, for instance. The software application may also record information identifying which data subsets were stored on which particular available remote storage
locations 50. This information may be stored in the form of a File Allocation Table, FAT, file or in another suitable format. The software application may provide the central server 30 with an indication of its data requirements including size and reliability or availability level.
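As a sketch of this client-side step only (the FAT-style record, the address values and the random placement policy are illustrative assumptions), the allocation and its record might be built as follows:

    import random

    def allocate(subsets: list[bytes], locations: list[str]) -> dict[str, list[int]]:
        """Randomly allocate each data subset to one of the available storage
        locations (identified here by IP address) and return a mapping from
        location to the indices of the subsets stored there."""
        allocation: dict[str, list[int]] = {}
        for index, _ in enumerate(subsets):
            ip = random.choice(locations)          # random placement
            allocation.setdefault(ip, []).append(index)
        return allocation

    # Example: three subsets spread over locations named in the communication
    # from the central server (addresses are illustrative).
    fat_record = allocate([b"pkg1", b"pkg2", b"pkg3"],
                          ["192.0.2.10", "192.0.2.11", "192.0.2.12"])

Such a record, or an equivalent FAT-style file, is what the client would later consult to retrieve the subsets.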
The data to be stored may be divided in such a way that each individual data subset cannot be used to recreate the original data without a minimum number of data subsets.
Furthermore, the data may be divided into data subsets such that the loss of any particular data subset may be tolerated without resulting in a total loss of original data.
The network and data connections may be made across an extended network 110 such as an intranet or the Internet, for instance. The availability and reliability of each separate storage location 50 may be tested at regular intervals and as further separate storage locations 50 become available, their details and IP addresses may be added to the database 40 and updated messages may be sent to the user computer 20 as this information changes.
Furthermore, the integrity of the stored data does not require all separate storage locations 50 to be available at the same time. Therefore, the reliability of the entire data storage system 10 does not require highly available storage locations, such as those provided by a data centre.
The owners of individual personal computers or perhaps computers on a corporate network may make the allocated portion 70 of their own internal hard drives 60 available to the system 10 in return for benefits, payments or the use of other services. These owners may sign up through Internet sites, social networking sites or by running downloaded software, for instance. To enhance security further, the data subsets may be encrypted before being sent over the data network 100 to the separate storage locations 50.
Retrieval of the original data from the separate storage locations 50 may be carried out by retrieving each data subset (or at least the minimum number required to
regenerate the data). Once retrieved, the data subsets may be recombined within the user computer 20. Any missing data subsets, perhaps due to the unavailability of
particular personal computers, may be recreated from the remaining data subsets, if necessary using a suitable algorithm. Therefore, reliability and security may be maintained within the data storage system 10 without the need for dedicated data centres. The FAT file may be used to retrieve and regenerate the data. Furthermore, the FAT file may also be stored on one or more of the remote
separate storage locations 50. The user computer 20 may act as an administrator for a group of computers on a network so that the stored data may be gathered from more than one computer.
The personal computers making up the remote separate storage locations 50 may be left on and connected to the Internet 110 for extended periods, resulting in increased availability as monitored by the central server 30. Should a separate storage location 50 fail an availability test a
predetermined number of times or not be available for a predetermined period then testing of that particular
separate storage location 50 may cease and its IP address may be removed from the database 40 or marked as
unavailable. The owner of that particular removed separate storage location 50 may be notified of the situation
allowing them to improve the availability of this particular location so that the particular remote separate storage location 50 may be returned to the list of available
locations at a future time.
Although Fig. 1 shows a single user computer 20 and three separate storage locations 50 there may be many more of each. Furthermore, there may be multiple central servers 30 each allocated to a group of user computers 20 and/or separate storage locations 50.
The central server 30 may allocate to each user computer 20 a subset of IP addresses relating to a subset of the available separate storage locations 50. The number of allocated IP addresses may be proportional to the storage requirements of the user computer 20 with more IP addresses allocated for higher usage user computers 20.
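One plausible reading of this proportional allocation, sketched in Python with an assumed headroom factor and an assumed per-location capacity (neither value is specified in the description):

    import math

    def addresses_for(storage_needed_gb: float, per_location_gb: float,
                      available_ips: list[str], headroom: float = 1.5) -> list[str]:
        """Pick a number of storage-location addresses roughly proportional to
        a client's storage requirement, with some headroom for redundancy."""
        needed = max(1, math.ceil(storage_needed_gb * headroom / per_location_gb))
        return available_ips[:needed]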
Each allocated portion 70 may be of a different size and information regarding the total capacity, the used capacity and/or the remaining capacity may be determined and stored in the database 40 of the central server 30.
Therefore, allocation of storage may also be made based on the available storage space for each remote separate storage location 50. The owners of the remote separate storage locations 50 may change the amount of storage space
allocated and these changes may be communicated to the central server 30.
Users of the service may include home users and
corporate users. Home users may require simple backup of files or convenient and secure file sharing. Home users may both share space and may use the distributed storage space. The central server 30 may administer different security or reliability ratings required by each user. This may result in home users, for instance, being provided with lower security and reliability levels than those offered to corporate users. This may allow different types of
dispersal and retrieval algorithms to be executed by
different types of users.
Corporate Users may be companies needing secure
storage. This type of user may require storage with higher rating and reliability to ensure security of data retrieval and a higher quality network. Furthermore, corporate users may also have the option to use an entirely isolated and dedicated service, i.e. a full network of dedicated remote separate storage locations 50 and one or more dedicated central servers 30 also known as Consolidators. Each
Consolidator may be a single central server 30 or group of servers. The system may comprise one or more Consolidators 30 for load balancing purposes. Dedicated Consolidators 30 may be provided for a particular purpose or group of users. A complete system of Consolidators 30, remote storage and clients may be replicated for this type of user and they may also have a dedicated intranet to link these components. Fig. 2 shows a schematic diagram of a simplified but extended system to that shown in Fig. 1. Fig. 2 shows three user computers 20 of different types. User computer or domain 120 represents a home user; domains 130 and 140 represent company users. Both domain types may operate on servers 98 or other computers. The requirements for the home user 120 may be relatively small and those of the company users 130, 140 may be higher. In each case, the users or clients may use storage space from other users and may optionally provide space to others.
The operation of the system may be explained by
reference to an example scenario. A user of the system may wish to save a particular file. In order to do this, the user may use the software application 99 to upload the file over the Internet 110. The first step is authentication (via a connection indicated by a dotted line between the user computer 20 and the Consolidator or central server 30) and this is shown schematically in further detail in Fig. 3. Data are transmitted as data subsets through the Internet 110 as indicated by the solid lines. The Consolidator 30 tests connections and availability of the home user and company domains 130, 140 used to provide the separate storage locations 50 using further Internet transmissions as shown by dashed lines.
The client domain 120 sends a username (User), password (PWD) and GUID to the central server or Consolidator 30 for authentication.
Actors in the system may refer to each other with a GUID that is a unique ID assigned to each installation of the software application. This may be independent of user and IP address. The IP address of the actor may be used during a procedure, when an IP Address is needed in a communication. A remote storage location 50 can change its IP address, for instance, especially where dynamically allocated IP addresses change whenever a home user
disconnects and re-connects to the Internet. However, the GUID will remain constant.
Following successful authentication the Consolidator 30 may send the user a communication or storage list containing a list of available remote storage locations 50. This may be in the form of a list of IP addresses and may be
subdivided into two groups: the first group may be the Data IP Collection list (DAT), i.e. a list of storage locations where users' data may be stored. The second list may be the FAT IP Collection (FAT), i.e. where the FAT files may be stored.
The FAT IP Collection may contain a list of all IP Addresses of remote storage locations 50 where the client can find a list of uploaded files and other information to recreate its virtual directory tree. Therefore, the user may use the system from different PCs and can retrieve all its uploaded files with the directory structure created. With this arrangement, no sensitive information needs to be stored at the Consolidator 30. The FAT file may be saved on the system in a similar way to other data files. Therefore, the FAT file can also be stored securely. Furthermore, the FAT file may be written or encoded so that only the client can read it.
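The description does not fix a wire format for this storage list; purely as an illustration, it might be rendered as a structure like the following (the field names, GUID and addresses are assumptions):

    # Hypothetical rendering of the storage list a Consolidator might send
    # after authentication: one group of location identifiers for data
    # subsets (DAT) and one for FAT-file portions (FAT).
    storage_list = {
        "guid": "3f2504e0-4f89-11d3-9a0c-0305e82c3301",     # client installation GUID
        "dat": ["192.0.2.21", "192.0.2.22", "192.0.2.23"],  # Data IP Collection
        "fat": ["192.0.2.31", "192.0.2.32"],                # FAT IP Collection
    }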
Fig. 4 illustrates schematically how the client may retrieve and reconstruct information forming the FAT file. In summary, the client domain 120 is provided with a list (storage list Uid/IP) of available remote storage locations 50 in the form of IP addresses 1. The client domain 120 retrieves from the available remote storage locations portions of data that may be combined to form a recreated FAT file. The retrieval of data subsets is indicated schematically by arrows 2. The data subsets are combined or consolidated using an algorithm 3 to form a complete data set, regenerating, if required, any missing or damaged data subsets from retrieved parity data and available data subsets. The dispersal algorithm is described in further detail with reference to Fig. 7. These portions may be encrypted and so decryption 4 may be required before the FAT file may be used. The user may store information specifying how and where the FAT file is stored.
Fig. 5 shows a schematic diagram of various tables of a database schema used in the FAT file reconstruction process. tb_Storages contains the IP addresses of all installations or domains providing allocated portions 70 of available storage space. tb_Users contains all registered users.
tb_LNK_Users_Storages contains information about which user is using a storage and the scope of that storage (i.e. for files and/or FAT). Following a successful authentication, a Consolidator 30 generates a list of IP addresses to send to a client domain 120 allowing retrieval of the portions of that user's particular FAT file. Other information may be stored in these database tables.
Should a hacker impersonate a user in order to obtain the FAT file information they will be prevented from doing so, as at least parts of the FAT file may be encrypted. To decrypt the FAT file (and any or all other files that may be retrieved) a decryption key is required. Alternatively, the information specifying how and where on the separate storage locations the FAT file is stored may be required before the FAT file can be reconstructed.
The decryption key may be a passkey string, a biometric signature or other suitable data. The decryption passkey should preferably be strong and not saved on any part of the system. Therefore, a user may need to remember the passkey, which itself may be used to retrieve an encrypted file with a further hardcoded key. Under one example implementation ten hardcoded keys may be saved as one or more temporary files (which may be viewed as cookies). The file may be encrypted using one of these ten hardcoded keys chosen at random. To decrypt the file, the client may try all ten hardcoded keys. Nine out of ten hardcoded keys will fail but the user will not be aware of this. However, such a procedure makes it more difficult for a hacker to obtain or generate the hardcoded keys. In other words, the FAT file may be encrypted using a different hardcoded key each time it is generated or changed.
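A sketch of this trial-decryption idea using the Fernet recipe from the Python cryptography package; the choice of cipher, the key count of ten and the key generation shown here are illustrative assumptions, not details given in the description.

    import random
    from cryptography.fernet import Fernet, InvalidToken

    # Ten hardcoded keys held by the client application (generated here only
    # for the sake of the example; a real client would ship with fixed,
    # obfuscated keys).
    HARDCODED_KEYS = [Fernet.generate_key() for _ in range(10)]

    def encrypt_fat(fat_bytes: bytes) -> bytes:
        """Encrypt the FAT file with one of the ten keys chosen at random."""
        key = random.choice(HARDCODED_KEYS)
        return Fernet(key).encrypt(fat_bytes)

    def decrypt_fat(token: bytes) -> bytes:
        """Try every hardcoded key; nine fail silently, one succeeds."""
        for key in HARDCODED_KEYS:
            try:
                return Fernet(key).decrypt(token)
            except InvalidToken:
                continue
        raise ValueError("no hardcoded key matched")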
Security may be further enhanced by the use of
obfuscation of code. The hardcoded keys may be generated using a hardware-based procedure. Therefore, the client may recreate a key every time without hardcoded (and so readable) keys. Obfuscation provides additional security against hackers. The FAT file may be a flat text file containing information in XML format, for instance.
Fig. 6 shows schematically the procedure for uploading a file or data. Uploading a file may require successful login into the system. When a user attempts to upload a file, he has to indicate which file he wants to upload from his machine (or from any reachable location excluding perhaps Internet locations). The client application then starts a series of operations. The first operation may be to encrypt the file. Encryption may or may not be used.
After file selection, the client application may start to encrypt the file with a given key from the user (for instance using a string or a biometric reading). Then it starts the dispersal algorithm in order to divide the data into data subsets, PKG1, PKG2, PKG3, etc. The dispersal algorithm provides a method to split the file into several slices.
Fig. 7 shows schematically the operation of the example dispersal algorithm. Each file may be split into three data subsets and each data subset may then be split a further three times, and so on until the number of required data subsets has been reached or each data subset has reached a particular size. Greater division provides additional security and reliability for recreating the original file.
After receiving a communication containing the list of available storage locations 50 the client application may decide where to store each data subset. If the storage locations are not sufficient to store each data subset (or perhaps if a remote separate storage location is no longer available) then a new IP address representing a further storage location 50 may be required. Further separate remote storage locations may be included in a further communication received from a Consolidator 30. In this case, the client application may inform the Consolidator 30 that another IP address will be used.
This operation may be considered as a single logical transaction. Therefore, when a storage location 50 (through its client application) receives a data subset or slice it may need to wait for a signal that all other slices are stored and secured. Failure to receive such a signal after a time out or predetermined period may then indicate that the data subset should be deleted so that a further attempt may be made. The Consolidators 30 are otherwise not used during this procedure.
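Seen from the storage location's side, one way such provisional storage might behave is sketched below; the time-out value, the class shape and the name of the commit signal are assumptions.

    import threading

    class ProvisionalSlice:
        """Hold a received data subset until the sender signals that every
        slice of the logical transaction has been stored and secured, or a
        time-out expires."""

        def __init__(self, slice_id: str, payload: bytes, timeout_s: float = 300.0):
            self.slice_id = slice_id
            self.payload = payload
            self.committed = False
            # If no commit arrives within timeout_s, delete the slice so that
            # a further storage attempt may be made.
            self._timer = threading.Timer(timeout_s, self.rollback)
            self._timer.start()

        def commit(self) -> None:
            """Called when the 'all slices stored and secured' signal arrives."""
            self.committed = True
            self._timer.cancel()

        def rollback(self) -> None:
            if not self.committed:
                self.payload = b""   # discard the provisionally stored data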
Successful completion of the logical transaction may further require generation of the FAT file and completion of the FAT file transmission. Otherwise, a complete rollback may be required, such that all data subsets may require deletion if a traceable and retrievable FAT file cannot be confirmed.
Fig. 8 shows a schematic diagram of a data retrieval process. This is similar to the retrieval of the FAT file described with reference to Fig. 4. However, instead of portions of the FAT file, data subsets may be retrieved from the separate storage locations 50.
Fig. 9 shows schematically how the list of available storage locations 50 is compiled and maintained. Operations are carried out to update data monitoring the reliability of client applications associated with separate storage
locations 50. Alive signals or requests may be propagated to each actor, e.g. each client providing a storage location 50. At a scheduled time, each of these clients may send a signal or communication to a Consolidator 30 informing that they are still alive and online. The Consolidator 30 may receive this signal from a group of clients and send
acknowledgments, accordingly.
The clients may receive these communications or
messages and respond to Consolidators 30, accordingly. This procedure allows Consolidators 30 to test the speed and reliability of the remote storage locations 50. An
historical log of these performances may be preserved in the database 40 of a Consolidator or central server 30. Clients that fail such a test, e.g. by not responding within a predetermined period may be deleted from the database 40 or marked as unavailable. Such clients may also be excluded from the list of available storage locations in any
communications transmitted from a Consolidator 30. Clients may also be added or re-added should they become available and pass the alive test.
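A minimal sketch of the Consolidator-side bookkeeping this implies, assuming a simple failure threshold and a record keyed by installation GUID (both assumptions; the description does not fix these details):

    import time

    class AvailabilityTester:
        """Track alive signals from storage locations and produce the list of
        available locations sent to client applications."""

        def __init__(self, failure_threshold: int = 3):
            self.failure_threshold = failure_threshold
            self.locations: dict[str, dict] = {}   # keyed by installation GUID

        def record_alive(self, guid: str, ip: str) -> None:
            """Called when a storage location reports that it is still online."""
            self.locations[guid] = {"ip": ip, "last_seen": time.time(), "failures": 0}

        def record_missed(self, guid: str) -> None:
            """Called when a scheduled alive signal does not arrive in time."""
            if guid in self.locations:
                self.locations[guid]["failures"] += 1

        def available_storage_list(self) -> list[str]:
            """IP addresses of locations that have not exceeded the failure
            threshold; this feeds the communications sent to clients."""
            return [rec["ip"] for rec in self.locations.values()
                    if rec["failures"] < self.failure_threshold]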
As data subsets are stored at remote separate storage locations 50 that may each be controlled by a separate user, particular storage locations 50 may temporarily or
permanently become unavailable. This situation may be mitigated by redundancy incorporated into the system by the data separation algorithm allowing data to be recreated from incomplete sets of data subsets. However, the system requires a minimum number of available data subsets.
As described above with reference to Fig. 9, the availability of each separate storage location 50 is monitored, including those used to store existing data subsets. The client application is kept informed of the availability of these storage locations 50 by receiving communications from the central server or Consolidator 30. When a particular client application determines that a certain proportion of its utilised separate storage
locations 50 are no longer available, action may be required to prevent data loss of any stored data. Under these circumstances the data may be recovered from the remaining available storage locations 50 and redeployed to further available storage locations 50 maintaining redundancy.
Fig. 10 shows a flow chart 200 describing such a procedure. The client receives the communication containing the list of available separate storage locations 50. This is received from a Consolidator 30 and may comprise IP addresses of available remote storage locations for both data and FAT files.
At step 210 the client may recover a particular FAT file from the available separate storage locations 50. The FAT file contains information regarding where data subsets accessible to the particular user are stored. At step 220 the software application determines if any used storage locations 50 are no longer available. If all are available then there is nothing to do and the procedure ends.
However, if one or more storage locations 50 are no longer available then a determination is made at step 230 as to whether or not a safe level of redundancy remains, i.e. if there is a minimum percentage or number of data subsets available. This minimum may be predetermined and may depend on the required level of security or safety.
The data in question may still reside on the client machine in the form of a cache file. Therefore, step 240 determines if such a cache exists. If no cache exists then the file may be retrieved from the available separate storage locations and recreated using the recovered FAT file at step 250. If the file exists in a cache then it may instead be retrieved from this cache at step 260.
Once the data or file is recovered from either the cache or from the available storage locations 50 (i.e.
according to the data download procedure described with
reference to Fig. 8) then at step 270 the data is divided into new data subsets and uploaded to the available separate storage locations according to the method described with reference to Fig. 7, therefore restoring any lost
redundancy. The space allocated to store the original data subsets may then be recovered by deleting these data subsets from the remote separate storage locations 50 at step 290.
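The whole maintenance flow of Fig. 10 might be summarised, under the caveat that the threshold and the helper operations below are illustrative placeholders rather than specified details, as follows:

    from typing import Callable

    def maintain_redundancy(
        used_ips: set[str],
        available_ips: set[str],
        cached_data: bytes | None,
        download: Callable[[set[str]], bytes],          # Fig. 8 retrieval
        redisperse: Callable[[bytes, set[str]], None],  # Fig. 7 upload
        delete_old: Callable[[set[str]], None],
        min_available_fraction: float = 0.8,
    ) -> None:
        """Sketch of the Fig. 10 procedure; the callables stand in for the
        download, upload and deletion operations described elsewhere."""
        still_available = used_ips & available_ips
        if still_available == used_ips:
            return                                   # step 220: all locations available
        if len(still_available) / len(used_ips) >= min_available_fraction:
            return                                   # step 230: redundancy still safe
        # Steps 240-260: prefer a local cache of the data if one exists.
        data = cached_data if cached_data is not None else download(still_available)
        redisperse(data, available_ips)              # step 270: new subsets uploaded
        delete_old(still_available)                  # step 290: reclaim old space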
Fig. 11 shows a partial schema of the database tables maintained at the central server or Consolidators 30 used to determine how each separate remote storage location 50 is used or what type of data it may store. This schema includes the database tables described with reference to Fig. 5 but adds table tb_StorageTypes that stores the type of data stored at each separate remote storage location 50. For instance, the type of storage may be FAT file, data subset or describe disk space offered by each user to store other users' data subsets and FAT files.
Storage Types can include, for example:
• Storage used by User to store parts of FAT file;
• Storage used by User to store parts of upload file; and
• Storage offered by User to host parts of file of other Users.
Fig. 12 shows a further partial database schema similar to that of Fig. 11 but including table
tb_StoragesAliveSignals, which stores data recording which separate remote storage locations 50 are available and may include additional reliability data such as the time that each particular location was tested. Therefore, more recently tested locations may be considered more reliable than those that passed an Alive test less recently.
The quality of each location may also be recorded and statistics of reliability may be maintained. For example, the system may retain only the last "N" alive signals, and tests or responses may be weighted such that more recent tests of availability have a greater weighting in the reliability score. These data may be calculated and updated in real time.
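One purely illustrative interpretation of such weighting is an exponentially decaying average over the last N alive tests; the decay factor and window size below are assumptions.

    def reliability(results: list[bool], n: int = 10, decay: float = 0.8) -> float:
        """Weighted reliability over the last n alive tests (most recent last);
        each older result counts `decay` times as much as the one after it."""
        recent = results[-n:]
        weights = [decay ** (len(recent) - 1 - i) for i in range(len(recent))]
        score = sum(w for w, ok in zip(weights, recent) if ok)
        return score / sum(weights) if weights else 0.0

    # A location that failed some older tests but has passed recently scores
    # about 0.73, dominated by the recent passes.
    print(reliability([False, False, True, True, True]))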
As will be appreciated by the skilled person, details of the above embodiment may be varied without departing from the scope of the present invention, as defined by the appended claims.
For example, the remote separate storage locations may include home user or corporate user computers or servers. The central servers 30 or Consolidators may run Microsoft (RTM) Windows Server 2008, SQL Server 2008 and .NET
Framework 3.5 Service Pack 1, for example. The user
computers 20 may run Windows, Linux or Mac OS X operating systems with .NET Framework 3.5 or a version of Mono or other suitable software. The method may be implemented as software formed from a suitable programming language including C++ and any other object oriented programming language. Non-object oriented languages may also be used.
Many combinations, modifications, or alterations to the features of the above embodiments will be readily apparent to the skilled person and are intended to form part of the invention.

Claims

CLAIMS :
1. A method of securely storing data comprising the steps of:
receiving one or more communications identifying a plurality of tested and available remote separate storage locations;
dividing the data into a plurality of data subsets; allocating each one of the plurality of data subsets to a different one of the plurality of available remote
separate storage locations;
transmitting and storing each one of the data subsets at the allocated available remote separate storage location; and
transmitting to one or more of the available remote separate storage locations information specifying the allocation of each of the data subsets.
2. The method of claim 1, wherein the available remote storage locations are identified by Internet protocol, IP, addresses.
3. The method of claim 1 or claim 2, wherein the plurality of data subsets are allocated to the remote separate storage locations randomly.
4. The method according to any previous claim, wherein the plurality of data subsets are allocated to a sub-group of the plurality of available remote separate storage
locations.
5. The method according to any previous claim, wherein the remote separate storage locations are physically separate storage locations.
6. The method according to any previous claim, wherein the receiving and transmitting steps occur over a network.
7. The method of claim 6, wherein the network is selected from the group consisting of: an intranet, the Internet, a wide area network and a wireless network.
8. The method according to any previous claim, wherein the remote separate storage locations are selected from the group consisting of personal computer, hard disk drive, optical disk, FLASH RAM, web server, FTP server and network file server.
9. The method according to any previous claim further comprising the step of testing a connection to each remote separate storage location.
10. The method of claim 9 further comprising the step of maintaining a list of storage identifiers corresponding to remote separate storage locations that pass the connection test.
11. The method of claim 10, wherein the received one or more communications are generated from the maintained list.
12. The method of claim 10 or claim 11, wherein the list is stored in a database.
13. The method according to any previous claim, further comprising encrypting each data subset and/or the information specifying the allocation of each of the data subsets prior to transmission.
14. The method according to any previous claim, wherein the information specifying the allocation of each of the data subsets is transmitted in the form of a file allocation table, FAT, file.
15. The method according to any previous claim, wherein the data is recoverable from a subset of the allocated remote separate storage locations and the method further comprises the steps of:
receiving a further communication identifying a
plurality of available remote separate storage locations; and
if more than a predetermined number of the allocated remote separate storage locations are missing from the further communication or indicated as being no longer available then retrieving the data subsets from available allocated remote separate storage locations identified in the further communication, recreating the data from the retrieved data subsets, and repeating the dividing,
allocating and both transmitting steps on the recreated data using the available remote separate storage locations identified in the further communication.
16. The method according to any previous claim, wherein dividing data into a plurality of data subsets further comprises the steps of:
a) separating the data into a plurality of separated subsets; b) generating parity data from the plurality of separated subsets such that any one or more of the plurality of separated subsets may be recreated from the remaining separated subsets and the parity data; and
c) repeating steps a and b on each of the plurality of separated subsets and parity data providing the data subsets consisting of further separated subsets and further parity data.
17. The method of claim 16, wherein step c) is repeated for each of the plurality of data subsets and parity data.
18. The method according to claim 16 or claim 17, wherein the data are separated bit-wise.
19. The method according to any of claims 16 to 18, wherein the parity data are generated by performing a logical function on the plurality of data subsets.
20. The method of claim 19, wherein the logical function is an exclusive OR.
21. The method according to any previous claim further comprising the step of recording where on the available remote separate storage locations the information specifying the allocation of each of the data subsets is stored.
22. A method of retrieving data comprising the steps of: receiving one or more communications identifying a plurality of tested and available remote separate storage locations;
retrieving from one or more of the available remote separate storage locations information specifying the allocation on the remote separate storage locations of a plurality of data subsets forming the data;
retrieving the data subsets from allocated remote separate storage locations; and
combining the retrieved data subsets to form the data.
23. The method of claim 22, wherein the information
specifying the allocation on the remote separate storage locations is in the form of a file allocation table, FAT, file.
24. The method of claim 22 or claim 23, wherein the
retrieving the data subsets step further comprises the steps of:
a) retrieving parity data from the separate storage locations;
b) recreating any missing data from the retrieved data subsets and parity data to form the recreated data; c) combining the retrieved data subsets and any recreated data to form a plurality of consolidated data sets, wherein the plurality of consolidated data sets include further data and further parity data; and
d) recreating any missing further data from the further data and further parity data to form recreated further data; and
e) combining the further data and any recreated further data to form the original data.
25. A method of allocating separate storage locations comprising the steps of:
testing the availability of each of a plurality of separate storage locations;
recording the results of the tests together with identifiers of each separate storage location; and
transmitting one or more communications identifying the available separate storage locations.
26. The method of claim 25, wherein the testing the
availability step is repeated at intervals.
27. The method of claim 25 or claim 26 further comprising the step of receiving an identifier of a further available separate storage location to include in the testing of availability.
28. A system for securely storing data comprising:
a plurality of separate storage locations;
an availability tester arranged to test the
availability of each of the plurality of separate storage locations, record the results of the tests together with identifiers of each separate storage location and transmit one or more communications identifying the available
separate storage locations; and
a data manager arranged to receive the one or more communications identifying the tested and available separate storage locations, divide the data into a plurality of data subsets, allocate each one of the plurality of data subsets to a different one of the available separate storage locations, transmit and store each one of the data subsets at the allocated available separate storage location, and transmit to one or more of the available separate storage locations information specifying the allocation of each of the data subsets.
29. The system of claim 28, wherein the data manager is further arranged to retrieve the information specifying the allocation of each of the data subsets, retrieve the data subsets from the allocated separate storage locations and combine the retrieved data subsets to form the data.
30. The system of claim 28 or claim 29, wherein the availability tester is further arranged to maintain a list of identifiers corresponding to the separate storage locations that pass the availability test.
31. The system of claim 30, wherein the transmitted one or more communications are generated from the maintained list.
32. A computer program comprising program instructions that, when executed on a computer, cause the computer to perform the method of any of claims 1 to 27.
33. A computer-readable medium carrying a computer program according to claim 32.
34. A computer programmed to perform the method of any of claims 1 to 27.
PCT/GB2010/001346 2009-07-17 2010-07-14 Distributed storage WO2011007141A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
GB0912508.9 2009-07-17
GB0912508A GB2467989B (en) 2009-07-17 2009-07-17 Distributed storage

Publications (1)

Publication Number Publication Date
WO2011007141A1 true WO2011007141A1 (en) 2011-01-20

Family

ID=41058169

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/GB2010/001346 WO2011007141A1 (en) 2009-07-17 2010-07-14 Distributed storage

Country Status (2)

Country Link
GB (1) GB2467989B (en)
WO (1) WO2011007141A1 (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
DE102011010613A1 (en) * 2011-02-08 2012-08-09 Fujitsu Technology Solutions Intellectual Property Gmbh A method of storing and restoring data, using the methods in a storage cloud, storage server and computer program product
CN110546938A (en) * 2017-02-10 2019-12-06 施莱德有限公司 Decentralized data storage

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20050240749A1 (en) * 2004-04-01 2005-10-27 Kabushiki Kaisha Toshiba Secure storage of data in a network
US20070079083A1 (en) * 2005-09-30 2007-04-05 Gladwin S Christopher Metadata management system for an information dispersed storage system
WO2007133791A2 (en) * 2006-05-15 2007-11-22 Richard Kane Data partitioning and distributing system

Family Cites Families (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6789077B1 (en) * 2000-05-09 2004-09-07 Sun Microsystems, Inc. Mechanism and apparatus for web-based searching of URI-addressable repositories in a distributed computing environment
WO2001098952A2 (en) * 2000-06-20 2001-12-27 Orbidex System and method of storing data to a recording medium
US20020032844A1 (en) * 2000-07-26 2002-03-14 West Karlon K. Distributed shared memory management
JP3951949B2 (en) * 2003-03-31 2007-08-01 日本電気株式会社 Distributed resource management system, distributed resource management method and program
TW200527223A (en) * 2003-11-28 2005-08-16 Cpm S A Electronic computing system-on demand and method for dynamic access to digital resources
JP4485230B2 (en) * 2004-03-23 2010-06-16 株式会社日立製作所 Migration execution method
US7403945B2 (en) * 2004-11-01 2008-07-22 Sybase, Inc. Distributed database system providing data and space management methodology
US8060648B2 (en) * 2005-08-31 2011-11-15 Cable Television Laboratories, Inc. Method and system of allocating data for subsequent retrieval
US8230432B2 (en) * 2007-05-24 2012-07-24 International Business Machines Corporation Defragmenting blocks in a clustered or distributed computing system

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20050240749A1 (en) * 2004-04-01 2005-10-27 Kabushiki Kaisha Toshiba Secure storage of data in a network
US20070079083A1 (en) * 2005-09-30 2007-04-05 Gladwin S Christopher Metadata management system for an information dispersed storage system
WO2007120437A2 (en) 2006-04-13 2007-10-25 Cleversafe, Inc. Metadata management system for an information dispersed storage system
WO2007133791A2 (en) * 2006-05-15 2007-11-22 Richard Kane Data partitioning and distributing system

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
"Storage Virtualization - Definition Why, What, Where, and How?", INTERNET CITATION, 1 November 2004 (2004-11-01), XP002393991, Retrieved from the Internet <URL:http://www.snseurope.com/snslink/magazine/features-full.php?id=2236&m agazine=November%202004> [retrieved on 20060808] *

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
DE102011010613A1 (en) * 2011-02-08 2012-08-09 Fujitsu Technology Solutions Intellectual Property Gmbh A method of storing and restoring data, using the methods in a storage cloud, storage server and computer program product
US9419796B2 (en) 2011-02-08 2016-08-16 Fujitsu Limited Method for storing and recovering data, utilization of the method in a storage cloud, storage server and computer program product
DE102011010613B4 (en) * 2011-02-08 2020-09-10 Fujitsu Ltd. Method for storing and restoring data, use of the methods in a storage cloud, storage server and computer program product
CN110546938A (en) * 2017-02-10 2019-12-06 施莱德有限公司 Decentralized data storage
CN110546938B (en) * 2017-02-10 2023-06-27 施莱德有限公司 Decentralized data storage

Also Published As

Publication number Publication date
GB2467989A (en) 2010-08-25
GB0912508D0 (en) 2009-08-26
GB2467989B (en) 2010-12-22

Similar Documents

Publication Publication Date Title
US11157366B1 (en) Securing data in a dispersed storage network
US10416889B2 (en) Session execution decision
US11256558B1 (en) Prioritized data rebuilding in a dispersed storage network based on consistency requirements
US10089036B2 (en) Migrating data in a distributed storage network
US8171101B2 (en) Smart access to a dispersed data storage network
EP2755161B1 (en) Secure online distributed data storage services
CN104603740B (en) Filing data identifies
AU2011305569B2 (en) Systems and methods for secure data sharing
US20170005797A1 (en) Resilient secret sharing cloud based architecture for data vault
US20170091031A1 (en) End-to-end secure data retrieval in a dispersed storage network
AU2016203740B2 (en) Simultaneous state-based cryptographic splitting in a secure storage appliance
US9665429B2 (en) Storage of data with verification in a dispersed storage network
Kuo et al. A hybrid cloud storage architecture for service operational high availability
CN104603776A (en) Archival data storage system
AU2011289239A1 (en) Systems and methods for secure remote storage of data
US10558581B1 (en) Systems and techniques for data recovery in a keymapless data storage system
AU2015203172B2 (en) Systems and methods for secure data sharing
US20080195675A1 (en) Method for Pertorming Distributed Backup on Client Workstations in a Computer Network
Jogdand et al. CSaaS-a multi-cloud framework for secure file storage technology using open ZFS
WO2011007141A1 (en) Distributed storage
CN111565144A (en) Data layered storage management method for instant communication tool
GB2483222A (en) Accessing a website by retrieving website data stored at separate storage locations
US11782789B2 (en) Encoding data and associated metadata in a storage network
Karnakanti Reduction of spatial overhead in decentralized cloud storage using IDA
Akintoye et al. A Survey on Storage Techniques in Cloud Computing

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 10739657

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 10739657

Country of ref document: EP

Kind code of ref document: A1