US20080320097A1 - Network distributed file system - Google Patents

Network distributed file system

Info

Publication number
US20080320097A1
Authority
US
United States
Prior art keywords
component
storage pool
file
storage
file information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Abandoned
Application number
US12/143,134
Inventor
Antoni SAWICKI
Tomasz NOWAK
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
TENOWARE R&D Ltd
Original Assignee
TENOWARE R&D Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by TENOWARE R&D Ltd
Assigned to TENOWARE R&D LIMITED. ASSIGNMENT OF ASSIGNORS INTEREST (SEE DOCUMENT FOR DETAILS). Assignors: NOWAK, TOMASZ; SAWICKI, ANTONI
Publication of US20080320097A1
Priority to US13/492,633 (published as US20130151653A1)
Priority to US13/492,615 (published as US9880753B2)
Priority to US14/519,003 (published as US20150067093A1)

Classifications

    • G06F3/067 Distributed or networked storage systems, e.g. storage area networks [SAN], network attached storage [NAS]
    • G06F11/1443 Transmit or communication errors
    • G06F11/1662 Data re-synchronization of a redundant component, or initial sync of replacement, additional or spare unit, the resynchronized component or unit being a persistent storage device
    • G06F15/17331 Distributed shared memory [DSM], e.g. remote direct memory access [RDMA]
    • G06F16/176 Support for shared access to files; File sharing support
    • G06F3/0607 Improving or facilitating administration, e.g. storage management, by facilitating the process of upgrading existing storage systems, e.g. for improving compatibility between host and storage device
    • G06F3/0619 Improving the reliability of storage systems in relation to data integrity, e.g. data losses, bit errors
    • G06F3/0644 Management of space entities, e.g. partitions, extents, pools
    • G06F3/0659 Command handling arrangements, e.g. command buffers, queues, command scheduling
    • H04L1/1809 Selective-repeat protocols
    • H04L67/1097 Protocols in which an application is distributed across nodes in the network for distributed storage of data in networks, e.g. transport arrangements for network file system [NFS], storage area networks [SAN] or network attached storage [NAS]
    • G06F11/1076 Parity data used in redundant arrays of independent storages, e.g. in RAID systems
    • G06F11/1425 Reconfiguring to eliminate the error by reconfiguration of node membership
    • G06F11/16 Error detection or correction of the data by redundancy in hardware
    • G06F11/2094 Redundant storage or storage space
    • G06F2201/815 Virtual (indexing scheme relating to error detection, error correction and monitoring)
    • G06F2211/104 Metadata, i.e. metadata associated with RAID systems with parity (indexing scheme relating to G06F11/1076)
    • G06F3/065 Replication mechanisms

Definitions

  • Every WRITE and WRITE_MIRRORED request carries a requesting-node-generated timestamp in the packet payload, and this timestamp is assigned to the metadata for the file/directory on every node.
  • The per-file data recovery process is performed by first retrieving the file size from the metadata (which, prior to data recovery, has to be "metadata recovered"). The file size is then divided into cluster sizes and standard READ_REQUESTs are performed to retrieve the data. An exception is the last cluster, which is retrieved from the metadata source node (lowest latency) using READ_MIRRORED_REQUEST. The last part of the recovery process comprises setting the proper metadata parameters (size, attributes, last modification time) on the file.
  • File and attribute comparison is performed recursively for all files and folders on the disk storage. When recovery is finished, all data is in sync and normal operations are resumed.
  • Alternative implementations of the invention can have dynamic recovery as opposed to recovery on startup only.
  • In such implementations, the networking thread can detect that the node has lost communication with the other nodes and perform recovery each time communication is restored.
  • Journaling can assist such recovery: the node could periodically check the journal, or its serial number, to detect whether any changes have been made that the node was unaware of. Journal checking, metadata recovery and last-cluster recovery should also be performed in a more distributed manner than simply trusting the node with the lowest latency.
  • Administration functions could also be expanded to enable one node to be replaced with another node in the VSP, individual nodes to be added or removed from the VSP or the VSP geometry to be changed.
  • The VSP could be implemented in a way that presents a continuous random-access device formatted with a native file system such as FAT, NTFS or EXT/UFS on Unix.
  • The VSP could also be used as a virtual magnetic tape device for storing backups using traditional backup software.
  • Native filesystem usage presents a potential problem in that multiple nodes, while updating the same volume, could corrupt the VSP file system metadata due to multi-node locking. To mitigate this, either a clustered filesystem would be used, or each node could access only a separate virtualized partition at a time.
  • An HA (high availability) Resource Group traditionally comprises a LUN, disk volume or partition residing on shared storage (a disk array or SAN) that is used only by this Resource Group and moves between nodes together with other resources.
  • Such a LUN or partition could be replaced with an NDFS VSP formed out of the cluster nodes and their internal disks, so removing the HA cluster software's dependency on shared physical storage.

Abstract

A storage pool component is operable on a computing device including a storage medium having an otherwise free storage capacity for forming a portion of a storage capacity of a storage pool and being operably connected across a network to at least one other such component. The component comprises configuration data identifying at least one other computing device to which the computing device may connect across the network; and a directory for identifying file information for files of the storage pool stored on the storage medium, the file information being stored with a degree of redundancy across the computing devices of the storage pool. On instantiation, the component communicates with at least one other component operating on one of the other computing devices to verify the contents of the directory. The component reconciles file information stored on the storage medium with file information from the remainder of the storage pool. The component then acts as a driver, responsive to an access request for a file stored in the storage pool received across the network from another component of the storage pool, for determining a location of the file on the storage medium from the directory and for accessing the file accordingly.

Description

    CROSS-REFERENCE TO RELATED APPLICATION
  • n/a
  • STATEMENT REGARDING FEDERALLY SPONSORED RESEARCH OR DEVELOPMENT
  • n/a
  • FIELD OF THE INVENTION
  • The present invention relates to a method and apparatus for distributing files within a network.
  • BACKGROUND OF THE INVENTION
  • Traditionally, computer data is stored in the form of individual files on long-term (non-volatile) storage, most commonly a "hard disk". Hard disks suffer from the following issues:
  • Limited capacity
  • Prone to damage and failure because of mechanical (moving) parts—short lifetime
  • Not shared, generally only one machine can access it at a time
  • To overcome such problems, disks can be combined into larger pools of storage with data protection, for example, in a RAID array. In terms of their interface to a host computer, disks or pools of disks can either be:
  • Internal—e.g. IDE, SATA, SCSI disks.
  • External—DAS Directly Attached Storage e.g. USB, SCSI, Fiber Channel. However, DAS is only capable of being connected to a very limited (<10) number of servers at a time.
  • External—NAS Network Attached Storage e.g. Ethernet, TCP/IP, IPX/SPX. NAS is essentially a more advanced file server, specifically designed in hardware.
  • External—SAN Storage Area Network e.g. Fiber Channel (FC) network infrastructure. It is acknowledged that SAN is capable of being connected to multiple machines; however, the infrastructure costs for doing so are prohibitive for desktops and, in spite of improvements such as iSCSI, SAN is typically used only for servers.
  • A pool of storage usually needs to be accessible by more than just a single machine. Traditionally the most common way of sharing storage is to use a “file server” which is a dedicated computer on the network, providing its storage pool (connected through any of the above 4 ways, internal or external) transparently to other computers by a “File Sharing Protocol” over a computer network (LAN/WAN/etc) with the possibility of adding extra security (access control) and availability (backups) from a central location.
  • Some commonly used file sharing protocols are:
      • CIFS/SMB/Windows File Sharing introduced by Microsoft with Windows 3.x
      • NFS introduced by Sun Microsystems and adopted by almost all Unix operating systems
      • NetWare introduced by Novell
      • AppleShare used by Apple computers
  • However, file servers suffer from some serious issues:
      • Single point of failure—when the server fails, all clients are unable to access data
      • Central bottleneck—when multiple clients access the same server over the network, congestion can occur
      • Limited capacity and scalability—file servers can run out of space when more clients are connected
      • Expensive dedicated hardware and per-gigabyte cost
      • Maintenance costs, upgrades, service, repairs, etc.
    SUMMARY OF THE INVENTION
  • The present invention provides a virtual storage pool from a combination of unused disk resources from participating nodes presented as a single virtual disk device (of combined size) available to and shared with all nodes of a network. Under a host operating system, the virtual storage pool is visible as a normal disk drive (a disk letter on Windows, a mount point on Unix); however, all disk I/O is distributed to participating nodes over the network. In the present specification, this is referred to as a Network Distributed File System (NDFS).
  • If any of the peer workstations becomes unavailable (even for a short period of time), the virtual storage pool could become unavailable or inconsistent. In preferred embodiments of the invention, to achieve availability comparable to server or NAS storage, data is distributed in such a way that if up to a predefined number of participating nodes become unavailable, the virtual storage pool is still accessible to the remaining computers.
  • The invention is based on a Peer-to-Peer (P2P) network protocol which allows data to be stored and retrieved in a distributed manner. The data redundancy mechanism is used at the protocol level in such a way that if any of the participating nodes is inaccessible or slow to respond to requests, it is automatically omitted from the processing. The protocol is therefore resistant to network congestion and breaks.
  • For example, given a network of 25 workstations, each having a 120 GB disk of which 100 GB is unused, a virtual storage pool of size 2.5 TB could be formed and made available to all nodes on the network.
  • The storage pool of the preferred embodiments, in contrast to the traditional file server approach, has the following characteristics:
      • No single point of failure—up to predefined number of participating nodes (peers) can become inaccessible and the data will still be available.
      • Reduced network bottlenecks—using a P2P network protocol provides the benefits of parallel I/O. The load is generally evenly and coherently distributed across the network as opposed to point-to-point transmission with a client-server model. Also the network protocol automatically adapts to network congestion by not requiring all data to be retrieved and ignoring nodes which are slow to respond.
      • Unlimited capacity and scalability—the size of the storage automatically grows with every node added to the network.
      • No extra hardware costs; moreover, resources that would otherwise be unused can be utilized.
  • In accordance with one aspect, the present invention provides a storage pool component operable on a computing device including a storage medium having an otherwise free storage capacity for forming a portion of a storage capacity of a storage pool and being operably connected across a network to at least one other storage pool component. Each storage pool component operates on a computing device providing a respective portion of the storage pool capacity. The storage pool component has configuration data identifying the at least one other computing device to which the computing device may connect across the network, a directory for identifying file information for files of the storage pool stored on the storage medium, the file information being stored with a degree of redundancy across the computing devices of the storage pool, means responsive to instantiation of the component for communicating with at least one other component operating on one of the at least one other computing devices for verifying the contents of the directory, means for reconciling file information stored on the storage medium with file information from the remainder of the storage pool, and a driver. The driver is responsive to an access request for a file stored in the storage pool received across the network from another component of the storage pool, and determines a location of the file on the storage medium from the directory for accessing the file accordingly.
  • In accordance with another aspect, the present invention provides a system having a plurality of computing devices. The plurality of computing devices each has a storage medium. At least one of the computing devices includes a storage pool component. The storage pool component is operable on the computing device and the storage medium has an otherwise free storage capacity for forming a portion of a storage capacity of a storage pool and is operably connected across a network to at least one other storage pool component. Each storage pool component operates on a computing device providing a respective portion of the storage pool capacity. The system also includes one or more legacy clients accessing the storage pool through a legacy disk device driver.
  • BRIEF DESCRIPTION OF THE DRAWINGS
  • A more complete understanding of the present invention, and the attendant advantages and features thereof, will be more readily understood by reference to the following detailed description when considered in conjunction with the accompanying drawings wherein:
  • FIG. 1 shows schematically a pair of virtual storage pools (VSP) distributed across a set of nodes according to an embodiment of the present invention;
  • FIG. 2 shows a client application accessing a virtual storage pool according to an embodiment of the invention;
  • FIG. 3 shows the main components within a Microsoft Windows client implementation of the invention;
  • FIG. 4 shows the main components within an alternative Microsoft Windows client implementation of the invention;
  • FIG. 5 a write operation being performed according to an embodiment of the invention; and
  • FIG. 6 shows a cluster of VSPs in a high availability group.
  • DETAILED DESCRIPTION OF THE INVENTION
  • In traditional data storage, the term “Storage Pool” refers to a pool of physical disks or logical disks served by a SAN or LUNs (Logical Units) in DAS. Such a storage pool can be used either as a whole or partially, to create a higher level “logical volume(s)” by means of RAID-0, 1, 5, etc. before being finally presented through the operating system.
  • According to the present invention, a storage pool is created through a network of clients communicating through a P2P network protocol. For the purposes of the present description, the network includes the following node types:
      • Server—node contributing local storage resources (free disk space) to the storage pool through the P2P protocol. A server receives requests from clients to either provide or modify data stored at that node. As such it will be seen that server nodes can be regular workstations that need not run applications which access the VSP, dedicated network servers (such as Windows Server, Unix Server, etc), Network Attached Storage (NAS) or Storage Area Network (SAN) devices.
      • Active Client—node accessing the storage pool through the P2P protocol. An Active Client, when accessing the data, communicates with all available servers simultaneously. When one or more of the servers is delayed in responding or is completely unavailable, the missing piece of data can be rebuilt using redundant chunks of data obtained from the other servers.
      • Peer Node—a node, which is both Server and Active Client at same time. This is the most common node type.
      • Legacy Client—a node accessing the server pool through a legacy protocol like CIFS or NFS, through a gateway on an Active Client.
  • From the above it will be seen that a virtual storage pool can be created from Peer Nodes or Server Nodes and accessible for Active Clients or Peer Nodes. VSP therefore can function for example on:
      • workstations sharing their own LSEs with their shared VSP—Active Nodes;
      • workstations sharing their own LSEs with external Active (or Legacy) Clients—Server/Peer Nodes to Active Nodes;
      • network servers, sharing their own LSEs with their common shared VSP—Active Nodes;
      • network servers (for example NAS or SAN) sharing their own LSE with external network Active (or Legacy) Clients—Server/Peer Nodes to Active Nodes;
      • mixed network of active clients and servers sharing their LSE with their common shared VSP—Active Nodes; or
      • mixed network of active clients and servers sharing their LSE with external Active (or Legacy) Clients—Server/Peer Nodes to Active Nodes.
  • Referring now to FIG. 1, a VSP (Virtual Storage Pool), VSP A or VSP B, according to the preferred embodiment is formed from Local Storage Entities (LSEs) served by either Server or Peer Nodes 1 . . . 5. In a simple implementation, an LSE can be just a hidden subdirectory on a disk of the node. However, alternative implementations referred to later could implement an LSE as an embedded transactional database. In general, LSE size is determined by the available free storage space on the various nodes contributing to the VSP. Preferably, the LSE size is the same on every node, and so the global LSE size within a VSP will be dependent on the smallest LSE in the VSP.
  • The size of the VSP is calculated from the VSP Geometry (a small sizing sketch follows this list):
      • If no data redundancy is used (Geometry=N), the size of the VSP is determined by the number N of Server or Peer Nodes multiplied by the size of the LSE.
      • When mirroring (M replicas) is being used (Geometry=1+M), the size of the VSP is equal to the size of the LSE.
      • When RAID3/5 is being used (Geometry=N+1), the size of the VSP equals N+1 multiplied by the size of the LSE.
      • When RAID-6 is being used (Geometry=N+2), the size of the VSP equals N+2 multiplied by the size of the LSE.
      • If N+M redundancy is used (Geometry=N+M), the size of the VSP equals N multiplied by the size of the LSE.
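  • (Illustrative sketch, not part of the original patent text.) The general N+M sizing rule above can be expressed as a small C helper; the no-redundancy and mirrored geometries fall out as M=0 and N=1 respectively. The function and parameter names are assumptions.
    #include <stdint.h>

    /* Sketch: usable VSP capacity for an N+M geometry, where every node
     * contributes one LSE of lse_bytes.  N nodes' worth of capacity holds
     * unique data and M nodes' worth holds redundancy, so the usable size
     * is N * LSE; mirroring is the special case N = 1. */
    static uint64_t vsp_usable_bytes(uint32_t n_data_nodes, uint64_t lse_bytes)
    {
        return (uint64_t)n_data_nodes * lse_bytes;
    }

    /* Example from the summary above: 25 workstations with 100 GB free each
     * and no redundancy (N = 25, M = 0) give roughly 2.5 TB of pooled space. */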
  • Because the LSE is the same on every node, a situation may occur where one or a few nodes having a major storage size difference could be under-utilized in contributing to the virtual network storage. For example, in a workgroup of 6 nodes, two having 60 GB disks and four having 120 GB disks, the LSE on the two smaller nodes may be only 60 GB, and so a single VSP size could only be 6*60 GB=360 GB as opposed to 120+120+120+120+60+60=600 GB. In such a situation, multiple VSPs can be defined. So in the above example, two VSPs could be created, one 6*60 GB and a second 4*60 GB, and these will be visible as two separate network disks. In fact, multiple VSPs enable different redundancy levels and security characteristics to be applied to different VSPs, so enabling greater flexibility for administrators.
  • Using the invention, a VSP is visible to an Active Client, Peer Node or indeed Legacy Client as a normal disk formed from the combination of LSEs with one of the geometries outlined above. When a client stores or retrieves data from a VSP it attempts to connect to every Server or Peer Node of the VSP and to perform an LSE I/O operation with an offset based on VSP Geometry.
  • Before describing an implementation of the invention in detail, we define the following terms:
      • LSE Block Size (LBS) is the minimal size of data that can be accessed on an LSE. Currently it is hard-coded at 1024 bytes.
      • Network Block Size (NBS) is a maximum size of data payload to be transferred in a single packet. Preferably, NBS is smaller than the network MTU (Maximum Transmission Unit)/MSS (Maximum Segment Size) and in the present implementations NBS is equal to LBS, i.e. 1024 bytes, to avoid network fragmentation. (Standard MTU size on an Ethernet type network is 1500 bytes).
      • VSP Block Size (VBS) is the size of data block at which data is distributed within the P2P network: VBS=LBS*number of non-redundant nodes (N).
  • VSP Cluster Size (VCS)—data (the contents of the files before redundancy is calculated) is divided into so-called clusters, similar to the data clusters of traditional disk-based file systems (FAT, NTFS). Cluster size is determined by the VSP Geometry and NBS (Network Block Size) in the following way:

  • VCS=Number of Data Nodes*NBS
      • VCS is a constant data size that a redundancy algorithm can be applied to. If a data unit is smaller than VCS, mirroring is used. If a data unit is larger than VCS, it will be wrapped into a new cluster. For example, with reference to FIG. 5, if a VSP has 5 data nodes and the NBS is 1400 bytes, the VCS would be 5*1400=7000 bytes. If a client application performs a write I/O operation of 25 kilobytes of data, NDFS will split it into three clusters (of 7000 bytes each) and the remaining 4000 bytes will be mirrored among the nodes. Another implementation would pad the remaining 4000 bytes with 3000 zeros up to the full cluster size and distribute them among the nodes as a fourth cluster. (A short sketch of this splitting appears after this list.)
      • Host Block Size (HBS) is the block size used on a host operating system.
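  • (Illustrative sketch, not part of the original patent text.) The FIG. 5 example from the list above—5 data nodes, an NBS of 1400 bytes and a 25 kilobyte write—can be reproduced with a few lines of C; full clusters would be distributed with N+M redundancy while the remainder is mirrored. All names are illustrative.
    #include <stdio.h>

    int main(void)
    {
        const unsigned data_nodes = 5;                /* N                    */
        const unsigned nbs        = 1400;             /* Network Block Size   */
        const unsigned vcs        = data_nodes * nbs; /* VCS = N*NBS = 7000   */

        unsigned write_len = 25000;                   /* 25 kB client write   */
        unsigned clusters  = write_len / vcs;         /* distributed N+M      */
        unsigned tail      = write_len % vcs;         /* sent WRITE_MIRRORED  */

        printf("VCS=%u: %u full clusters + %u mirrored bytes\n",
               vcs, clusters, tail);                  /* 3 full + 4000 mirrored */
        return 0;
    }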
  • Referring now to the implementation of FIG. 3 where only Peer Nodes and a single VSP per network are considered. In this implementation, a simple user mode application (u_ndfs.exe) is used for startup, maintenance, recovery, cleanup, VSP forming, LSE operations and the P2P protocol, however, it will be seen that separate functionality could equally be implemented in separate applications.
  • Upon startup, u_ndfs.exe reads config.xml, a configuration file, which defines LSE location and VSP properties i.e. geometry, disk name and IP addresses of peer nodes. (The configuration file is defined through user interaction with a configuration GUI portion (CONFIG GUI) of U_ndfs.) U_ndfs then spawns a networking P2P protocol thread, NDFS Service. The network protocol used by the thread binds to a local interface on a UDP port and starts network communications with other nodes contributing to the VSP.
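  • (Illustrative sketch, not part of the original patent text.) The socket set-up performed by the networking thread might look roughly as follows, shown with BSD-style sockets for brevity; a Winsock build would additionally call WSAStartup, and the port number and error handling are assumptions.
    #include <string.h>
    #include <sys/socket.h>
    #include <netinet/in.h>
    #include <arpa/inet.h>

    /* Sketch: open a UDP socket bound to a local interface with broadcast
     * enabled, since the protocol uses both broadcast and unicast packets. */
    int ndfs_open_socket(unsigned short port)  /* port would come from config.xml */
    {
        int s = socket(AF_INET, SOCK_DGRAM, 0);
        if (s < 0)
            return -1;

        int on = 1;
        setsockopt(s, SOL_SOCKET, SO_BROADCAST, &on, sizeof(on));

        struct sockaddr_in addr;
        memset(&addr, 0, sizeof(addr));
        addr.sin_family      = AF_INET;
        addr.sin_addr.s_addr = htonl(INADDR_ANY);
        addr.sin_port        = htons(port);

        if (bind(s, (struct sockaddr *)&addr, sizeof(addr)) < 0)
            return -1;
        return s;
    }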
  • If fewer than a quorum N of the N+M nodes are detected by the node on start-up, the VSP is suspended for that node until a quorum is reached.
  • Where there is N+M redundancy and where N<=M, it is possible for two separate quorums to exist on two detached networks. In such a case, if N<=50% of N+M, but a quorum is reached at a node, the VSP is set to read-only mode at that node.
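  • (Illustrative sketch, not part of the original patent text.) The quorum and read-only rules above can be summarized in a small decision helper; the type and function names are assumptions.
    typedef enum { VSP_SUSPENDED, VSP_READ_ONLY, VSP_READ_WRITE } vsp_state;

    /* Sketch: decide the local VSP state from the number of reachable nodes.
     * Fewer than N reachable nodes means no quorum; if N is not more than
     * half of N+M, two detached quorums could exist, so writes are refused. */
    static vsp_state vsp_state_from_peers(unsigned n, unsigned m, unsigned reachable)
    {
        if (reachable < n)
            return VSP_SUSPENDED;
        if (2 * n <= n + m)              /* N <= 50% of N+M */
            return VSP_READ_ONLY;
        return VSP_READ_WRITE;
    }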
  • Once a quorum is present, local LSE to VSP directory comparison is performed by recovering directory metadata from another node.
  • If the VSP contains any newer files/directories than the local LSE (for instance if the node has been off the network and files/directories have been changed), a recovery procedure is performed by retrieving redundant network parts from one or more other nodes and rebuilding LSE data for the given file/directory. In a simple implementation, for recovery, the node closest to the requesting node based on network latency is used as the source for metadata recovery.
  • So for example, in an N+M redundancy VSP implementation, a file is split into N+M clusters, each cluster containing a data component and a redundant component. Where one or more of the N+M nodes of the VSP were unavailable when the file was written or updated, then during recovery the previously unavailable node must obtain at least N of the clusters in order to rebuild the cluster which should be stored for the file on the recovering node, to maintain the overall level of redundancy for all files of the VSP.
  • It will also be seen that, after start-up and recovery, the networking protocol should remain aware of network failure and needs to perform an LSE rescan and recovery every time the node is reconnected to the network. The user should be alerted to expect access to the VSP when this happens.
  • A transaction log can be employed to speed up the recovery process instead of using a directory scan, and if the number of changes to the VSP exceeds the log size, a full recovery could be performed.
  • It can also be useful during recovery to perform a full disk scan in the manner of fsck ("file system check" or "file system consistency check" in UNIX) or chkdsk (Windows) to ensure files have not been corrupted.
  • When the LSE data is consistent with the VSP, the networking thread begins server operations and u_ndfs.exe loads a VSP disk device kernel driver (ndfs.sys). The disk device driver (NDFS Driver) then listens to requests from the local operating system and applications, while u_ndfs.exe listens to requests from other nodes through the networking thread.
  • Referring to FIG. 2, in operation, an application (for instance Microsoft Word) running on the host operating system calls the I/O subsystem in the OS kernel and requests a portion of data with an offset (0 to file length) and size. (If the size is bigger than HBS, the kernel will fragment the request into smaller subsequent requests.) The I/O subsystem then sends an IRP (I/O request packet) message to the responsible device driver module, the NDFS driver. In the case of a request to the VSP, the kernel device driver receives the request and passes it on to the P2P network protocol thread, NDFS Service, for further processing based on the VSP geometry.
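  • (Illustrative sketch, not part of the original patent text.) To make the geometry-based processing concrete, a host request's byte offset can be mapped onto a cluster index, the offset within that cluster, and the NBS-sized stripe (and hence node) it falls on, assuming VCS = N*NBS as defined earlier. None of these names appear in the patent.
    #include <stdint.h>

    struct ndfs_locator {
        uint64_t cluster;       /* which VCS-sized cluster of the file      */
        uint32_t cluster_off;   /* byte offset inside that cluster          */
        uint32_t stripe;        /* which NBS-sized block, i.e. which node   */
        uint32_t stripe_off;    /* byte offset inside that NBS-sized block  */
    };

    /* Sketch: locate a file byte offset, assuming VCS = data_nodes * NBS. */
    static struct ndfs_locator ndfs_locate(uint64_t file_off,
                                           uint32_t nbs, uint32_t data_nodes)
    {
        const uint32_t vcs = nbs * data_nodes;
        struct ndfs_locator loc;
        loc.cluster     = file_off / vcs;
        loc.cluster_off = (uint32_t)(file_off % vcs);
        loc.stripe      = loc.cluster_off / nbs;
        loc.stripe_off  = loc.cluster_off % nbs;
        return loc;
    }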
  • At the same time, when the server side of the networking thread receives a request from a client node through the network, an LSE I/O operation is performed on the local storage.
  • Both client and server I/Os can be thought of as normal I/O operations with an exception that they are intercepted and passed through the NDFS driver and NDFS service like a proxy. N+M redundancy can thus be implemented with the P2P network protocol transparent to both clients and servers.
  • Referring now to FIG. 4, in a further refined implementation of the invention, a separate kernel driver, NDFS Net Driver, is implemented for high-speed network communications instead of using Winsock. This driver implements its own layer-3 protocol and only reverts to IP/UDP in case of communication problems.
  • Also, instead of using the Windows file system for the LSE, a database, NDFS DB, can be used. Such a database implemented LSE can also prevent users from manipulating the raw data stored in a hidden directory as in the implementation of FIG. 3.
  • For the implementation of FIG. 3, a P2P network protocol is used to provide communications between VSP peer nodes on the network. Preferably, every protocol packet comprises:
  • Protocol ID
  • Protocol Version
  • Geometry
  • Function ID
  • Function Data
  • For the implementations of FIGS. 3 and 4, the following functions are defined:
  • NDFS_FN_READ_FILE_REQUEST 0x0101
    NDFS_FN_READ_FILE_REPLY 0x0201
    NDFS_FN_WRITE_FILE 0x0202
    NDFS_FN_CREATE_FILE 0x0102
    NDFS_FN_DELETE_FILE 0x0103
    NDFS_FN_RENAME_FILE 0x0104
    NDFS_FN_SET_FILE_SIZE 0x0105
    NDFS_FN_SET_FILE_ATTR 0x0106
    NDFS_FN_QUERY_DIR_REQUEST 0x0207
    NDFS_FN_QUERY_DIR_REPLY 0x0203
    NDFS_FN_PING_REQUEST 0x0108
    NDFS_FN_PING_REPLY 0x0204
    NDFS_FN_WRITE_MIRRORED 0x0109
    NDFS_FN_READ_MIRRORED_REQUEST 0x0205
    NDFS_FN_READ_MIRRORED_REPLY 0x0206
  • As can be seen above, every function has a unique id, and the highest order byte defines whether the given function is BROADCAST (1) or UNICAST (2) based.
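  • (Illustrative sketch, not part of the original patent text.) Combining the packet fields listed earlier with the function-ID convention above, a possible on-the-wire header could be declared as follows; the field widths, packing and the placement of the 64-bit packet ID are assumptions.
    #include <stdint.h>

    /* Sketch of an NDFS packet header; the text only lists the fields, so
     * the widths, packing and packet-ID placement shown here are assumed. */
    #pragma pack(push, 1)
    struct ndfs_packet_hdr {
        uint32_t protocol_id;       /* Protocol ID                                 */
        uint16_t protocol_version;  /* Protocol Version                            */
        uint16_t geometry;          /* VSP Geometry                                */
        uint16_t function_id;       /* e.g. NDFS_FN_READ_FILE_REQUEST              */
        uint64_t packet_id;         /* 64-bit REQUEST/REPLY ID (placement assumed) */
    };                              /* Function Data follows the header            */
    #pragma pack(pop)

    /* Highest-order byte of the function ID: 1 = BROADCAST, 2 = UNICAST. */
    #define NDFS_FN_IS_BROADCAST(fn)  ((((fn) >> 8) & 0xff) == 0x01)
    #define NDFS_FN_IS_UNICAST(fn)    ((((fn) >> 8) & 0xff) == 0x02)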
  • The functions can be categorized as carrying data or metadata (directory operations). Also defined are control functions such as PING, which do not directly influence the file system or data.
  • Functions which carry data are as follows:
  • READ_REQUEST
  • READ_REPLY
  • WRITE
  • WRITE_MIRRORED
  • READ_MIRRORED_REQUEST
  • READ_MIRRORED_REPLY
  • whereas functions which carry metadata are as follows:
      • CREATE—creates a file or directory with a given name and attributes
      • DELETE—deletes a file or directory with its contents
      • RENAME—renames a file or directory or changes its location in the directory structure (MOVE)
      • SET_ATTR—changes file attributes
      • SET_SIZE—sets file size. Note that the file size doesn't imply how much space the file physically occupies on the disk and is only an attribute.
      • QUERY_DIR_REQUEST
      • QUERY_DIR_REPLY
  • In the present implementations, all metadata (directory information) is available on every participating node. All functions manipulating metadata are therefore BROADCAST based and do not require two-way communications—the node modifying the metadata sends a broadcast message to all other nodes so that they update their copies. Verification of such operations is performed only on the requesting node.
  • The rest of the metadata functions are used to read directory contents and are used in the recovery process. These functions are unicast based, because the implementations assume metadata to be consistent on all available nodes.
  • After fragmentation of a file into clusters, the last fragment usually has a random size smaller than the full cluster size (unless the file size is an exact multiple of the cluster size). Such a fragment cannot easily be distributed using N+M redundancy and is stored using 1+M redundancy (replication) using the function WRITE_MIRRORED. This is also valid for files that are smaller than the cluster size. (Alternative implementations may have different functionality such as padding or reducing the block size to 1 byte.)
  • WRITE_MIRRORED is a BROADCAST function because an identical data portion is replicated to all nodes. It should be noted that for READ_MIRRORED operations, all data is available locally (because it is identical on every node) and no network I/O is required for such small portions of data (except for recovery purposes).
  • Note that the mirrored block size has to be smaller than the cluster size; however, it can be larger than the NBS. In such cases more than one WRITE_MIRRORED packet has to be sent, with a different offset for the data being written.
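  • (Illustrative sketch, not part of the original patent text.) A mirrored block larger than NBS would simply be sent as several WRITE_MIRRORED packets, each with its own offset; the send routine here is a stub, as its real signature is not given.
    #include <stdint.h>

    /* Stub standing in for the real broadcast send, which is not specified. */
    static void ndfs_send_write_mirrored(uint64_t off, const uint8_t *p, uint32_t n)
    {
        (void)off; (void)p; (void)n;     /* WRITE_MIRRORED broadcast goes here */
    }

    /* Sketch: replicate a mirrored fragment in NBS-sized pieces, each packet
     * carrying its own offset into the file. */
    static void write_mirrored_tail(uint64_t file_off, const uint8_t *data,
                                    uint32_t len, uint32_t nbs)
    {
        uint32_t sent = 0;
        while (sent < len) {
            uint32_t chunk = (len - sent > nbs) ? nbs : (len - sent);
            ndfs_send_write_mirrored(file_off + sent, data + sent, chunk);
            sent += chunk;
        }
    }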
  • In implementing N+M redundancy, clusters are divided into individual packets. To read data from a file, the broadcast function READ_REQUEST is used. The function is sent to all nodes with the cluster offset to be retrieved. Every node replies with the unicast function READ_REPLY carrying its own data for that cluster, at NBS size.
  • The node issuing the READ_REQUEST waits until it has received READ_REPLY packets from a number of data nodes sufficient to recover the data. Once enough packets have been received, any further reply packets are discarded. The data is then processed by an N+M redundancy function to recover the original file data.
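  • A minimal sketch of this collection step, under the assumption of hypothetical ndfs_recv_reply and ndfs_nm_recover helpers (the patent does not specify its redundancy coding), could be:

    #include <stdint.h>
    #include <stddef.h>

    #define NDFS_NBS       8192         /* assumed network block size */
    #define NDFS_MAX_NODES 16           /* assumed upper bound on N   */

    struct ndfs_read_reply {
        uint64_t request_id;
        uint32_t node_index;            /* which stripe this reply carries */
        uint8_t  data[NDFS_NBS];
    };

    /* Hypothetical helpers: blocking receive and the N+M decode step. */
    int ndfs_recv_reply(uint64_t request_id, struct ndfs_read_reply *out);
    int ndfs_nm_recover(const struct ndfs_read_reply *replies, int count, uint8_t *cluster_out);

    /* Collect READ_REPLY packets until N (the data-node count) have arrived,
       then run the N+M redundancy function to rebuild the original cluster.
       Replies arriving after that point are simply discarded elsewhere. */
    int ndfs_read_cluster(uint64_t request_id, int n_data_nodes, uint8_t *cluster_out)
    {
        struct ndfs_read_reply replies[NDFS_MAX_NODES];
        int have = 0;

        if (n_data_nodes < 1 || n_data_nodes > NDFS_MAX_NODES)
            return -1;
        while (have < n_data_nodes) {
            if (ndfs_recv_reply(request_id, &replies[have]) != 0)
                return -1;              /* timeout or transport error */
            have++;
        }
        return ndfs_nm_recover(replies, have, cluster_out);
    }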
  • REQUEST/REPLY function pairs carry a unique 64-bit identification number, generated from the computer's system clock and inserted when the REQUEST is sent. The packet ID is stored in a queue. When the required number of REPLY packets with the same ID has been received, the REQUEST ID is removed from the queue. Packets with IDs not matching those in the queue are discarded.
  • The packet ID is also used in functions other than REQUEST/REPLY to prevent a broadcast function from being executed on the node that sent it. When a node receives a REQUEST packet whose ID matches a REQUEST ID in its own REQUEST queue, the packet is recognised as the node's own broadcast and the REQUEST is simply removed from the queue; otherwise the REQUEST function in the packet is executed.
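  • One possible shape for this ID handling is sketched below. The queue depth is an assumption, and a POSIX clock call is used for brevity where a kernel-mode driver would use the platform's own time source.

    #include <stdint.h>
    #include <time.h>

    #define NDFS_ID_QUEUE_LEN 64        /* assumed queue depth */

    static uint64_t id_queue[NDFS_ID_QUEUE_LEN];
    static int      id_count;

    /* Generate a 64-bit request ID from the system clock, as described above. */
    static uint64_t ndfs_new_request_id(void)
    {
        struct timespec ts;
        clock_gettime(CLOCK_REALTIME, &ts);
        return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
    }

    /* Remember an ID when the local node sends a REQUEST or broadcast. */
    static void ndfs_id_enqueue(uint64_t id)
    {
        if (id_count < NDFS_ID_QUEUE_LEN)
            id_queue[id_count++] = id;
    }

    /* Returns 1 (and removes the entry) if the ID was issued locally: the
       arriving packet is this node's own broadcast and must not be executed. */
    static int ndfs_id_match_and_remove(uint64_t id)
    {
        for (int i = 0; i < id_count; i++) {
            if (id_queue[i] == id) {
                id_queue[i] = id_queue[--id_count];
                return 1;
            }
        }
        return 0;
    }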
  • The broadcast function PING_REQUEST is sent when the networking thread is started on a given node. In response, the node receives a number of unicast PING_REPLY responses from the other nodes, and if fewer than the required number are received, the VSP is suspended until quorum is reached.
  • Every other node that starts up sends its own PING_REQUEST packets, and these can be used to indicate to a node that the required number of nodes is now available, so that VSP operations can be resumed for read-only or read/write access.
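  • The quorum decision could be expressed as in the sketch below. The thresholds follow the behaviour described later in the claims (fewer than N accessible nodes halts the pool; below a majority, only reads are permitted), but the exact policy would depend on the configured geometry, so this is an assumption rather than a prescribed rule.

    /* Quorum states for the virtual storage pool (VSP). */
    enum ndfs_vsp_state { VSP_SUSPENDED, VSP_READ_ONLY, VSP_READ_WRITE };

    /* Decide the operating mode from the number of PING_REPLY responses
       received after broadcasting PING_REQUEST at networking-thread start. */
    enum ndfs_vsp_state ndfs_quorum_state(int replies_seen, int n_data, int m_redundant)
    {
        int total    = n_data + m_redundant;
        int nodes_up = replies_seen + 1;          /* count this node itself       */

        if (nodes_up < n_data)
            return VSP_SUSPENDED;                 /* data cannot be reconstructed */
        if (nodes_up <= total / 2)
            return VSP_READ_ONLY;                 /* below a majority: no writes  */
        return VSP_READ_WRITE;
    }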
  • The PING functions are used to establish the closest (lowest latency) machine to the requesting node, and this is used when recovery is performed. As explained above, re-sync and recovery are initiated when a node starts up and connects to a network that has already reached quorum. This is done to synchronize any changes made to files while the node was off the network. When the recovery process is started, every file in every directory is marked with a special attribute. The attribute is removed after recovery is performed. During the recovery operation the disk is not visible to the local user. However, remote nodes can perform I/O operations on the locally stored files not marked with the recovery attribute. This ensures that data cannot be corrupted by desynchronization.
  • The recovering node reads the directory from the lowest latency node using the QUERY_DIR_REQUEST/REPLY functions. The directory is compared to the locally stored metadata for the VSP. When comparing individual files, the following properties are taken into consideration (a reconciliation sketch follows the list):
      • Name—if the file is present on the source machine and not present on the local node, the file will be created using the received metadata and the file recovery process will be performed. If the file exists on the local node and does not exist on the remote node, it will be removed locally. Exactly the same procedure applies to directories (which are processed recursively).
      • Size of file—if the locally stored file size differs from that on the source node, the file is removed and recovered.
      • Last modification time—if the modification time differs, the file is deleted and recovered.
      • File attributes (e.g. read-only, hidden, archive)—unlike the previous parameters, in case of a difference in file attributes the file is not deleted and recovered; instead, only the attributes are applied. In more extensive implementations, attributes such as Access Control List (ACL) and security information can be applied as well. Some implementations may also include additional attributes such as file versioning or snapshots.
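  • The per-file part of this comparison could be reduced to a routine such as the one below (the name presence/absence checks happen one level up, when the two directory listings are walked). The structure layout and helper names are assumptions for illustration only.

    #include <stdint.h>

    /* Minimal per-file metadata, as compared during recovery. */
    struct ndfs_meta {
        char     name[256];
        uint64_t size;
        uint64_t mtime;        /* requesting-node timestamp carried in WRITE packets */
        uint32_t attributes;   /* e.g. read-only, hidden, archive                    */
    };

    /* Hypothetical recovery actions. */
    void ndfs_recover_file(const struct ndfs_meta *remote);   /* delete and re-fetch data */
    void ndfs_apply_attributes(const char *name, uint32_t attributes);

    /* Compare a locally stored entry against the copy read from the
       lowest-latency node and act as described above. */
    void ndfs_reconcile_entry(const struct ndfs_meta *local, const struct ndfs_meta *remote)
    {
        if (local->size != remote->size || local->mtime != remote->mtime) {
            ndfs_recover_file(remote);          /* stale content: delete and recover */
            return;
        }
        if (local->attributes != remote->attributes)
            ndfs_apply_attributes(remote->name, remote->attributes);  /* attributes only */
    }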
  • Note that last-modification-time recovery would not make sense if local time were used on every machine. Instead, every WRITE and WRITE_MIRRORED request carries a timestamp generated by the requesting node in the packet payload, and this timestamp is assigned to the metadata for the file/directory on every node.
  • The per-file data recovery process is performed by first retrieving the file size from the metadata (which, prior to data recovery, has to have been "metadata recovered"). The file size is then divided into cluster-sized pieces and standard READ_REQUESTs are performed to retrieve the data. An exception is the last cluster, which is retrieved from the metadata source node (lowest latency) using READ_MIRRORED_REQUEST. The last part of the recovery process comprises setting the proper metadata parameters (size, attributes, last modification time) on the file.
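  • A compact sketch of that recovery loop is given below; the cluster size and all helper routines are assumptions introduced for illustration, with source_node standing for the lowest-latency node.

    #include <stdint.h>

    #define NDFS_CLUSTER_SIZE (64 * 1024)   /* assumed cluster size */

    /* Hypothetical helpers for the two retrieval paths and the final metadata fix-up. */
    int ndfs_read_cluster_nm(const char *path, uint64_t cluster_index, uint8_t *out);
    int ndfs_read_tail_mirrored(const char *path, uint64_t offset, uint64_t len,
                                int source_node, uint8_t *out);
    int ndfs_store_local(const char *path, uint64_t offset, const uint8_t *data, uint64_t len);
    int ndfs_apply_metadata(const char *path);  /* size, attributes, last modification time */

    int ndfs_recover_file_data(const char *path, uint64_t size, int source_node,
                               uint8_t *scratch /* NDFS_CLUSTER_SIZE bytes */)
    {
        uint64_t full = size / NDFS_CLUSTER_SIZE;
        uint64_t tail = size % NDFS_CLUSTER_SIZE;

        for (uint64_t c = 0; c < full; c++) {           /* standard N+M reads       */
            if (ndfs_read_cluster_nm(path, c, scratch) != 0)
                return -1;
            ndfs_store_local(path, c * NDFS_CLUSTER_SIZE, scratch, NDFS_CLUSTER_SIZE);
        }
        if (tail) {                                     /* last cluster is mirrored */
            if (ndfs_read_tail_mirrored(path, full * NDFS_CLUSTER_SIZE, tail,
                                        source_node, scratch) != 0)
                return -1;
            ndfs_store_local(path, full * NDFS_CLUSTER_SIZE, scratch, tail);
        }
        return ndfs_apply_metadata(path);               /* finish with metadata     */
    }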
  • File and attribute comparison is performed recursively for all files and folders on the disk storage. When recovery is finished all data is in sync and normal operations are resumed.
  • Alternative implementations of the invention can have dynamic recovery as opposed to recovery on startup only. For example, the networking thread can detect that the node has lost communication with the other nodes and perform recovery each time communication is restored.
  • As mentioned above, a live transaction log file (journaling) can assist such recovery: the node could periodically check the journal, or its serial number, to detect whether any changes have been made that the node was unaware of. Also, the journal checking and the metadata and last-cluster recovery should be performed in a more distributed manner than simply trusting the node with the lowest latency.
  • While the above implementations have been described as implemented on Windows platforms, it will be seen that the invention can equally be implemented with other operating systems, as, despite operating system differences, a similar architecture to that shown in FIGS. 3 and 4 can be used.
  • In more extensive implementations of the invention, different security models can be applied to a VSP:
      • Open Access—no additional security mechanisms; anyone with a compatible client can access the VSP. Only collision detection will have to be performed to avoid data corruption. Standard Windows ACLs and Active Directory authentication will apply.
      • Symmetric Key Access—a node trying to access the VSP will have to provide a shared pass-phrase. The data on the LSE and/or the protocol messages will be encrypted, and the pass-phrase will be used to decrypt data on the fly when doing N+M redundancy calculations.
      • Certificate Security—in this security model, when forming a VSP, every node will have to exchange its public key with every other node on the network. When a new node tries to access the VSP it will have to be authorized on every existing participating node (very high security).
  • While the implementations above have been described in terms of active clients, servers and peer nodes, it will be seen that the invention can easily be made available to legacy clients, for example using a Windows Share. It may be particularly desirable to allow only this form of access to clients which are unlikely to be highly available, for example a laptop, as becoming a peer in a VSP could place an undue recovery burden not only on the laptop but on the other nodes participating in the VSP as the laptop connects to and disconnects from the network.
  • Further variations of the above described implementations are also possible. For example, rather than using an IP or MAC address to identify nodes participating in a VSP, a dedicated NODE_ID could be used. Administration functions could also be expanded to enable one node to be replaced with another node in the VSP, individual nodes to be added to or removed from the VSP, or the VSP geometry to be changed.
  • Additionally, the VSP could be implemented in a way that presents a continuous random-access device formatted with a native file system such as FAT, NTFS or EXT/UFS on Unix. The VSP could also be used as a virtual magnetic tape device for storing backups made using traditional backup software.
  • Native filesystem usage presents a potential problem in that multiple nodes, while updating the same volume, could corrupt the VSP file system metadata in the absence of multi-node locking. To mitigate this, either a clustered filesystem would be used, or each node could access only a separate virtualized partition at a time.
  • For example, in a High Availability cluster such as Microsoft Cluster Server, Sun Cluster or HP Serviceguard, an HA Resource Group traditionally comprises a LUN, disk volume or partition residing on shared storage (a disk array or SAN) that is used only by this Resource Group and moves between nodes together with other resources. Referring now to FIG. 6, such a LUN or partition could be replaced with an NDFS VSP formed out of the cluster nodes and their internal disks, so removing the HA cluster software's dependency on shared physical storage.
  • It will be appreciated by persons skilled in the art that the present invention is not limited to what has been particularly shown and described herein above. In addition, unless mention was made above to the contrary, it should be noted that all of the accompanying drawings are not to scale. A variety of modifications and variations are possible in light of the above teachings without departing from the scope and spirit of the invention, which is limited only by the following claims.

Claims (23)

1. A storage pool component operable on a computing device including a storage medium having an otherwise free storage capacity for forming a portion of a storage capacity of a storage pool and being operably connected across a network to at least one other storage pool component, each storage pool component operating on a computing device providing a respective portion of said storage pool capacity, said storage pool component comprising:
configuration data identifying said at least one other computing device to which said computing device may connect across said network;
a directory for identifying file information for files of said storage pool stored on said storage medium, said file information being stored with a degree of redundancy across said computing devices of said storage pool;
means responsive to instantiation of said component for communicating with at least one other component operating on one of said at least one other computing devices for verifying the contents of said directory;
means for reconciling file information stored on said storage medium with file information from the remainder of said storage pool; and
a driver, responsive to an access request for a file stored in said storage pool received across said network from another component of said storage pool, for:
determining a location of said file on said storage medium from said directory;
accessing said file accordingly.
2. A component as claimed in claim 1, wherein the component further comprises a user interface component arranged to enable said configuration data to be determined.
3. A component as claimed in claim 1, wherein said access request comprises a read access and wherein said driver is arranged to return said file information to said requesting component.
4. A component as claimed in claim 1, wherein said access request comprises a write access including file information and wherein said driver is arranged to write said file information to said storage medium and to update said directory accordingly.
5. A component as claimed in claim 1, wherein said configuration data includes an identifier for said storage pool, storage size information for said storage pool, an indicator of said redundancy provision within said storage pool, and network identifiers for other components of said storage pool.
6. A component as claimed in claim 1, wherein said component is arranged to operate as a disk device driver on said computing device, said driver being arranged to receive file access requests from any applications running on said computing device and in accordance with said directory to transmit file access requests to other components of said storage pool, to process responses to said requests and to communicate the processing of said responses to said applications.
7. A component as claimed in claim 6, wherein said access request comprises a request for file information from another component of said storage pool, said file information being distributed across N+M computing devices, where N>=1 and determines the amount of storage available in said storage pool and wherein M>0 and determines said redundancy provision within said storage pool.
8. A component as claimed in claim 7, wherein said component is responsive to a file write request to split said file information into N+M clusters and to transmit file write requests to other components of said storage pool, each request including at least a respective write access request to a component.
9. A component as claimed in claim 7, wherein said component is responsive to a file write request to split said file information into clusters of a given size, to transmit file write requests to other components of said storage pool, each request including at least a respective write access request to a component and to transmit a write request including residual file information from said splitting to at least M components of said storage pool.
10. A component as claimed in claim 6, wherein said component is arranged to determine how many other components of said storage pool are accessible across said network.
11. A component as claimed in claim 10, wherein said component is responsive to less than N of N+M components being accessible to halt access to said storage pool.
12. A component as claimed in claim 10, wherein N<=M and said component is responsive to less than 50% of said components being accessible to permit only read access requests to said storage pool.
13. A component as claimed in claim 1, wherein said component is arranged to provide storage capacity for respective portions of a plurality of storage pools, said configuration data including data for each storage pool.
14. A component as claimed in claim 1, wherein said component is arranged to make said storage pool available as a disk drive.
15. A component as claimed in claim 1 wherein said file information is stored in blocks in a directory of said storage medium.
16. A component as claimed in claim 1 wherein said file information is stored as objects in a transactional database.
17. A component as claimed in claim 16, wherein said verifying means is arranged to compare transaction log entries stored on said component with transaction log entries stored on another component of said storage pool to determine if file information stored on said component is valid.
18. A component as claimed in claim 1 wherein said verifying means is arranged to compare at least one of: file name, file size, last modification date, and file attributes contained in said directory with corresponding attributes for a file stored on another component of said storage pool to determine if file information stored on said component is valid.
19. A component according to claim 1, wherein said component is arranged to periodically check the accessibility of other components forming the storage pool to said client.
20. A component according to claim 19, wherein said verifying means is arranged to communicate with the most accessible of said other components.
21. A component as claimed in claim 6, wherein said file access requests are transmitted to all accessible components of said storage pool.
22. A component according to claim 1, wherein said component is arranged to communicate with other components of said storage pool in an encrypted manner determined by
23. A system, comprising:
a plurality of computing devices having:
a storage medium;
at least one of said computing devices comprising a storage pool component, said storage pool component being operable on the computing device, the storage medium having an otherwise free storage capacity for forming a portion of a storage capacity of a storage pool and being operably connected across a network to at least one other storage pool component, each storage pool component operating on a computing device providing a respective portion of said storage pool capacity, said storage pool component comprising:
configuration data identifying said at least one other computing device to which said computing device may connect across said network;
a directory for identifying file information for files of said storage pool stored on said storage medium, said file information being stored with a degree of redundancy across said computing devices of said storage pool;
means responsive to instantiation of said component for communicating with at least one other component operating on one of said at least one other computing devices for verifying the contents of said directory;
means for reconciling file information stored on said storage medium with file information from the remainder of said storage pool; and
a driver, responsive to an access request for a file stored in said storage pool received across said network from another component of said storage pool, for:
determining a location of said file on said storage medium from said directory; and
accessing said file accordingly,
said storage pool component being arranged to make said storage pool available as a disk drive; and
said system including one or more legacy clients accessing said storage pool through a legacy disk device driver.
US12/143,134 2007-06-22 2008-06-20 Network distributed file system Abandoned US20080320097A1 (en)

Priority Applications (3)

Application Number Priority Date Filing Date Title
US13/492,633 US20130151653A1 (en) 2007-06-22 2012-06-08 Data management systems and methods
US13/492,615 US9880753B2 (en) 2007-06-22 2012-06-08 Write requests in a distributed storage system
US14/519,003 US20150067093A1 (en) 2007-06-22 2014-10-20 Network Distributed File System

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
IES20070453 2007-06-22
IES2007/0453 2007-06-22

Related Child Applications (3)

Application Number Title Priority Date Filing Date
US13/492,615 Continuation-In-Part US9880753B2 (en) 2007-06-22 2012-06-08 Write requests in a distributed storage system
US13/492,633 Continuation-In-Part US20130151653A1 (en) 2007-06-22 2012-06-08 Data management systems and methods
US14/519,003 Continuation US20150067093A1 (en) 2007-06-22 2014-10-20 Network Distributed File System

Publications (1)

Publication Number Publication Date
US20080320097A1 true US20080320097A1 (en) 2008-12-25

Family

ID=40137642

Family Applications (4)

Application Number Title Priority Date Filing Date
US12/143,134 Abandoned US20080320097A1 (en) 2007-06-22 2008-06-20 Network distributed file system
US13/492,633 Abandoned US20130151653A1 (en) 2007-06-22 2012-06-08 Data management systems and methods
US13/492,615 Expired - Fee Related US9880753B2 (en) 2007-06-22 2012-06-08 Write requests in a distributed storage system
US14/519,003 Abandoned US20150067093A1 (en) 2007-06-22 2014-10-20 Network Distributed File System

Family Applications After (3)

Application Number Title Priority Date Filing Date
US13/492,633 Abandoned US20130151653A1 (en) 2007-06-22 2012-06-08 Data management systems and methods
US13/492,615 Expired - Fee Related US9880753B2 (en) 2007-06-22 2012-06-08 Write requests in a distributed storage system
US14/519,003 Abandoned US20150067093A1 (en) 2007-06-22 2014-10-20 Network Distributed File System

Country Status (2)

Country Link
US (4) US20080320097A1 (en)
IE (1) IES20080508A2 (en)

Cited By (25)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20100274886A1 (en) * 2009-04-24 2010-10-28 Nelson Nahum Virtualized data storage in a virtualized server environment
US20100274784A1 (en) * 2009-04-24 2010-10-28 Swish Data Corporation Virtual disk from network shares and file servers
US20110246491A1 (en) * 2010-04-01 2011-10-06 Avere Systems, Inc. Method and apparatus for tiered storage
US8229961B2 (en) 2010-05-05 2012-07-24 Red Hat, Inc. Management of latency and throughput in a cluster file system
WO2013043439A1 (en) 2011-09-23 2013-03-28 Netapp, Inc. Storage area network attached clustered storage system
US20130117679A1 (en) * 2006-06-27 2013-05-09 Jared Polis Aggregation system
US9239840B1 (en) 2009-04-24 2016-01-19 Swish Data Corporation Backup media conversion via intelligent virtual appliance adapter
US9389926B2 (en) 2010-05-05 2016-07-12 Red Hat, Inc. Distributed resource contention detection
US9569457B2 (en) 2012-10-31 2017-02-14 International Business Machines Corporation Data processing method and apparatus for distributed systems
US20170346887A1 (en) * 2016-05-24 2017-11-30 International Business Machines Corporation Cooperative download among low-end devices under resource constrained environment
US9923784B2 (en) 2015-11-25 2018-03-20 International Business Machines Corporation Data transfer using flexible dynamic elastic network service provider relationships
US9923839B2 (en) 2015-11-25 2018-03-20 International Business Machines Corporation Configuring resources to exploit elastic network capability
US9923965B2 (en) 2015-06-05 2018-03-20 International Business Machines Corporation Storage mirroring over wide area network circuits with dynamic on-demand capacity
US10057327B2 (en) 2015-11-25 2018-08-21 International Business Machines Corporation Controlled transfer of data over an elastic network
US10177993B2 (en) 2015-11-25 2019-01-08 International Business Machines Corporation Event-based data transfer scheduling using elastic network optimization criteria
US10216441B2 (en) 2015-11-25 2019-02-26 International Business Machines Corporation Dynamic quality of service for storage I/O port allocation
US10248655B2 (en) 2008-07-11 2019-04-02 Avere Systems, Inc. File storage system, cache appliance, and method
US10338853B2 (en) 2008-07-11 2019-07-02 Avere Systems, Inc. Media aware distributed data layout
US10581680B2 (en) 2015-11-25 2020-03-03 International Business Machines Corporation Dynamic configuration of network features
US10614254B2 (en) * 2017-12-12 2020-04-07 John Almeida Virus immune computer system and method
US10642970B2 (en) * 2017-12-12 2020-05-05 John Almeida Virus immune computer system and method
US10929206B2 (en) * 2018-10-16 2021-02-23 Ngd Systems, Inc. System and method for outward communication in a computational storage device
CN113986124A (en) * 2021-10-25 2022-01-28 深信服科技股份有限公司 User configuration file access method, device, equipment and medium
US11418580B2 (en) * 2011-04-01 2022-08-16 Pure Storage, Inc. Selective generation of secure signatures in a distributed storage network
US20230125556A1 (en) * 2021-10-25 2023-04-27 Whitestar Communications, Inc. Secure autonomic recovery from unusable data structure via a trusted device in a secure peer-to-peer data network

Families Citing this family (189)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10152524B2 (en) * 2012-07-30 2018-12-11 Spatial Digital Systems, Inc. Wavefront muxing and demuxing for cloud data storage and transport
US11614893B2 (en) 2010-09-15 2023-03-28 Pure Storage, Inc. Optimizing storage device access based on latency
TWI592805B (en) * 2010-10-01 2017-07-21 傅冠彰 System and method for sharing network storage and computing resource
US8589640B2 (en) 2011-10-14 2013-11-19 Pure Storage, Inc. Method for maintaining multiple fingerprint tables in a deduplicating storage system
CA2867302A1 (en) * 2012-03-14 2013-09-19 Convergent .Io Technologies Inc. Systems, methods and devices for management of virtual memory systems
US8706935B1 (en) * 2012-04-06 2014-04-22 Datacore Software Corporation Data consolidation using a common portion accessible by multiple devices
WO2013186061A1 (en) * 2012-06-15 2013-12-19 Alcatel Lucent Architecture of privacy protection system for recommendation services
US9678978B2 (en) 2012-12-31 2017-06-13 Carbonite, Inc. Systems and methods for automatic synchronization of recently modified data
US9418181B2 (en) * 2013-01-09 2016-08-16 Apple Inc. Simulated input/output devices
US20140337296A1 (en) * 2013-05-10 2014-11-13 Bryan Knight Techniques to recover files in a storage network
KR102170720B1 (en) * 2013-10-30 2020-10-27 삼성에스디에스 주식회사 Apparatus and Method for Changing Status of Clustered Nodes, and recording medium recording the program thereof
US9507674B2 (en) * 2013-11-22 2016-11-29 Netapp, Inc. Methods for preserving state across a failure and devices thereof
CN103780436B (en) * 2014-02-20 2018-06-08 中磊电子(苏州)有限公司 The relative connection keeping method of network equipment
CN106104502B (en) * 2014-03-20 2019-03-22 慧与发展有限责任合伙企业 System, method and medium for storage system affairs
US11531495B2 (en) * 2014-04-21 2022-12-20 David Lane Smith Distributed storage system for long term data storage
US9836234B2 (en) 2014-06-04 2017-12-05 Pure Storage, Inc. Storage cluster
US9003144B1 (en) 2014-06-04 2015-04-07 Pure Storage, Inc. Mechanism for persisting messages in a storage system
US10574754B1 (en) 2014-06-04 2020-02-25 Pure Storage, Inc. Multi-chassis array with multi-level load balancing
US11068363B1 (en) 2014-06-04 2021-07-20 Pure Storage, Inc. Proactively rebuilding data in a storage cluster
US11960371B2 (en) 2014-06-04 2024-04-16 Pure Storage, Inc. Message persistence in a zoned system
US9218244B1 (en) 2014-06-04 2015-12-22 Pure Storage, Inc. Rebuilding data across storage nodes
US11399063B2 (en) 2014-06-04 2022-07-26 Pure Storage, Inc. Network authentication for a storage system
US8850108B1 (en) 2014-06-04 2014-09-30 Pure Storage, Inc. Storage cluster
US9213485B1 (en) 2014-06-04 2015-12-15 Pure Storage, Inc. Storage system architecture
US11652884B2 (en) 2014-06-04 2023-05-16 Pure Storage, Inc. Customized hash algorithms
US9367243B1 (en) 2014-06-04 2016-06-14 Pure Storage, Inc. Scalable non-uniform storage sizes
US11886308B2 (en) 2014-07-02 2024-01-30 Pure Storage, Inc. Dual class of service for unified file and object messaging
US10114757B2 (en) 2014-07-02 2018-10-30 Pure Storage, Inc. Nonrepeating identifiers in an address space of a non-volatile solid-state storage
US9021297B1 (en) 2014-07-02 2015-04-28 Pure Storage, Inc. Redundant, fault-tolerant, distributed remote procedure call cache in a storage system
US9836245B2 (en) 2014-07-02 2017-12-05 Pure Storage, Inc. Non-volatile RAM and flash memory in a non-volatile solid-state storage
US11604598B2 (en) 2014-07-02 2023-03-14 Pure Storage, Inc. Storage cluster with zoned drives
US8868825B1 (en) 2014-07-02 2014-10-21 Pure Storage, Inc. Nonrepeating identifiers in an address space of a non-volatile solid-state storage
US9747229B1 (en) 2014-07-03 2017-08-29 Pure Storage, Inc. Self-describing data format for DMA in a non-volatile solid-state storage
US10853311B1 (en) 2014-07-03 2020-12-01 Pure Storage, Inc. Administration through files in a storage system
US8874836B1 (en) 2014-07-03 2014-10-28 Pure Storage, Inc. Scheduling policy for queues in a non-volatile solid-state storage
US9811677B2 (en) 2014-07-03 2017-11-07 Pure Storage, Inc. Secure data replication in a storage grid
US9632890B2 (en) * 2014-07-08 2017-04-25 Netapp, Inc. Facilitating N-way high availability storage services
US9082512B1 (en) 2014-08-07 2015-07-14 Pure Storage, Inc. Die-level monitoring in a storage cluster
US10983859B2 (en) 2014-08-07 2021-04-20 Pure Storage, Inc. Adjustable error correction based on memory health in a storage unit
US9558069B2 (en) 2014-08-07 2017-01-31 Pure Storage, Inc. Failure mapping in a storage array
US9495255B2 (en) 2014-08-07 2016-11-15 Pure Storage, Inc. Error recovery in a storage cluster
US9483346B2 (en) 2014-08-07 2016-11-01 Pure Storage, Inc. Data rebuild on feedback from a queue in a non-volatile solid-state storage
US9766972B2 (en) 2014-08-07 2017-09-19 Pure Storage, Inc. Masking defective bits in a storage array
US10079711B1 (en) 2014-08-20 2018-09-18 Pure Storage, Inc. Virtual file server with preserved MAC address
US10146475B2 (en) * 2014-09-09 2018-12-04 Toshiba Memory Corporation Memory device performing control of discarding packet
US10409769B1 (en) * 2014-09-29 2019-09-10 EMC IP Holding Company LLC Data archiving in data storage system environments
US20160125059A1 (en) 2014-11-04 2016-05-05 Rubrik, Inc. Hybrid cloud data management system
WO2016079786A1 (en) * 2014-11-17 2016-05-26 株式会社日立製作所 Computer system and data processing method
US10469580B2 (en) * 2014-12-12 2019-11-05 International Business Machines Corporation Clientless software defined grid
US10554749B2 (en) 2014-12-12 2020-02-04 International Business Machines Corporation Clientless software defined grid
US9921910B2 (en) 2015-02-19 2018-03-20 Netapp, Inc. Virtual chunk service based data recovery in a distributed data storage system
US9948615B1 (en) 2015-03-16 2018-04-17 Pure Storage, Inc. Increased storage unit encryption based on loss of trust
US11294893B2 (en) 2015-03-20 2022-04-05 Pure Storage, Inc. Aggregation of queries
US9940234B2 (en) 2015-03-26 2018-04-10 Pure Storage, Inc. Aggressive data deduplication using lazy garbage collection
US10082985B2 (en) * 2015-03-27 2018-09-25 Pure Storage, Inc. Data striping across storage nodes that are assigned to multiple logical arrays
US10178169B2 (en) 2015-04-09 2019-01-08 Pure Storage, Inc. Point to point based backend communication layer for storage processing
US9672125B2 (en) 2015-04-10 2017-06-06 Pure Storage, Inc. Ability to partition an array into two or more logical arrays with independently running software
US10140149B1 (en) 2015-05-19 2018-11-27 Pure Storage, Inc. Transactional commits with hardware assists in remote memory
US9817576B2 (en) 2015-05-27 2017-11-14 Pure Storage, Inc. Parallel update to NVRAM
US10977128B1 (en) 2015-06-16 2021-04-13 Amazon Technologies, Inc. Adaptive data loss mitigation for redundancy coding systems
US10846275B2 (en) 2015-06-26 2020-11-24 Pure Storage, Inc. Key management in a storage device
US10394762B1 (en) 2015-07-01 2019-08-27 Amazon Technologies, Inc. Determining data redundancy in grid encoded data storage systems
US10983732B2 (en) 2015-07-13 2021-04-20 Pure Storage, Inc. Method and system for accessing a file
US11232079B2 (en) 2015-07-16 2022-01-25 Pure Storage, Inc. Efficient distribution of large directories
WO2017039580A1 (en) * 2015-08-28 2017-03-09 Hewlett Packard Enterprise Development Lp Collision handling during an asynchronous replication
US10108355B2 (en) 2015-09-01 2018-10-23 Pure Storage, Inc. Erase block state detection
US11341136B2 (en) 2015-09-04 2022-05-24 Pure Storage, Inc. Dynamically resizable structures for approximate membership queries
US11386060B1 (en) 2015-09-23 2022-07-12 Amazon Technologies, Inc. Techniques for verifiably processing data in distributed computing systems
US9768953B2 (en) 2015-09-30 2017-09-19 Pure Storage, Inc. Resharing of a split secret
US10762069B2 (en) 2015-09-30 2020-09-01 Pure Storage, Inc. Mechanism for a system where data and metadata are located closely together
US10853266B2 (en) 2015-09-30 2020-12-01 Pure Storage, Inc. Hardware assisted data lookup methods
US9843453B2 (en) 2015-10-23 2017-12-12 Pure Storage, Inc. Authorizing I/O commands with I/O tokens
US10394789B1 (en) 2015-12-07 2019-08-27 Amazon Technologies, Inc. Techniques and systems for scalable request handling in data processing systems
US10642813B1 (en) 2015-12-14 2020-05-05 Amazon Technologies, Inc. Techniques and systems for storage and processing of operational data
US10324790B1 (en) * 2015-12-17 2019-06-18 Amazon Technologies, Inc. Flexible data storage device mapping for data storage systems
US10007457B2 (en) 2015-12-22 2018-06-26 Pure Storage, Inc. Distributed transactions with token-associated execution
US20170262191A1 (en) * 2016-03-08 2017-09-14 Netapp, Inc. Reducing write tail latency in storage systems
US10592336B1 (en) 2016-03-24 2020-03-17 Amazon Technologies, Inc. Layered indexing for asynchronous retrieval of redundancy coded data
US10061668B1 (en) 2016-03-28 2018-08-28 Amazon Technologies, Inc. Local storage clustering for redundancy coded data storage system
US10678664B1 (en) 2016-03-28 2020-06-09 Amazon Technologies, Inc. Hybridized storage operation for redundancy coded data storage systems
KR101758558B1 (en) * 2016-03-29 2017-07-26 엘에스산전 주식회사 Energy managemnet server and energy managemnet system having thereof
US10496607B2 (en) * 2016-04-01 2019-12-03 Tuxera Inc. Systems and methods for enabling modifications of multiple data objects within a file system volume
US10261690B1 (en) 2016-05-03 2019-04-16 Pure Storage, Inc. Systems and methods for operating a storage system
US10122795B2 (en) * 2016-05-31 2018-11-06 International Business Machines Corporation Consistency level driven data storage in a dispersed storage network
US11861188B2 (en) 2016-07-19 2024-01-02 Pure Storage, Inc. System having modular accelerators
US9672905B1 (en) 2016-07-22 2017-06-06 Pure Storage, Inc. Optimize data protection layouts based on distributed flash wear leveling
US10768819B2 (en) 2016-07-22 2020-09-08 Pure Storage, Inc. Hardware support for non-disruptive upgrades
US11449232B1 (en) 2016-07-22 2022-09-20 Pure Storage, Inc. Optimal scheduling of flash operations
US11080155B2 (en) 2016-07-24 2021-08-03 Pure Storage, Inc. Identifying error types among flash memory
US10216420B1 (en) 2016-07-24 2019-02-26 Pure Storage, Inc. Calibration of flash channels in SSD
US11604690B2 (en) 2016-07-24 2023-03-14 Pure Storage, Inc. Online failure span determination
US11797212B2 (en) 2016-07-26 2023-10-24 Pure Storage, Inc. Data migration for zoned drives
US10366004B2 (en) 2016-07-26 2019-07-30 Pure Storage, Inc. Storage system with elective garbage collection to reduce flash contention
US10203903B2 (en) 2016-07-26 2019-02-12 Pure Storage, Inc. Geometry based, space aware shelf/writegroup evacuation
US11734169B2 (en) 2016-07-26 2023-08-22 Pure Storage, Inc. Optimizing spool and memory space management
US11886334B2 (en) 2016-07-26 2024-01-30 Pure Storage, Inc. Optimizing spool and memory space management
US11422719B2 (en) 2016-09-15 2022-08-23 Pure Storage, Inc. Distributed file deletion and truncation
US11137980B1 (en) 2016-09-27 2021-10-05 Amazon Technologies, Inc. Monotonic time-based data storage
US11281624B1 (en) 2016-09-28 2022-03-22 Amazon Technologies, Inc. Client-based batching of data payload
US11204895B1 (en) 2016-09-28 2021-12-21 Amazon Technologies, Inc. Data payload clustering for data storage systems
US10810157B1 (en) 2016-09-28 2020-10-20 Amazon Technologies, Inc. Command aggregation for data storage operations
US10437790B1 (en) 2016-09-28 2019-10-08 Amazon Technologies, Inc. Contextual optimization for data storage systems
US10496327B1 (en) 2016-09-28 2019-12-03 Amazon Technologies, Inc. Command parallelization for data storage systems
US10657097B1 (en) 2016-09-28 2020-05-19 Amazon Technologies, Inc. Data payload aggregation for data storage systems
US10614239B2 (en) 2016-09-30 2020-04-07 Amazon Technologies, Inc. Immutable cryptographically secured ledger-backed databases
US9747039B1 (en) 2016-10-04 2017-08-29 Pure Storage, Inc. Reservations over multiple paths on NVMe over fabrics
US10756816B1 (en) 2016-10-04 2020-08-25 Pure Storage, Inc. Optimized fibre channel and non-volatile memory express access
US11269888B1 (en) 2016-11-28 2022-03-08 Amazon Technologies, Inc. Archival data storage for structured data
US11550481B2 (en) 2016-12-19 2023-01-10 Pure Storage, Inc. Efficiently writing data in a zoned drive storage system
US11307998B2 (en) 2017-01-09 2022-04-19 Pure Storage, Inc. Storage efficiency of encrypted host system data
US9747158B1 (en) 2017-01-13 2017-08-29 Pure Storage, Inc. Intelligent refresh of 3D NAND
US11955187B2 (en) 2017-01-13 2024-04-09 Pure Storage, Inc. Refresh of differing capacity NAND
US10979223B2 (en) 2017-01-31 2021-04-13 Pure Storage, Inc. Separate encryption for a solid-state drive
US10528488B1 (en) 2017-03-30 2020-01-07 Pure Storage, Inc. Efficient name coding
US11016667B1 (en) 2017-04-05 2021-05-25 Pure Storage, Inc. Efficient mapping for LUNs in storage memory with holes in address space
US10944671B2 (en) 2017-04-27 2021-03-09 Pure Storage, Inc. Efficient data forwarding in a networked device
US10141050B1 (en) 2017-04-27 2018-11-27 Pure Storage, Inc. Page writes for triple level cell flash memory
US10516645B1 (en) 2017-04-27 2019-12-24 Pure Storage, Inc. Address resolution broadcasting in a networked device
US11222076B2 (en) * 2017-05-31 2022-01-11 Microsoft Technology Licensing, Llc Data set state visualization comparison lock
US11128740B2 (en) 2017-05-31 2021-09-21 Fmad Engineering Kabushiki Gaisha High-speed data packet generator
US10990326B2 (en) 2017-05-31 2021-04-27 Fmad Engineering Kabushiki Gaisha High-speed replay of captured data packets
US11392317B2 (en) 2017-05-31 2022-07-19 Fmad Engineering Kabushiki Gaisha High speed data packet flow processing
US10423358B1 (en) 2017-05-31 2019-09-24 FMAD Engineering GK High-speed data packet capture and storage with playback capabilities
US11036438B2 (en) 2017-05-31 2021-06-15 Fmad Engineering Kabushiki Gaisha Efficient storage architecture for high speed packet capture
US11467913B1 (en) 2017-06-07 2022-10-11 Pure Storage, Inc. Snapshots with crash consistency in a storage system
US11947814B2 (en) 2017-06-11 2024-04-02 Pure Storage, Inc. Optimizing resiliency group formation stability
US11782625B2 (en) 2017-06-11 2023-10-10 Pure Storage, Inc. Heterogeneity supportive resiliency groups
US11138103B1 (en) 2017-06-11 2021-10-05 Pure Storage, Inc. Resiliency groups
US10425473B1 (en) 2017-07-03 2019-09-24 Pure Storage, Inc. Stateful connection reset in a storage cluster with a stateless load balancer
US10402266B1 (en) 2017-07-31 2019-09-03 Pure Storage, Inc. Redundant array of independent disks in a direct-mapped flash storage system
US10877827B2 (en) 2017-09-15 2020-12-29 Pure Storage, Inc. Read voltage optimization
US10210926B1 (en) 2017-09-15 2019-02-19 Pure Storage, Inc. Tracking of optimum read voltage thresholds in nand flash devices
US10754557B2 (en) * 2017-09-26 2020-08-25 Seagate Technology Llc Data storage system with asynchronous data replication
US11221920B2 (en) 2017-10-10 2022-01-11 Rubrik, Inc. Incremental file system backup with adaptive fingerprinting
US10884919B2 (en) 2017-10-31 2021-01-05 Pure Storage, Inc. Memory management in a storage system
US10545687B1 (en) 2017-10-31 2020-01-28 Pure Storage, Inc. Data rebuild when changing erase block sizes during drive replacement
US10515701B1 (en) 2017-10-31 2019-12-24 Pure Storage, Inc. Overlapping raid groups
US11024390B1 (en) 2017-10-31 2021-06-01 Pure Storage, Inc. Overlapping RAID groups
US10496330B1 (en) 2017-10-31 2019-12-03 Pure Storage, Inc. Using flash storage devices with different sized erase blocks
US10860475B1 (en) 2017-11-17 2020-12-08 Pure Storage, Inc. Hybrid flash translation layer
US10990566B1 (en) 2017-11-20 2021-04-27 Pure Storage, Inc. Persistent file locks in a storage system
US11372729B2 (en) 2017-11-29 2022-06-28 Rubrik, Inc. In-place cloud instance restore
US10719265B1 (en) 2017-12-08 2020-07-21 Pure Storage, Inc. Centralized, quorum-aware handling of device reservation requests in a storage system
US10929053B2 (en) 2017-12-08 2021-02-23 Pure Storage, Inc. Safe destructive actions on drives
US10929031B2 (en) 2017-12-21 2021-02-23 Pure Storage, Inc. Maximizing data reduction in a partially encrypted volume
US10976948B1 (en) 2018-01-31 2021-04-13 Pure Storage, Inc. Cluster expansion mechanism
US10733053B1 (en) 2018-01-31 2020-08-04 Pure Storage, Inc. Disaster recovery for high-bandwidth distributed archives
US10467527B1 (en) 2018-01-31 2019-11-05 Pure Storage, Inc. Method and apparatus for artificial intelligence acceleration
US11036596B1 (en) 2018-02-18 2021-06-15 Pure Storage, Inc. System for delaying acknowledgements on open NAND locations until durability has been confirmed
US11494109B1 (en) 2018-02-22 2022-11-08 Pure Storage, Inc. Erase block trimming for heterogenous flash memory storage devices
US10931450B1 (en) 2018-04-27 2021-02-23 Pure Storage, Inc. Distributed, lock-free 2-phase commit of secret shares using multiple stateless controllers
US11385792B2 (en) 2018-04-27 2022-07-12 Pure Storage, Inc. High availability controller pair transitioning
US10853146B1 (en) 2018-04-27 2020-12-01 Pure Storage, Inc. Efficient data forwarding in a networked device
US10768844B2 (en) * 2018-05-15 2020-09-08 International Business Machines Corporation Internal striping inside a single device
US11436023B2 (en) 2018-05-31 2022-09-06 Pure Storage, Inc. Mechanism for updating host file system and flash translation layer based on underlying NAND technology
FR3083041B1 (en) * 2018-06-21 2023-05-12 Amadeus Sas SYNCHRONIZATION OF INTER-DEVICE DISPLAYS
US11438279B2 (en) 2018-07-23 2022-09-06 Pure Storage, Inc. Non-disruptive conversion of a clustered service from single-chassis to multi-chassis
US11520514B2 (en) 2018-09-06 2022-12-06 Pure Storage, Inc. Optimized relocation of data based on data characteristics
US11500570B2 (en) 2018-09-06 2022-11-15 Pure Storage, Inc. Efficient relocation of data utilizing different programming modes
US11354058B2 (en) 2018-09-06 2022-06-07 Pure Storage, Inc. Local relocation of data stored at a storage device of a storage system
US11868309B2 (en) 2018-09-06 2024-01-09 Pure Storage, Inc. Queue management for data relocation
US10454498B1 (en) 2018-10-18 2019-10-22 Pure Storage, Inc. Fully pipelined hardware engine design for fast and efficient inline lossless data compression
US10976947B2 (en) 2018-10-26 2021-04-13 Pure Storage, Inc. Dynamically selecting segment heights in a heterogeneous RAID group
US11514000B2 (en) * 2018-12-24 2022-11-29 Cloudbrink, Inc. Data mesh parallel file system replication
US11334254B2 (en) 2019-03-29 2022-05-17 Pure Storage, Inc. Reliability based flash page sizing
US11775189B2 (en) 2019-04-03 2023-10-03 Pure Storage, Inc. Segment level heterogeneity
US11099986B2 (en) 2019-04-12 2021-08-24 Pure Storage, Inc. Efficient transfer of memory contents
US11086821B2 (en) * 2019-06-11 2021-08-10 Dell Products L.P. Identifying file exclusions for write filter overlays
US11714572B2 (en) 2019-06-19 2023-08-01 Pure Storage, Inc. Optimized data resiliency in a modular storage system
US11281394B2 (en) 2019-06-24 2022-03-22 Pure Storage, Inc. Replication across partitioning schemes in a distributed storage system
US11893126B2 (en) 2019-10-14 2024-02-06 Pure Storage, Inc. Data deletion for a multi-tenant environment
US11416144B2 (en) 2019-12-12 2022-08-16 Pure Storage, Inc. Dynamic use of segment or zone power loss protection in a flash device
US11847331B2 (en) 2019-12-12 2023-12-19 Pure Storage, Inc. Budgeting open blocks of a storage unit based on power loss prevention
US11704192B2 (en) 2019-12-12 2023-07-18 Pure Storage, Inc. Budgeting open blocks based on power loss protection
US11188432B2 (en) 2020-02-28 2021-11-30 Pure Storage, Inc. Data resiliency by partially deallocating data blocks of a storage device
US11507297B2 (en) 2020-04-15 2022-11-22 Pure Storage, Inc. Efficient management of optimal read levels for flash storage systems
US11256587B2 (en) 2020-04-17 2022-02-22 Pure Storage, Inc. Intelligent access to a storage device
US11416338B2 (en) 2020-04-24 2022-08-16 Pure Storage, Inc. Resiliency scheme to enhance storage performance
US11474986B2 (en) 2020-04-24 2022-10-18 Pure Storage, Inc. Utilizing machine learning to streamline telemetry processing of storage media
US11768763B2 (en) 2020-07-08 2023-09-26 Pure Storage, Inc. Flash secure erase
US11513974B2 (en) 2020-09-08 2022-11-29 Pure Storage, Inc. Using nonce to control erasure of data blocks of a multi-controller storage system
US11681448B2 (en) 2020-09-08 2023-06-20 Pure Storage, Inc. Multiple device IDs in a multi-fabric module storage system
US11487455B2 (en) 2020-12-17 2022-11-01 Pure Storage, Inc. Dynamic block allocation to optimize storage system performance
US11847324B2 (en) 2020-12-31 2023-12-19 Pure Storage, Inc. Optimizing resiliency groups for data regions of a storage system
US11614880B2 (en) 2020-12-31 2023-03-28 Pure Storage, Inc. Storage system with selectable write paths
US11630593B2 (en) 2021-03-12 2023-04-18 Pure Storage, Inc. Inline flash memory qualification in a storage system
US11507597B2 (en) 2021-03-31 2022-11-22 Pure Storage, Inc. Data replication to meet a recovery point objective
US11832410B2 (en) 2021-09-14 2023-11-28 Pure Storage, Inc. Mechanical energy absorbing bracket apparatus
US11438224B1 (en) 2022-01-14 2022-09-06 Bank Of America Corporation Systems and methods for synchronizing configurations across multiple computing clusters


Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1290287C (en) * 2000-04-17 2006-12-13 北方电讯网络有限公司 Cooperation of ARQ protocols at physical and link layers for wireless communications
US7042869B1 (en) * 2000-09-01 2006-05-09 Qualcomm, Inc. Method and apparatus for gated ACK/NAK channel in a communication system
US8090880B2 (en) * 2006-11-09 2012-01-03 Microsoft Corporation Data consistency within a federation infrastructure
US7363444B2 (en) * 2005-01-10 2008-04-22 Hewlett-Packard Development Company, L.P. Method for taking snapshots of data

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5931935A (en) * 1997-04-15 1999-08-03 Microsoft Corporation File system primitive allowing reprocessing of I/O requests by multiple drivers in a layered driver I/O system
US20050015461A1 (en) * 2003-07-17 2005-01-20 Bruno Richard Distributed file system
US20070067332A1 (en) * 2005-03-14 2007-03-22 Gridiron Software, Inc. Distributed, secure digital file storage and retrieval
US20070079082A1 (en) * 2005-09-30 2007-04-05 Gladwin S C System for rebuilding dispersed data
US20070276966A1 (en) * 2006-05-28 2007-11-29 Vipul Paul Managing a device in a distributed file system, using plug & play

Cited By (41)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20130117679A1 (en) * 2006-06-27 2013-05-09 Jared Polis Aggregation system
US10338853B2 (en) 2008-07-11 2019-07-02 Avere Systems, Inc. Media aware distributed data layout
US10769108B2 (en) 2008-07-11 2020-09-08 Microsoft Technology Licensing, Llc File storage system, cache appliance, and method
US20160261694A1 (en) * 2008-07-11 2016-09-08 Avere Systems, Inc. Method and Apparatus for Tiered Storage
US10248655B2 (en) 2008-07-11 2019-04-02 Avere Systems, Inc. File storage system, cache appliance, and method
US9239840B1 (en) 2009-04-24 2016-01-19 Swish Data Corporation Backup media conversion via intelligent virtual appliance adapter
US20100274784A1 (en) * 2009-04-24 2010-10-28 Swish Data Corporation Virtual disk from network shares and file servers
US20100274886A1 (en) * 2009-04-24 2010-10-28 Nelson Nahum Virtualized data storage in a virtualized server environment
US9087066B2 (en) * 2009-04-24 2015-07-21 Swish Data Corporation Virtual disk from network shares and file servers
US9342528B2 (en) * 2010-04-01 2016-05-17 Avere Systems, Inc. Method and apparatus for tiered storage
US20110246491A1 (en) * 2010-04-01 2011-10-06 Avere Systems, Inc. Method and apparatus for tiered storage
US9389926B2 (en) 2010-05-05 2016-07-12 Red Hat, Inc. Distributed resource contention detection
US9870369B2 (en) 2010-05-05 2018-01-16 Red Hat, Inc. Distributed resource contention detection and handling
US8229961B2 (en) 2010-05-05 2012-07-24 Red Hat, Inc. Management of latency and throughput in a cluster file system
US11418580B2 (en) * 2011-04-01 2022-08-16 Pure Storage, Inc. Selective generation of secure signatures in a distributed storage network
EP2758888A4 (en) * 2011-09-23 2015-01-21 Netapp Inc Storage area network attached clustered storage system
EP2758888A1 (en) * 2011-09-23 2014-07-30 NetApp, Inc. Storage area network attached clustered storage system
US11818212B2 (en) 2011-09-23 2023-11-14 Netapp, Inc. Storage area network attached clustered storage system
US10862966B2 (en) 2011-09-23 2020-12-08 Netapp Inc. Storage area network attached clustered storage system
WO2013043439A1 (en) 2011-09-23 2013-03-28 Netapp, Inc. Storage area network attached clustered storage system
US9569457B2 (en) 2012-10-31 2017-02-14 International Business Machines Corporation Data processing method and apparatus for distributed systems
US9923965B2 (en) 2015-06-05 2018-03-20 International Business Machines Corporation Storage mirroring over wide area network circuits with dynamic on-demand capacity
US10057327B2 (en) 2015-11-25 2018-08-21 International Business Machines Corporation Controlled transfer of data over an elastic network
US10216441B2 (en) 2015-11-25 2019-02-26 International Business Machines Corporation Dynamic quality of service for storage I/O port allocation
US10177993B2 (en) 2015-11-25 2019-01-08 International Business Machines Corporation Event-based data transfer scheduling using elastic network optimization criteria
US9923784B2 (en) 2015-11-25 2018-03-20 International Business Machines Corporation Data transfer using flexible dynamic elastic network service provider relationships
US10581680B2 (en) 2015-11-25 2020-03-03 International Business Machines Corporation Dynamic configuration of network features
US10608952B2 (en) 2015-11-25 2020-03-31 International Business Machines Corporation Configuring resources to exploit elastic network capability
US9923839B2 (en) 2015-11-25 2018-03-20 International Business Machines Corporation Configuring resources to exploit elastic network capability
US9961139B2 (en) * 2016-05-24 2018-05-01 International Business Machines Corporation Cooperative download among low-end devices under resource constrained environment
US10652324B2 (en) * 2016-05-24 2020-05-12 International Business Machines Corporation Cooperative download among low-end devices under resource constrained environment
US20180248938A1 (en) * 2016-05-24 2018-08-30 International Business Machines Corporation Cooperative download among low-end devices under resource constrained environment
US20170346887A1 (en) * 2016-05-24 2017-11-30 International Business Machines Corporation Cooperative download among low-end devices under resource constrained environment
US10642970B2 (en) * 2017-12-12 2020-05-05 John Almeida Virus immune computer system and method
US10817623B2 (en) * 2017-12-12 2020-10-27 John Almeida Virus immune computer system and method
US10614254B2 (en) * 2017-12-12 2020-04-07 John Almeida Virus immune computer system and method
US10970421B2 (en) * 2017-12-12 2021-04-06 John Almeida Virus immune computer system and method
US11132438B2 (en) * 2017-12-12 2021-09-28 Atense, Inc. Virus immune computer system and method
US10929206B2 (en) * 2018-10-16 2021-02-23 Ngd Systems, Inc. System and method for outward communication in a computational storage device
CN113986124A (en) * 2021-10-25 2022-01-28 深信服科技股份有限公司 User configuration file access method, device, equipment and medium
US20230125556A1 (en) * 2021-10-25 2023-04-27 Whitestar Communications, Inc. Secure autonomic recovery from unusable data structure via a trusted device in a secure peer-to-peer data network

Also Published As

Publication number Publication date
US20150067093A1 (en) 2015-03-05
US9880753B2 (en) 2018-01-30
IES20080508A2 (en) 2008-12-10
US20130145105A1 (en) 2013-06-06
US20130151653A1 (en) 2013-06-13

Similar Documents

Publication Publication Date Title
US20150067093A1 (en) Network Distributed File System
JP6132980B2 (en) Decentralized distributed computing system
US20230031079A1 (en) Resynchronization to a synchronous replication relationship
US10289338B2 (en) Multi-class heterogeneous clients in a filesystem
US9442952B2 (en) Metadata structures and related locking techniques to improve performance and scalability in a cluster file system
US9852151B1 (en) Network system to distribute chunks across multiple physical nodes with disk support for object storage
JP4448719B2 (en) Storage system
US20220091771A1 (en) Moving Data Between Tiers In A Multi-Tiered, Cloud-Based Storage System
US9697226B1 (en) Network system to distribute chunks across multiple physical nodes
US8001079B2 (en) System and method for system state replication
US10585599B2 (en) System and method for distributed persistent store archival and retrieval in a distributed computing environment
US20050044162A1 (en) Multi-protocol sharable virtual storage objects
JP2005502096A (en) File switch and exchange file system
IES85057Y1 (en) Network distributed file system
US11221928B2 (en) Methods for cache rewarming in a failover domain and devices thereof
Singhal et al. STUDY ON DIFFERENT TYPE OF DISTRIBUTED FILE SYSTEMS

Legal Events

Date Code Title Description
AS Assignment

Owner name: TENOWARE R&D LIMITED, IRELAND

Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:SAWICKI, ANTONI;NOWAK, TOMASZ;REEL/FRAME:021128/0427;SIGNING DATES FROM 20080618 TO 20080619

STCB Information on status: application discontinuation

Free format text: ABANDONED -- AFTER EXAMINER'S ANSWER OR BOARD OF APPEALS DECISION