CN1567237A - Method for constructing high-available distributed memory system - Google Patents

Method for constructing high-available distributed memory system

Info

Publication number: CN1567237A
Application number: CN 03112402
Authority: CN (China)
Other languages: Chinese (zh)
Other versions: CN1326045C (en)
Inventors: 张虎, 尹宏伟, 王恩东, 伍卫国, 董小社, 钱德沛, 庄文君
Current assignee: Inspur Electronic Information Industry Co Ltd
Original assignee: Langchao Electronic Information Industry Co Ltd
Application filed by Langchao Electronic Information Industry Co Ltd
Priority: CNB03112402XA, granted as CN1326045C
Legal status: Granted; Expired - Fee Related
Prior art keywords: node, data, read, write, adjacency

Abstract

This invention is a method for constructing a high-availability distributed storage system based on parallel file systems and distributed file systems. The data storage nodes of the distributed storage system are organized into a mirror vector ring, a network identity is set on each storage node in the ring, and the data of each node is copied to its adjacent nodes by an adjacency-replication technique. When a node fails, when nodes are added or removed, or when the availability level changes, different client read-write mechanisms ensure the high availability, scalability, and dynamic configurability of the system. The invention needs no special-purpose hardware support and is well suited to inexpensive cluster systems.

Description

Method for constructing a high-availability distributed storage system
1. Technical field
The present invention relates to the field of applied computer technology. It is a mechanism for improving the availability of distributed computer storage systems, specifically a high-availability mechanism for distributed storage systems built on parallel file systems and distributed file systems.
2. Technical background
A high-availability system is a system that allows the computer to keep working even when software or hardware faults occur. In the prior art this is realized by replicating system files: if some files become unavailable, their backup copies take over. A high-availability system usually consists of two or more nodes; these nodes are connected to clients directly or indirectly over a network, and each node either has its own local storage or shares a common storage unit.
High availability is usually realized either with shared storage or with distributed storage. In a shared-storage system, the nodes share the same data storage unit, and data consistency is maintained by the lock mechanism of the file system. The most common design is master-slave: the system contains a master node and a backup node with similar hardware and identical software and configuration, communicating with each other and with clients over the network and sharing the same data storage unit. The backup node monitors the master through a heartbeat mechanism; if the master can no longer keep the application running because of a hardware or software failure, the backup takes over its resources and its work, so that the application continues to run.
In a distributed storage system the nodes do not share a common data storage unit; high availability of the data is achieved by adding extra storage units and establishing a new data-backup scheme. Setting up an efficient, well-founded data-backup scheme is therefore particularly important in distributed storage systems.
3. Summary of the invention
The present invention aims to provide high-availability guarantees for parallel file systems and distributed file systems running on a distributed storage system. The storage system consists of two or more data storage nodes, each with its own local storage unit. Every data storage node is connected to the clients by a direct or indirect physical path; physical paths between storage nodes may or may not exist.
First, the data storage nodes of the system are arranged, in a prescribed order, into a mirror vector ring, and each storage node is given a network identity within the ring. The rules for writing and copying data differ with the file system and are carried out by the corresponding client access mechanism. If the system runs a parallel file system, the data of a file can be divided into a sequence of stripes that are stored, in turn, in the order defined by the ring's network identities. Under a distributed file system, the ring is a virtual logical vector ring used only to determine the destination nodes of data replication.
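As a concrete sketch of the parallel-file-system layout (our own illustration, not part of the patent; the function name is hypothetical), stripe i of a file lands on ring position i mod N:

```python
def stripe_layout(num_stripes, ring):
    """Round-robin placement of a file's stripes on the mirror vector
    ring, in the order defined by the ring's network identities."""
    return [ring[i % len(ring)] for i in range(num_stripes)]

# On the six-node ring 1-2-3-4-5-6, stripes 0..7 land on 1,2,3,4,5,6,1,2.
ring = [1, 2, 3, 4, 5, 6]
```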
The degree of availability of the system is determined by the availability level. Suppose that each node in the storage system fails within a time interval T with probability P, that the system contains N nodes, and that the adjacency-replication distance is m. The system fails if and only if m+1 or more adjacent nodes fail simultaneously. We define the availability level as the probability A that the system remains accessible throughout T, computed as:
A = 1 - C_N^1 P^{m+1}(1-P)^{N-m-1} - C_N^1 C_{N-m-1}^1 P^{m+2}(1-P)^{N-m-2} - \cdots - C_N^1 C_{N-m-1}^{N-m-2} P^{N-1}(1-P) - P^N
Suppose the required availability level of the distributed storage system is at least P_1. The minimum adjacency-replication distance m satisfying the requirement can then be determined from the inequality:
P_1 \le 1 - C_N^1 P^{m+1}(1-P)^{N-m-1} - C_N^1 C_{N-m-1}^1 P^{m+2}(1-P)^{N-m-2} - \cdots - C_N^1 C_{N-m-1}^{N-m-2} P^{N-1}(1-P) - P^N
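Since the printed formula is garbled in the source, the availability level can also be computed directly from the failure rule stated above (the system fails iff some m+1 consecutive ring nodes all fail). The brute-force sketch below is our own illustration under that reading, practical only for small N:

```python
from itertools import product

def availability(N, P, m):
    """Probability that no m+1 consecutive nodes on an N-node ring all
    fail within T, each node failing independently with probability P."""
    total = 0.0
    for failed in product((0, 1), repeat=N):
        # System fails iff some circular run of m+1 nodes all failed.
        system_fails = any(
            all(failed[(i + j) % N] for j in range(m + 1))
            for i in range(N))
        if not system_fails:
            prob = 1.0
            for f in failed:
                prob *= P if f else 1.0 - P
            total += prob
    return total

def min_adjacency_distance(N, P, target):
    """Smallest adjacency-replication distance m whose availability
    level reaches `target`, or None if even m = N-1 is not enough."""
    for m in range(1, N):
        if availability(N, P, m) >= target:
            return m
    return None
```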
Operation and maintenance of the system are carried out by the management node of the file system. The management node is mainly responsible for recovering failed nodes, adding nodes to or removing nodes from the system, and changing the system availability level; it does not itself provide file access service. The management node's own high availability can be provided by a hot-standby pair: when the primary management node fails, the standby takes over its work. Since the management node's work consists mainly of communicating with other nodes and its load is light, the hot-standby management node pair has little impact on the availability of the system as a whole.
During normal operation, data consistency is maintained by the client read-write mechanism, which differs with the file system. In a parallel file system, a client sends a read request for a file's data stripe directly to the node storing the original and operates normally. A write request, on the other hand, is sent, according to the adjacency-replication distance m, both to the node holding the stripe and to its forward adjacent nodes at adjacency distance at most m; only when the write operations on all these nodes return normally is the write marked successful. Under a distributed file system, the client read-write mechanism is the same as under the parallel file system.
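The write fan-out can be sketched as follows (our own illustration; nodes are numbered 0..N-1 here rather than 1..N):

```python
def forward_neighbors(node, N, m):
    """Forward adjacent nodes of `node` at adjacency distance 1..m on a
    mirror vector ring of N nodes numbered 0..N-1."""
    return [(node + d) % N for d in range(1, m + 1)]

def write_targets(home, N, m):
    """A write for a stripe goes to its home node plus all forward
    adjacent nodes at adjacency distance <= m; the client waits for
    every target to return normally before declaring success."""
    return [home] + forward_neighbors(home, N, m)
```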
When a node in the system fails, the failure must first be judged. If a client's access to file data on a node gets no response within the prescribed timeout, the node is suspected of having failed and the read-write request is retransmitted once; if there is still no response within the timeout, the node is judged failed and marked as such on the client. Each client independently judges whether each node in the storage system has failed. Once a node in the storage system is confirmed failed, none of its read-write requests can be answered normally; the client redirects a read request to the failed node's forward adjacent node at adjacency distance 1 and waits for its response. If that node has also failed, the request is passed on to the forward adjacent node at adjacency distance 2, and so on up to the forward adjacent node at distance m; if that node has also failed, the system has failed. After judging a node failed, the client notifies the management node, sets the failure flag locally, stops sending write requests to that node, and considers a write successful once the remaining relevant nodes respond.
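The read-failover chain can be sketched like so (our own illustration; `try_read` and `is_failed` are hypothetical callbacks standing in for the client's RPC and its local failure flags):

```python
class SystemFailure(Exception):
    """Raised when the home node and all m forward backups have failed."""

def read_with_failover(home, N, m, is_failed, try_read):
    """Send the read to the home node; on failure, fall through the
    forward adjacent nodes at adjacency distance 1..m, in order."""
    for d in range(0, m + 1):
        node = (home + d) % N
        if not is_failed(node):
            return try_read(node)
    raise SystemFailure(f"node {home} and its {m} forward backups failed")
```

For example, with node 3 failed and m = 1, a read addressed to node 3 is served by node 4, matching the m=1 scenario described in the embodiment.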
When a repaired node rejoins the storage system, the system is still in degraded operation, and the first task is to resynchronize the node's data with its neighbors. The synchronization steps are: buffer all requests arriving at this node in a request buffer; read back the node's own original file data from its forward adjacent node; then read data from its reverse adjacent nodes to rebuild the backups of the reverse adjacent nodes. Which backups are rebuilt depends on the current adjacency-replication distance m: the original data of all reverse adjacent nodes at reverse adjacency distance at most m is copied back to this node. Finally, the buffered access requests are processed until all data is synchronized, and the system returns to normal operation.
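The three resynchronization steps can be sketched as follows (our own illustration; the fetch/apply callbacks are hypothetical stand-ins for the node's actual transfer and replay operations):

```python
def resync_repaired_node(node, N, m, fetch_backup_of, fetch_raw,
                         buffered, apply_request):
    """Recovery of a repaired node: (1) read back its own originals from
    the forward neighbor's backup, (2) rebuild backups for every reverse
    neighbor at adjacency distance <= m, (3) replay buffered requests."""
    originals = fetch_backup_of(node, (node + 1) % N)        # step 1
    rebuilt = {src: fetch_raw(src)                           # step 2
               for src in ((node - d) % N for d in range(1, m + 1))}
    for req in buffered:                                     # step 3
        apply_request(req)
    return originals, rebuilt
```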
When a data storage node is added to the system, operation proceeds through the corresponding client read-write mechanism. Under a parallel file system the data must first be redistributed. During redistribution, the original mirror vector ring and the mirror vector ring for the new node count coexist: old data is laid out on the original ring, and the management node of the file system must read all old data and redistribute it on the new ring. Redistribution is done file by file; each file has its own mirror vector ring, and the management node reads all of a file's data from its original ring and redistributes it on the new ring. During redistribution, when a client initiates access to a file it obtains the file's metadata from the metadata server, and the metadata indicates whether the file already uses the new ring. After the redistribution, the relevant information on the parallel file system's metadata server is rewritten, and subsequent client read-write requests for the file are directed to the new ring. The original ring is then cancelled, and the system returns to a single-mirror-vector-ring state. If a client requests reads or writes on a file that is currently being redistributed, a data read-write lock restricts access to that file; and while a redistribution is in progress, further requests to add or remove a node are refused.
Under a distributed file system no redistribution is needed: the newly added node copies data from its reverse adjacent nodes, its own data is copied to its forward adjacent nodes, and the relevant information on the file system's metadata server is then rewritten.
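For the distributed-file-system case, the copies a newly inserted node must make can be enumerated as follows (our own illustration; the node is appended at the ring's tail, as in the embodiment):

```python
def insert_node(ring, new_node, m):
    """Append `new_node` at the tail of the ring and return the nodes it
    pulls backup data from (reverse neighbors at distance <= m) and the
    nodes it pushes its own data to (forward neighbors at distance <= m)."""
    ring = ring + [new_node]
    i = ring.index(new_node)
    pull_from = [ring[(i - d) % len(ring)] for d in range(1, m + 1)]
    push_to = [ring[(i + d) % len(ring)] for d in range(1, m + 1)]
    return ring, pull_from, push_to
```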
When a data storage node is removed from the system, operation again proceeds through the corresponding client read-write mechanism. Under a parallel file system there are two cases for the client read-write mechanism: removing a node while the system is operating normally, and removing a node that has failed. When a node is removed during normal operation, the data is first redistributed; after the redistribution completes, the original mirror vector ring is cancelled, the system returns to a single-ring state, and the node is deleted from the system. The redistribution mechanism is the same as when a node is added. When a failed node must be removed, the data is likewise first redistributed, but reads of the failed node's original data are redirected to the backup on its forward adjacent node at adjacency distance 1. After the redistribution completes, the original ring is cancelled, the system returns to a single-ring state, and the node is deleted from the system.
Under a distributed file system the same two cases arise. When a node is removed during normal operation, the management node first ensures, according to the adjacency-replication distance m, that the node's m reverse adjacent nodes have correct backups on the node's m forward adjacent nodes; then the node's original data is merged into the original data on its forward adjacent node at adjacency distance 1, and backups of the new original data are established on the m forward adjacent nodes after it. When this is done, the node is deleted from the system. When a failed node must be removed, the management node likewise first ensures, according to the adjacency-replication distance m, that the failed node's m reverse adjacent nodes have correct backups on its m forward adjacent nodes; then the backup on the failed node's forward adjacent node at adjacency distance 1 is merged with that node's own original data, and backups of the new original data are established on the m forward adjacent nodes after it. When this is done, the node is deleted from the system.
When two or more nodes are added to or removed from the system, the client read-write mechanism is the same as when a single node is added or removed. When removing nodes, removal is restricted so that the system always retains the minimum mirror vector ring of two nodes.
When the availability level of the system is changed, the adjacency-replication distance m is first determined from the new availability level, and the high availability of the distributed storage system is maintained by the client read-write mechanism for increasing or decreasing m. When the availability level is raised, so that the adjacency-replication distance grows from m1 to m2, the management node copies the data of each node to all of its forward adjacent nodes at adjacency distance greater than m1 and at most m2, and rewrites the relevant information on the file system's metadata server. When the availability level is lowered, so that the adjacency-replication distance shrinks from m1 to m2, the management node deletes the data on all forward adjacent nodes of each node at adjacency distance greater than m2 and at most m1, and rewrites the relevant metadata. While the availability level is being changed, client read-write requests to the system are refused; when the change completes, service is reopened and client requests are allowed again.
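The copy and delete work implied by a level change can be enumerated as follows (our own illustration; 0-based node numbers):

```python
def level_change_ops(N, m_old, m_new):
    """(op, source, target) triples implied by changing the adjacency-
    replication distance from m_old to m_new on an N-node ring: raising
    the level copies each node's data to the newly covered forward
    neighbors; lowering it deletes from the no-longer-covered ones."""
    lo, hi = sorted((m_old, m_new))
    op = "copy" if m_new > m_old else "delete"
    return [(op, node, (node + d) % N)
            for node in range(N)
            for d in range(lo + 1, hi + 1)]
```

With N=6 and m going from 1 to 2, this yields one copy per node to its distance-2 forward neighbor (node 0's data to node 2, node 1's to node 3, and so on), which matches the Figure 4 scenario up to the numbering offset.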
4. Description of drawings
Figure 1 is a schematic of the data distribution in a mirror vector ring with m=1, N=6. Each storage node's data is backed up on its forward adjacent node at adjacency distance 1.
Figure 2 is a schematic of the data distribution in a mirror vector ring with m=2, N=6. Each storage node's data is backed up on its forward adjacent nodes at adjacency distances 1 and 2.
Figure 3 is a schematic of the data redistribution when a mirror vector ring with m=1 expands from N=5 to N=6. The three diagrams show the data distribution in the system before, during, and after the node joins. During the join, two mirror vector rings exist: the light ring is the original mirror vector ring, the dark ring the new one.
Figure 4 is a schematic of the data distribution when a mirror vector ring with N=6 changes from m=1 to m=2. The upper and lower diagrams show the data distribution in the system at m=1 and at m=2 respectively; the dark ring marks the mirror data newly added on top of the original ring's distribution.
5. Embodiment
A high-availability distributed storage system can be built on either a parallel file system or a distributed file system. Below we take building the system on a parallel file system as an example and set out the construction process in detail, covering the system construction for m=1 and m=2, the failure and recovery mechanisms, node addition and removal, and the client read-write mechanisms used when the system availability level changes.
As shown in Figure 1, this is the system state when the adjacency-replication distance m is 1. The system consists of 6 nodes; the arrows indicate the forward direction of the mirror vector ring, i.e. the positive direction of adjacency replication. The solid circle inside a node represents the node's locally stored data; the dashed circle represents the backup, on this node, of its reverse adjacent node's data. The nodes, numbered 1 to 6 by network identity, form a closed, end-to-end mirror vector ring. During operation, a client sends read requests for a file's data stripes directly to the original storage nodes in the prescribed order (1-2-3-4-5-6) and operates normally. A write request is sent simultaneously to the node holding the stripe and to its forward adjacent node, i.e. written once in the solid-circle order (1-2-3-4-5-6) and once in the dashed-circle order (1-2-3-4-5-6); only when the write operations on all nodes return normally is the write marked successful.
When a node in the system fails, say node 3, the client read-write mechanism applies the following rule. Node 3 is first judged to be a failed node: if access to node 3 gets no response within the prescribed timeout, it is suspected of having failed and the read-write request is retransmitted once; if there is still no response within the timeout, node 3 is judged failed and marked on the client. While node 3 is in the failed state, the client redirects read-write requests meant for node 3 to its forward adjacent node, node 4, and waits for node 4's response.
When node 3 rejoins the storage system after repair, the system is still in degraded operation, and node 3's data is resynchronized with its neighbors: all requests arriving at node 3 are first buffered in a request buffer; node 3 then reads back all of its original file data from its forward adjacent node 4, and reads data from its reverse adjacent node 2 to rebuild the backup of the reverse adjacent node. When this is done, the buffered access requests are processed until all data is synchronized, and the system returns to normal.
Figure 2 is a schematic of the system state when the adjacency-replication distance m is 2. The system consists of 6 nodes; the arrows indicate the forward direction of the mirror vector ring, i.e. the positive direction of adjacency replication. The solid circle inside a node represents the node's locally stored data; the dashed circles represent the backups, on this node, of the data of its reverse adjacent nodes at reverse adjacency distances 1 and 2. The nodes, numbered 1 to 6 by network identity, form a closed, end-to-end mirror vector ring. During operation, a client sends read requests for a file's data stripes directly to the original storage nodes in the prescribed order (e.g. 1-2-3-4-5-6) and operates normally. A write request is sent simultaneously to the node holding the stripe and to its forward adjacent nodes at adjacency distances 1 and 2, i.e. written once in the solid-circle order (1-2-3-4-5-6) and once in each of the two dashed-circle orders (1-2-3-4-5-6); only when the write operations on all nodes return normally is the write marked successful.
When a node fails or rejoins the system after recovery, the client read-write mechanism is the same as for m=1. If two adjacent nodes fail in succession within the prescribed recovery time, they are treated as having failed simultaneously, and the client read-write mechanism applies the following rule. Suppose nodes 3 and 4 fail simultaneously: they are first judged to be failed nodes, with the same judgment mechanism as for m=1. Read requests meant for node 3 or node 4 are redirected to their adjacent node 5, and the client waits for node 5's response. A write request meant for node 3 is sent to node 5, waiting for node 5's response; a write request meant for node 4 is sent to nodes 5 and 6, waiting for the responses of both.
When node 3 rejoins the storage system after repair, all requests arriving at node 3 are first buffered in a request buffer; node 3 then reads back all of its original file data from its forward adjacent node 5, and reads data from reverse adjacent nodes 1 and 2 to rebuild the backups of the reverse adjacent nodes. When this is done, the buffered access requests are processed until all data is synchronized, and the system returns to normal.
When node 4 rejoins the storage system after repair, all requests arriving at node 4 are first buffered in a request buffer; node 4 then reads back all of its original file data from its forward adjacent node 5, and reads data from reverse adjacent nodes 2 and 3 to rebuild the backups of the reverse adjacent nodes. If reverse adjacent node 3 has not yet completed its repair, it is marked as a failed node, and the client read-write mechanism is the same as for m=1. When this is done, the buffered access requests are processed until all data is synchronized, and the system returns to normal.
Figure 3 is a schematic of the system state when the system expands from 5 nodes to 6 nodes with m=1. When a node joins the system, it is usually appended at the tail of the ring's network identities; as the figure shows, node 6 is appended after node 5. The join is divided into three phases: before the join, during the join, and after the join. Before node 6 joins, the system has a single mirror vector ring, and the stripes of a file are read in the order 1-2-3-4-5. When node 6 joins, the management node of the file system redistributes the data; during this process two mirror vector rings exist, the original ring and the newly built ring, drawn in the figure as the light and dark rings respectively. The redistribution is done file by file: the management node reads all of a file's data from the original ring and redistributes it on the new ring. That is, the management node reads a file's stripes from the original ring (1-2-3-4-5 of the solid circles in the figure), reassembles the file and distributes it on the new ring (1-2-3-4-5-6 of the solid circles), while copying the new stripes to their forward adjacent nodes (1-2-3-4-5-6 of the dashed circles). During redistribution, a client accessing a file obtains its metadata from the metadata server; the metadata indicates whether the file uses the new ring or still the original one, and the client's subsequent reads and writes are carried out on the ring the metadata indicates. After the data is redistributed, the relevant information on the parallel file system's metadata server is rewritten, and subsequent client read-write requests for the file are directed to the new ring. If a client requests a file currently being redistributed, a data read-write lock restricts access to it. After all pending data has been redistributed, the management node cancels the original ring, and the system returns to a single-mirror-vector-ring state. The whole redistribution can be carried out by a background process whose priority can be set to the minimum, to preserve the quality of service offered externally.
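The restriping from the 5-node ring to the 6-node ring can be sketched as follows (our own illustration; stripes are indexed from 0 and laid out round-robin):

```python
def restripe(num_stripes, old_ring, new_ring):
    """For each stripe index, the node it is read from (old ring layout)
    and the node it lands on (new ring layout), round-robin in ring order."""
    return [(i, old_ring[i % len(old_ring)], new_ring[i % len(new_ring)])
            for i in range(num_stripes)]

# Expanding 1-2-3-4-5 to 1-2-3-4-5-6: stripe 5 moves from node 1 to node 6.
```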
Figure 4 is a schematic of the data distribution when a mirror vector ring with N=6 changes from m=1 to m=2. The upper diagram shows the data distribution in the system at m=1, the lower at m=2. When the availability level of the system is changed, the adjacency-replication distance m is first determined from the new level; as the figure shows, raising the availability level by increasing m from 1 to 2 improves the high availability of the system. While the availability level is being changed, client read-write requests to the system are refused.
At m=1, as the upper diagram shows, the backup of each node's data resides on its forward adjacent node at adjacency distance 1: node 2 holds the backup of node 1's data, node 3 the backup of node 2's, and so on. At m=2, the backup of each node's data must reside on its forward adjacent nodes at adjacency distances 1 and 2, so the management node reads the full data of each node and copies it to the forward adjacent node at adjacency distance 2: as the lower diagram shows, node 1's data is copied to node 3, node 2's to node 4, and so on. When the copying completes, the relevant information on the file system's metadata server is rewritten so that the new availability level takes effect at run time; after the rewrite completes, service is reopened and client read-write requests to the system are allowed again. If the availability level is changed so that the adjacency-replication distance drops from 2 to 1, the mechanism mirrors the increase from 1 to 2: the management node deletes the data on each node's forward adjacent node at adjacency distance 2 and rewrites the relevant information on the file system's metadata server.
The method of the present invention is applicable to most network file systems. By adapting a network file system, it can provide, on a relatively inexpensive storage cluster, high-availability guarantees currently available only from dedicated systems. Moreover, the availability level of a storage system using this technique can be configured as needed, so it can be widely used in network storage environments with very high data-availability requirements.
Because the method requires no special-purpose hardware support, it can be applied in ordinary network storage clusters; it thus suits most inexpensive workstation clusters and offers a high performance-to-cost ratio. Applying the method of the present invention can therefore yield considerable economic benefit, and it can also largely change the current reliance of high-availability support in high-end applications on foreign technology.

Claims (13)

1. A method for constructing a high-availability distributed storage system, characterized in that the data storage nodes of the distributed storage system are organized in order into a mirror vector ring; a network identity is set on each storage node in the mirror vector ring; the data of each node is copied to its adjacent nodes by an adjacency-replication technique; and when a node fails, when nodes are added or removed, or when the availability level changes, different client read-write mechanisms guarantee the high availability, scalability, and dynamic configurability of the distributed storage system.
2. The method according to claim 1, characterized in that the mirror vector ring is formed by giving each data storage node in the interconnected network a unique network identity and linking all the identities, in a fixed order, into a closed end-to-end chain; its organization differs with the file system adopted, comprising a mirror vector ring under a parallel file system and a mirror vector ring under a distributed file system; the direction in which the network identities are arranged in order is the forward direction; the adjacency direction of a node that agrees with the ring's direction is the forward adjacency direction, and the opposite is the reverse adjacency direction; and in either direction it must be guaranteed that the data storage nodes in the network and the clients have direct or indirect physical paths.
3. The method according to claim 2, characterized in that under a parallel file system the data of a file in the mirror vector ring may be divided into a sequence of strips that are stored on successive nodes in the order defined by the network identifiers of the ring, whereas under a distributed file system the mirror vector ring is used only to determine the location of the destination nodes in the data replication process.
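A minimal sketch of the parallel-file-system striping described in claim 3, again purely illustrative (the function name, round-robin placement, and strip size are assumptions, not the patent's specification):

```python
# Illustrative sketch (not from the patent): striping a file's data across
# the nodes of a mirror vector ring in network-identifier order.
def stripe_file(data, nodes, strip_size):
    """Split data into strips and assign them round-robin to the ring nodes."""
    strips = [data[i:i + strip_size] for i in range(0, len(data), strip_size)]
    placement = {}
    for k, strip in enumerate(strips):
        # Strip k goes to the node at position k in ring order, wrapping.
        placement.setdefault(nodes[k % len(nodes)], []).append(strip)
    return placement

layout = stripe_file(b"abcdefgh", ["n0", "n1", "n2"], strip_size=2)
assert layout == {"n0": [b"ab", b"gh"], "n1": [b"cd"], "n2": [b"ef"]}
```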
4. The method according to claim 1, characterized in that adjacency replication is a data backup scheme for the distributed storage system in which the adjacency distance m of replication is determined by the availability level; a file, or a file strip, is copied from its home node to all forward-adjacent nodes at adjacency distance less than or equal to m; and the consistency of each copy with the original data is guaranteed during the file access stage by the client read/write mechanism: a read request for a file strip is sent directly to the original storage node and handled normally, whereas a write request first determines the adjacency distance m of replication from the availability level, is then sent to the node holding the strip and to all of its forward-adjacent nodes at distance less than or equal to m, and is reported successful only after the write operations on all of those nodes have returned normally.
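The write path of claim 4 can be sketched as follows; this is an illustrative toy (node "stores" are plain dicts, and all names are invented), not the patent's implementation:

```python
# Illustrative sketch (not from the patent): adjacency replication with
# adjacency distance m over a ring of node identifiers. A write goes to the
# strip's home node and its m forward-adjacent nodes and is reported
# successful only after all of them return; a read goes to the home node only.
def forward_node(nodes, node, distance):
    return nodes[(nodes.index(node) + distance) % len(nodes)]

def write_strip(nodes, stores, home, key, value, m):
    targets = [home] + [forward_node(nodes, home, d) for d in range(1, m + 1)]
    for node in targets:
        stores[node][key] = value   # wait for each node's write to return
    return True                     # success only after every copy is written

def read_strip(stores, home, key):
    return stores[home][key]        # reads go directly to the original node

nodes = ["n0", "n1", "n2", "n3"]
stores = {n: {} for n in nodes}
write_strip(nodes, stores, "n3", "strip7", b"data", m=2)
assert stores["n3"]["strip7"] == b"data"   # original copy
assert stores["n0"]["strip7"] == b"data"   # forward distance 1 (wraps)
assert stores["n1"]["strip7"] == b"data"   # forward distance 2
assert "strip7" not in stores["n2"]        # outside the adjacency distance
```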
5. The method according to claim 1, characterized in that the availability level is determined as follows: in the distributed storage system, let P be the mean failure probability of each node within a time interval T, let N be the number of nodes in the storage system, and let m be the adjacency distance of adjacency replication; the system fails when m+1 or more adjacent nodes in the system fail simultaneously, and the availability level is therefore defined as the probability P_A that the system remains accessible throughout the time T, computed as:

P_A = 1 − C(N,1)·P^(m+1)·(1−P)^(N−m−1) − C(N,1)·C(N−m−1,1)·P^(m+2)·(1−P)^(N−m−2) − … − C(N,1)·C(N−m−1,N−m−2)·P^(N−1)·(1−P) − P^N

where C(n,k) denotes the binomial coefficient "n choose k". If the availability level of the distributed storage system is required to be at least P_1, the minimum adjacency distance m of adjacency replication satisfying the condition can be determined from the inequality:

P_1 ≤ 1 − C(N,1)·P^(m+1)·(1−P)^(N−m−1) − C(N,1)·C(N−m−1,1)·P^(m+2)·(1−P)^(N−m−2) − … − C(N,1)·C(N−m−1,N−m−2)·P^(N−1)·(1−P) − P^N
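The formula of claim 5 can be evaluated numerically; the sketch below is illustrative only (the function names are invented, and the closed-form general term C(N,1)·C(N−m−1, k−m−1)·P^k·(1−P)^(N−k) for k failed nodes is our reading of the reconstructed series, not text taken from the patent):

```python
# Illustrative sketch (not from the patent): evaluating the availability
# formula of claim 5 and choosing the smallest adjacency distance m that
# meets a required availability level P1. The subtracted term for exactly
# k failed nodes (m+1 <= k <= N-1) is C(N,1)*C(N-m-1, k-m-1)*P**k*(1-P)**(N-k).
from math import comb

def availability(N, P, m):
    lost = P ** N                      # the term where all N nodes fail
    for k in range(m + 1, N):          # k nodes fail, m+1 of them adjacent
        lost += comb(N, 1) * comb(N - m - 1, k - m - 1) \
                * P ** k * (1 - P) ** (N - k)
    return 1 - lost

def min_adjacency_distance(N, P, P1):
    """Smallest m with availability(N, P, m) >= P1, or None if none exists."""
    for m in range(1, N):
        if availability(N, P, m) >= P1:
            return m
    return None

a1 = availability(8, 0.01, 1)
a2 = availability(8, 0.01, 2)
assert a2 > a1   # a larger adjacency distance never lowers availability
```

For example, with N = 8 nodes and P = 0.01, an availability target of 0.9999 already forces m = 2 rather than m = 1 under this reading of the formula.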
6. The method according to claim 1, characterized in that the client read/write mechanism used when a node fails comprises the strategy by which a client determines node failure during access and the data read/write mechanism applied after the failure. When a client accessing the data of a file on a node receives no response within the specified failure timeout, it suspects that the node has failed and retransmits the read/write request once; if there is still no response within the timeout, the node is confirmed failed, the management node of the file system is notified, and the failure is marked at the client. The management node of the file system in the distributed storage system is responsible for management and maintenance tasks such as recovering failed nodes, adding or removing nodes, and changing the availability level of the system; it does not itself provide the file access service. In the data read/write mechanism after a node failure, the client no longer sends read/write requests to the failed node but sends them directly to its forward-adjacent nodes: when the adjacency distance m = 1, the client transfers a read request to the forward-adjacent node at adjacency distance 1 and waits for its response; when m ≥ 2, the client sends write requests to the forward-adjacent nodes of the failed node at adjacency distance less than or equal to m, transfers read requests to the forward-adjacent node at adjacency distance 1 and waits for its response, and if that node has also failed, turns in succession to the forward-adjacent node at adjacency distance 2, and so on up to the forward-adjacent node at adjacency distance m. When a repaired failed node is to rejoin the storage system, the data synchronization operation between the recovered node and its adjacent nodes is carried out first, after which the system returns to the client read/write mechanism of normal operation.
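The read redirection of claim 6 can be sketched as below; this is an illustrative toy (the function names, the `alive` set, and the dict stores are invented for the example), not the patent's protocol:

```python
# Illustrative sketch (not from the patent): after a node is confirmed
# failed, a read is retried on its forward-adjacent nodes at adjacency
# distances 1..m until a live node holding a backup copy responds.
def forward_node(nodes, node, distance):
    return nodes[(nodes.index(node) + distance) % len(nodes)]

def read_after_failure(nodes, alive, stores, failed_home, key, m):
    for d in range(1, m + 1):
        candidate = forward_node(nodes, failed_home, d)
        if candidate in alive and key in stores[candidate]:
            return stores[candidate][key]   # backup copy on a live neighbour
    raise IOError("no live replica within adjacency distance m")

nodes = ["n0", "n1", "n2", "n3"]
stores = {n: {} for n in nodes}
stores["n1"]["s"] = b"x"                    # backup of n0's strip on n1
alive = {"n1", "n2", "n3"}                  # n0 has failed
assert read_after_failure(nodes, alive, stores, "n0", "s", m=2) == b"x"
```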
7. The method according to claim 6, characterized in that in the data synchronization operation between the recovered node and its adjacent nodes, all write requests arriving at the recovering node are first buffered in a request buffer; the node then reads back all of its own original file data from its forward-adjacent node; and then, according to the adjacency distance of adjacency replication, it reads data from its reverse-adjacent nodes to re-establish the backup data of those nodes: if the adjacency distance of replication is m, the original data on every reverse-adjacent node at adjacency distance less than or equal to m is copied to the recovering node.
8. The method according to claim 1, characterized in that the client read/write mechanism used when nodes are added or removed is divided, according to the file system type, into a client read/write mechanism under a parallel file system and a client read/write mechanism under a distributed file system. Under a parallel file system, when one or more data storage nodes are added to the system, the data is first redistributed: the original mirror vector ring is dissolved, and the redistribution of the files restores the storage system to a state with a single mirror vector ring. During redistribution each file has its own mirror vector ring, and the original mirror vector ring and the mirror vector ring for the new node count exist simultaneously: old data remains distributed according to the original ring, and the management node of the file system must read all of the old data of each file and redistribute it according to the new mirror vector ring. After the data of a file has been redistributed, the relevant information on the metadata server of the parallel file system must be rewritten, and subsequent client read/write requests for the file are directed to the new mirror vector ring.
9. The method according to claim 8, characterized in that when a client initiates access to a file, it first obtains the metadata of the file from the metadata server; the metadata indicates whether the file uses the new mirror vector ring or the original mirror vector ring. If the file that the client requests to read or write is currently being redistributed, a corresponding data read/write lock mechanism is applied to restrict the client's reads and writes of the file; and while the data read/write lock mechanism is active during redistribution, further requests to add or remove nodes in the system are forbidden.
10. The method according to claim 8 or 9, characterized in that in the client read/write mechanism under a distributed file system the data does not need to be redistributed: a newly added node copies data from its reverse-adjacent node and copies its own data to its forward-adjacent node, after which the relevant information on the metadata server of the file system is rewritten. In the client read/write mechanism when a node is removed under a parallel file system, the system runs the normal-state node-removal mechanism: the data is first redistributed; after the redistribution is complete, the original mirror vector ring is dissolved, the system returns to the single mirror vector ring state, and the removed node is deleted from the system; during the redistribution, read requests for the original data of the removed node are transferred to the backup data on its forward-adjacent node at adjacency distance 1. In the client read/write mechanism when nodes are removed while the system is running, node removal is restricted to stop at the minimum mirror vector ring configuration of two nodes.
11. The method according to any one of claims 8 to 10, characterized in that in the client read/write mechanism when a node is removed under a distributed file system in the normal state, the management node, according to the adjacency distance m of adjacency replication, ensures that the m reverse-adjacent nodes of the node being removed have correct backup data on the m forward-adjacent nodes of that node; the original data of the node is then merged into the original data on its forward-adjacent node at adjacency distance 1, and backup data of the new original data is established on the m forward-adjacent nodes thereafter; once the backup data is complete, the node is deleted from the system.
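The merge-and-rebackup step of claims 11-12 can be sketched as follows; this is an illustrative toy (the function names, the `originals`/`stores` dicts, and the skipping of the victim as a backup target are assumptions made for the example):

```python
# Illustrative sketch (not from the patent): removing a node under adjacency
# distance m -- merge its original data into its distance-1 forward
# neighbour, re-establish backups of the merged data on the next m forward
# neighbours, then delete the node from the ring.
def forward_node(nodes, node, distance):
    return nodes[(nodes.index(node) + distance) % len(nodes)]

def remove_node(nodes, stores, originals, victim, m):
    heir = forward_node(nodes, victim, 1)
    originals[heir].update(originals[victim])        # merge original data
    for d in range(1, m + 1):                        # rebuild backups of the
        target = forward_node(nodes, heir, d)        # merged original data
        if target != victim:                         # victim is going away
            stores[target].update(originals[heir])
    nodes.remove(victim)                             # delete node from ring
    del stores[victim], originals[victim]

nodes = ["n0", "n1", "n2", "n3"]
originals = {"n0": {}, "n1": {"k1": 1}, "n2": {"k2": 2}, "n3": {}}
stores = {n: {} for n in nodes}
remove_node(nodes, stores, originals, "n1", m=1)
assert originals["n2"] == {"k1": 1, "k2": 2}   # heir holds the merged data
assert stores["n3"] == {"k1": 1, "k2": 2}      # backup re-established
assert nodes == ["n0", "n2", "n3"]
```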
12. The method according to any one of claims 8 to 11, characterized in that in the client read/write mechanism when a failed node in the system must be removed, the management node, according to the adjacency distance m of adjacency replication, ensures that the m reverse-adjacent nodes of the failed node have correct backup data on the m forward-adjacent nodes of that node; the backup data on the forward-adjacent node of the failed node at adjacency distance 1 is then merged with the original data on that node, and backup data of the new original data is established on the m forward-adjacent nodes thereafter; once the backup data is complete, the failed node is deleted from the system.
13. The method according to claim 1, characterized in that in the client read/write mechanism used when the availability level is changed, the availability level determines the adjacency distance m of adjacency replication, and the client read/write mechanism guarantees the high availability of the distributed storage system by increasing or decreasing the value of m. If the adjacency distance of replication is increased from m1 to m2, the management node copies the data of each node to all of its forward-adjacent nodes at adjacency distance greater than m1 and less than or equal to m2, and rewrites the relevant information on the metadata server of the file system; if the adjacency distance of replication is reduced from m1 to m2, the management node deletes the data on all forward-adjacent nodes of each node at adjacency distance greater than m2 and less than or equal to m1, and rewrites the relevant information on the metadata server of the file system. While the availability level of the system is being changed, client read/write requests to the system are forbidden.
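The copy/delete step of claim 13 can be sketched as below; again an illustrative toy only (function names and the `originals`/`stores` representation are invented for the example, and client requests are assumed to be blocked while it runs):

```python
# Illustrative sketch (not from the patent): adjusting the adjacency distance
# of replication from m1 to m2. Increasing it copies each node's original
# data to the forward neighbours at distances m1+1..m2; decreasing it deletes
# the copies held at distances m2+1..m1.
def forward_node(nodes, node, distance):
    return nodes[(nodes.index(node) + distance) % len(nodes)]

def change_adjacency_distance(nodes, stores, originals, m1, m2):
    for node in nodes:
        if m2 > m1:                          # raise the availability level
            for d in range(m1 + 1, m2 + 1):
                stores[forward_node(nodes, node, d)].update(originals[node])
        else:                                # lower the availability level
            for d in range(m2 + 1, m1 + 1):
                for key in originals[node]:
                    stores[forward_node(nodes, node, d)].pop(key, None)

nodes = ["n0", "n1", "n2", "n3"]
originals = {n: {"k_" + n: n} for n in nodes}
stores = {n: dict(originals[n]) for n in nodes}
for n in nodes:                              # existing backups at distance 1
    stores[forward_node(nodes, n, 1)].update(originals[n])
change_adjacency_distance(nodes, stores, originals, 1, 2)
assert "k_n0" in stores["n2"]                # new backup at distance 2
```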
CNB03112402XA 2003-06-09 2003-06-09 Method for constructing high-available distributed memory system Expired - Fee Related CN1326045C (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CNB03112402XA CN1326045C (en) 2003-06-09 2003-06-09 Method for constructing high-available distributed memory system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CNB03112402XA CN1326045C (en) 2003-06-09 2003-06-09 Method for constructing high-available distributed memory system

Publications (2)

Publication Number Publication Date
CN1567237A true CN1567237A (en) 2005-01-19
CN1326045C CN1326045C (en) 2007-07-11

Family

ID=34468913

Family Applications (1)

Application Number Title Priority Date Filing Date
CNB03112402XA Expired - Fee Related CN1326045C (en) 2003-06-09 2003-06-09 Method for constructing high-available distributed memory system

Country Status (1)

Country Link
CN (1) CN1326045C (en)


Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102346779B (en) * 2011-10-18 2013-07-31 中国联合网络通信集团有限公司 Distributed file system and master control node backup method

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH08195767A (en) * 1995-01-19 1996-07-30 Fuji Electric Co Ltd Method for monitoring backup ring
US6260069B1 (en) * 1998-02-10 2001-07-10 International Business Machines Corporation Direct data retrieval in a distributed computing system
US6990606B2 (en) * 2000-07-28 2006-01-24 International Business Machines Corporation Cascading failover of a data management application for shared disk file systems in loosely coupled node clusters

Cited By (32)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101188569B (en) * 2006-11-16 2011-05-04 饶大平 Method for constructing data quanta space in network and distributed file storage system
CN102106125A (en) * 2008-07-25 2011-06-22 格诺多有限公司 A multi-path network
CN102106125B (en) * 2008-07-25 2016-01-27 克雷Uk有限公司 A kind of multi-path network
US8898431B2 (en) 2008-07-25 2014-11-25 Cray HK Limited Multi-path network
CN101888398A (en) * 2009-05-20 2010-11-17 中国科学院声学研究所 Data storage method based on network storage structure of (d, k) mole diagram
CN101888398B (en) * 2009-05-20 2012-11-21 中国科学院声学研究所 Data storage method based on network storage structure of (d, k) mole diagram
US8972773B2 (en) 2009-05-25 2015-03-03 Alibaba Group Holding Limited Cache data processing using cache cluster with configurable modes
CN101562543B (en) * 2009-05-25 2013-07-31 阿里巴巴集团控股有限公司 Cache data processing method and processing system and device thereof
CN102640125A (en) * 2009-09-21 2012-08-15 特兰斯拉蒂斯公司 Distributed content storage and retrieval
CN102640125B (en) * 2009-09-21 2015-07-08 高通股份有限公司 Distributed content storage and retrieval
CN102122306A (en) * 2011-03-28 2011-07-13 中国人民解放军国防科学技术大学 Data processing method and distributed file system applying same
CN102622412A (en) * 2011-11-28 2012-08-01 中兴通讯股份有限公司 Method and device of concurrent writes for distributed file system
CN103516734B (en) * 2012-06-20 2018-01-12 阿里巴巴集团控股有限公司 Data processing method, equipment and system
CN103516734A (en) * 2012-06-20 2014-01-15 阿里巴巴集团控股有限公司 Data processing method, device and system
CN103973497A (en) * 2014-05-23 2014-08-06 浪潮电子信息产业股份有限公司 Method and device for achieving multi-channel concurrent storage based on high-density micro-servers
CN104202434A (en) * 2014-09-28 2014-12-10 北京奇虎科技有限公司 Node access method and device
WO2016065612A1 (en) * 2014-10-31 2016-05-06 华为技术有限公司 Method, system, and host for accessing files
CN104639661A (en) * 2015-03-13 2015-05-20 华存数据信息技术有限公司 Distributed storage system and storing and reading method for files
CN105847855A (en) * 2016-05-13 2016-08-10 天脉聚源(北京)传媒科技有限公司 Program processing method and system
CN106527982B (en) * 2016-10-25 2019-04-12 西安交通大学 A kind of object distribution algorithm for the object storage system being made of heterogeneous storage devices
CN106527982A (en) * 2016-10-25 2017-03-22 西安交通大学 Object distribution algorithm for object storage system consisting of heterogeneous storage devices
CN108513658A (en) * 2016-12-30 2018-09-07 华为技术有限公司 A kind of transaction methods and device
US11176086B2 (en) 2016-12-30 2021-11-16 Huawei Technologies Co., Ltd. Parallel copying database transaction processing
CN108513658B (en) * 2016-12-30 2022-02-25 华为技术有限公司 Transaction processing method and device
CN107357689A (en) * 2017-08-02 2017-11-17 郑州云海信息技术有限公司 The fault handling method and distributed memory system of a kind of memory node
CN107357689B (en) * 2017-08-02 2020-09-08 郑州云海信息技术有限公司 Fault processing method of storage node and distributed storage system
CN110019065A (en) * 2017-09-05 2019-07-16 阿里巴巴集团控股有限公司 Processing method, device and the electronic equipment of daily record data
CN110019065B (en) * 2017-09-05 2023-05-05 阿里巴巴集团控股有限公司 Log data processing method and device and electronic equipment
CN110901691A (en) * 2018-09-17 2020-03-24 株洲中车时代电气股份有限公司 Ferroelectric data synchronization system and method and train network control system
WO2020063424A1 (en) * 2018-09-27 2020-04-02 北京白山耘科技有限公司 Distributed data system, distributed data synchronization method, computer storage medium, and computer device
CN109407981A (en) * 2018-09-28 2019-03-01 深圳市茁壮网络股份有限公司 A kind of data processing method and device
CN113157492A (en) * 2021-04-07 2021-07-23 北京思特奇信息技术股份有限公司 Backup method, recovery method and backup system of distributed database

Also Published As

Publication number Publication date
CN1326045C (en) 2007-07-11

Similar Documents

Publication Publication Date Title
CN1567237A (en) Method for constructing high-available distributed memory system
US7523110B2 (en) High availability designated winner data replication
US5423037A (en) Continuously available database server having multiple groups of nodes, each group maintaining a database copy with fragments stored on multiple nodes
US9785498B2 (en) Archival storage and retrieval system
US8214334B2 (en) Systems and methods for distributed system scanning
US7120651B2 (en) Maintaining a shared cache that has partitions allocated among multiple nodes and a data-to-partition mapping
RU2208834C2 (en) Method and system for recovery of database integrity in system of bitslice databases without resource sharing using shared virtual discs and automated data medium for them
US6934725B1 (en) Management of file extent mapping to hasten mirror breaking in file level mirrored backups
CN101334797B (en) Distributed file systems and its data block consistency managing method
US8010514B2 (en) System and method for a distributed object store
CN1320483C (en) System and method for implementing journaling in a multi-node environment
US5555404A (en) Continuously available database server having multiple groups of nodes with minimum intersecting sets of database fragment replicas
CN101697168B (en) Method and system for dynamically managing metadata of distributed file system
US9396202B1 (en) Weakly synchronized garbage collection and compaction for aggregated, replicated object stores
US7779128B2 (en) System and method for perennial distributed back up
US7979441B2 (en) Method of creating hierarchical indices for a distributed object system
RU2372649C2 (en) Granular control of authority of duplicated information by means of restriction and derestriction
CN101577735B (en) Method, device and system for taking over fault metadata server
US9141480B2 (en) Handling failed transaction peers in a distributed hash table
US20200050387A1 (en) Fast migration of metadata
EP0722236B1 (en) System and method for fault tolerant key management
Goodrich et al. The rainbow skip graph: a fault-tolerant constant-degree distributed data structure
CN105359099A (en) Index update pipeline
CN105049258A (en) Data transmission method of network disaster-tolerant system
CN1299203C (en) Data disaster tocerance backup control system

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20070711

Termination date: 20160609

CF01 Termination of patent right due to non-payment of annual fee