US20090063807A1 - Data redistribution in shared nothing architecture - Google Patents

Data redistribution in shared nothing architecture

Info

Publication number
US20090063807A1
Authority
US
United States
Prior art keywords
partition
changes
sending
data
hard disk
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Abandoned
Application number
US11/847,270
Inventor
Philip Shawn Cox
Leo T.M. Lau
Adil Mohammad Sardar
David Tremaine
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
International Business Machines Corp
Original Assignee
International Business Machines Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by International Business Machines Corp filed Critical International Business Machines Corp
Priority to US11/847,270 priority Critical patent/US20090063807A1/en
Assigned to INTERNATIONAL BUSINESS MACHINES CORPORATION. Assignment of assignors interest (see document for details). Assignors: TREMAINE, DAVID; COX, PHILIP SHAWN; LAU, LEO T.M.; SARDAR, ADIL MOHAMMAD
Priority to US12/274,741 priority patent/US9672244B2/en
Publication of US20090063807A1 publication Critical patent/US20090063807A1/en
Legal status: Abandoned

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/20: Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F 16/23: Updating
    • G06F 16/2379: Updates performed during online database operations; commit processing
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/20: Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F 16/22: Indexing; Data structures therefor; Storage structures
    • G06F 16/2282: Tablespace storage structures; Management thereof
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/20: Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F 16/27: Replication, distribution or synchronisation of data between databases or within a distributed database system; Distributed database system architectures therefor
    • G06F 16/278: Data partitioning, e.g. horizontal or vertical partitioning
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/20: Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F 16/27: Replication, distribution or synchronisation of data between databases or within a distributed database system; Distributed database system architectures therefor
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 2201/00: Indexing scheme relating to error detection, to error correction, and to monitoring
    • G06F 2201/80: Database-specific techniques

Abstract

A system and method for data redistribution. In one embodiment, the method includes dividing data into batches at a sending partition; populating a first data structure with the first pages and the first control information; storing the first data structure in a cache at the sending partition; sending the changes over the network to the receiving partition; receiving a notification that the changes have been successfully stored in the second hard disk at the receiving partition; and storing, in response to the notification, the changes on the first hard disk at the sending partition.

Description

    FIELD OF THE INVENTION
  • The present invention relates to computer systems, and more particularly to a method and system for data redistribution.
  • BACKGROUND OF THE INVENTION
  • In data warehouse growth scenarios involving the addition of new database partitions to an existing database, it is necessary to redistribute the data in the database to achieve an even balance of data among all the database partitions (i.e., both existing partitions and newly added partitions). Such redistribution scenarios typically involve movement of a significant amount of data, and require downtime while the system is offline.
  • When database manager capacity does not meet present or projected future needs, a business can expand its capacity by adding more physical machines. Adding more physical machines can increase both data-storage space and processing power by adding separate single-processor or multiple-processor physical machines. The memory and storage system resources on each machine are typically not shared with the other machines. Although adding machines might result in communication and task-coordination issues, this choice provides the advantage of balancing data and user access across more than one system in a shared-nothing architecture. When new machines are added to a shared-nothing architecture, existing data needs to be redistributed. This operation is called data redistribution. Data redistribution is especially common among data warehouse customers, because the amount of data accumulates over time. In addition, as mergers and acquisitions become more common, the need for more capacity also increases.
  • One problem with this operation is that the requirement to log data movement demands a substantial amount of space to store the logged data. For example, if one machine is added to a two-machine cluster, 33% of the data of the entire database will need to be deleted on the existing machines and 33% will need to be added on the new machine. This significant amount of data logging not only impacts performance, but also places a significant burden on setting up the active logging space and its corresponding backup/restore system.
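As a rough illustration of the data movement involved (a sketch, not part of the patent text), the fraction of a uniformly balanced database that must move when a cluster grows from n_old to n_new partitions can be estimated as follows; the function name and the assumption of perfectly even rebalancing are illustrative only.

```python
# Rough estimate (not from the patent) of how much data must move when a
# uniformly balanced cluster grows from n_old to n_new partitions.
def fraction_moved(n_old: int, n_new: int) -> float:
    """Fraction of the total database that changes partitions when data is
    rebalanced evenly across n_new partitions instead of n_old."""
    return max(0.0, 1.0 - n_old / n_new)

# Growing a two-machine cluster to three machines: about one third of the
# database leaves the existing machines and lands on the new one.
print(round(fraction_moved(2, 3), 3))  # 0.333
```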
  • Accordingly, what is needed is an improved method and system for data redistribution. The present invention addresses such a need.
  • SUMMARY OF THE INVENTION
  • A method and system for data redistribution is disclosed. In one embodiment, the method includes dividing data into batches at a sending partition, wherein the data is to be redistributed to a receiving partition, wherein each batch comprises a plurality of first pages and first control information, wherein the plurality of first pages includes changes to a memory, and wherein the control information is used to restart a distribution process in the event of a failure. The method further includes populating a first data structure with the first pages and the first control information. The method further includes storing the first data structure in a cache at the sending partition, wherein the changes are not stored to a first hard disk at the sending partition until after the changes are successfully stored at the receiving partition. The method further includes sending the changes over the network to the receiving partition, wherein the receiving partition populates a second data structure with second pages and second control information, wherein the plurality of second pages includes the changes, and wherein the changes are subsequently stored in a second hard disk at the receiving partition. The method further includes receiving a notification that the changes have been successfully stored in the second hard disk at the receiving partition. The method further includes storing, in response to the notification, the changes on the first hard disk at the sending partition. According to the system and method disclosed herein, unnecessary logging is avoided and required memory space is minimized.
  • BRIEF DESCRIPTION OF THE DRAWINGS
  • FIG. 1 is a block diagram of a redistribution system in accordance with one embodiment of the present invention.
  • FIG. 2 is a flow chart showing a method for redistributing data in accordance with one embodiment of the present invention.
  • FIG. 3 is a block diagram of a redistribution system in accordance with one embodiment of the present invention.
  • DETAILED DESCRIPTION OF THE INVENTION
  • The present invention relates to computer systems, and more particularly to a method and system for data redistribution. The following description is presented to enable one of ordinary skill in the art to make and use the invention, and is provided in the context of a patent application and its requirements. Various modifications to the preferred embodiment and the generic principles and features described herein will be readily apparent to those skilled in the art. Thus, the present invention is not intended to be limited to the embodiments shown, but is to be accorded the widest scope consistent with the principles and features described herein.
  • A system and method in accordance with the present invention for redistributing data are disclosed. The method includes dividing data that is to be redistributed into batches, where each batch includes pages that contain changes to memory and corresponding control information. The batches are stored in a cache at a sending partition until the batches are stored in a hard disk at a receiving partition. After the batches are successfully stored in the hard disk at the receiving partition, the batches are stored in a hard disk at the sending partition. If there is a failure in the process, the process may restart by reprocessing the batch that was being processed during the failure. As a result, unnecessary logging is avoided, and required memory space is minimized. To more particularly describe the features of the present invention, refer now to the following description in conjunction with the accompanying figures.
  • FIG. 1 is a block diagram of a redistribution system 100 in accordance with one embodiment of the present invention. FIG. 1 shows a sending system or sending partition 102 and a receiving system or receiving partition 104. The sending partition 102 includes a processor 106 and a redistribution application 108 that is executed by the processor 106 and stored in a memory. The sending partition 102 also includes a database management hard disk 110, a file system hard disk 112, consistency points 114, pages 116, control information 118, and a memory cache 120. Similarly, the receiving partition 104 includes a processor 126 and a redistribution application 128 that is executed by the processor 126 and stored in a memory. The receiving partition also includes a database management hard disk 130, a file system hard disk 132, one or more consistency points 134, pages 136, and control information 138. For ease of illustration, only one consistency point 114 is shown at the sending partition 102, and only one consistency point 134 is shown at the receiving partition 104. In operation, the redistribution applications 108 and 128 are operable to divide data into multiple consistency points 114 and 134, as described below.
  • FIG. 2 is a flow chart showing a method for redistributing data in accordance with one embodiment of the present invention. Referring to both FIGS. 1 and 2, in step 202, at the sending partition 102, the redistribution application 108 divides the data that is to be redistributed deterministically into batches, also referred to as consistency points 114, based on invariant database characteristics. The data is divided deterministically into batches because redo operations may require the same batch size and pages. In particular implementations, the database structure may imply the type of table. In one implementation, the redistribution application 108 may select different batch sizes based on how the table that stores the data was created. In one embodiment, a consistency point is a data structure that ties together a series of changes in memory (e.g., pages 116) and corresponding logical structure changes (e.g., control information 118). The consistency point 114 shown in FIG. 1 is the very first consistency point and includes modified or “dirty” pages 116 and corresponding control information 118. As the redistribution application 108 works on the table data stored on the database management hard disk 110, the redistribution application 108 populates the consistency point 114 with the pages and control information (numbered circle 1). In one embodiment, while data is being redistributed, multiple consistency points 114 may exist at any given time. Also, in one embodiment, the redistribution application 108 may store each consistency point 114 in separate consistency point files.
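The following sketch shows one way a consistency point and the deterministic batching described above could be represented. The class and function names (Page, ControlRecord, ConsistencyPoint, divide_into_consistency_points) and the fixed page-count batch size are assumptions for illustration, not the patent's implementation.

```python
from dataclasses import dataclass, field
from typing import Iterable, Iterator, List

@dataclass
class Page:
    page_id: int       # identifies the data page being changed
    rows_changed: int  # number of rows modified on the page
    data: bytes        # the modified ("dirty") page image

@dataclass
class ControlRecord:
    page_id: int
    rows_changed: int

@dataclass
class ConsistencyPoint:
    """Ties together a batch of in-memory page changes and the corresponding
    logical control information, in the spirit of consistency points 114/134."""
    cp_id: int
    pages: List[Page] = field(default_factory=list)
    control: List[ControlRecord] = field(default_factory=list)

    def add(self, page: Page) -> None:
        self.pages.append(page)
        self.control.append(ControlRecord(page.page_id, page.rows_changed))

def divide_into_consistency_points(pages: Iterable[Page],
                                   batch_size: int) -> Iterator[ConsistencyPoint]:
    """Deterministically group pages into fixed-size batches so that a redo
    after a failure rebuilds exactly the same consistency points."""
    cp_id = 0
    cp = ConsistencyPoint(cp_id=cp_id)
    for page in pages:
        cp.add(page)
        if len(cp.pages) == batch_size:
            yield cp
            cp_id += 1
            cp = ConsistencyPoint(cp_id=cp_id)
    if cp.pages:
        yield cp
```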
  • In step 204, for each consistency point, the redistribution application 108 stores the changes (e.g., pages 116) in the memory cache 120 at the sending partition 102. In one embodiment, the redistribution application 108 does not store the pages (i.e., the changes) to the database management hard disk 110 until after the redistribution application 128 at the receiving partition 104 has successfully stored all of its changes (e.g., pages 136) for consistency point 134 to the database management hard disk 130.
  • The redistribution application 108 sends the changes over the network to the receiving partition 104 (numbered circle 2), and the redistribution application 128 populates the consistency point 134 with pages 136, which include the changes, and with control information 138 (numbered circle 3). This process continues until the redistribution application 108 at the sending partition 102 determines that enough data has been processed. Once enough data has been processed, in step 206, the redistribution application 128 at the receiving partition 104 initiates a procedure to commit or store the changes to the database management hard disk 130. In one embodiment, the control information 138 is first written to the file system hard disk 132. This ensures that the control information needed for the undo operation has been written in the event of a failure. Once the control information has been written out, there is no need for caching at the receiving partition 104. As such, the redistribution application 128 may immediately begin writing the changed pages to disk. In one embodiment, before the sending partition 102 sends data, the sending partition needs to update the control information 118 and ensure that the control information 118 is written to the file system hard disk 112.
  • In step 206, the redistribution application 128 at the receiving partition 104 stores the changes on the database management hard disk 130 at the receiving partition 104. In one embodiment, the receiving partition 104 may receive and process data from multiple consistency points (e.g., current consistency points being processed on different sending partitions) in parallel so that the logging scheme is minimized and does not impede concurrency or performance. In step 208, the redistribution application 128 at the receiving partition 104 notifies the redistribution application 108 at the sending partition 102 that the receiving partition 104 has successfully stored the changes on the database management hard disk 130 at the receiving partition 104. In step 210, in response to the notification, the redistribution application 108 stores the changes on the database management hard disk 110 via the memory cache 120 at the sending partition 102. The redistribution application 108 may then empty the memory cache 120.
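A minimal sender-side sketch of steps 204 through 210, assuming a simple socket transport, pickle serialization, and a fixed acknowledgement token; none of these transport details are specified by the patent, and the cp object follows the illustrative ConsistencyPoint shape from the earlier sketch.

```python
import pickle
import socket

def send_consistency_point(cp, sock: socket.socket, cache: dict, db_disk_path: str) -> None:
    # Step 204: keep the dirty pages only in the sending partition's memory cache.
    cache[cp.cp_id] = cp.pages

    # Numbered circle 2: send the pages and control information over the network.
    sock.sendall(pickle.dumps(cp))

    # Steps 206-208 happen at the receiving partition; block here until it
    # confirms that the changes are on its database management hard disk.
    ack = sock.recv(16)
    if ack != b"CP_STORED":
        raise RuntimeError("receiving partition did not confirm the commit")

    # Step 210: only now write the cached pages to the sender's own hard disk,
    # then release the cache space.
    with open(db_disk_path, "ab") as disk:
        for page in cache[cp.cp_id]:
            disk.write(page.data)
    del cache[cp.cp_id]
```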
  • In one embodiment, there is one control file per consistency point. Once the consistency point is complete (e.g., all of the changes to the data in the consistency point are written to the database management hard disk 110 at the sending partition 102 and to the database management hard disk 130 at the receiving partition 104), the corresponding control files with the control information 118 and 138 may be removed so that the control information does not accumulate, thus freeing up memory space.
  • In one embodiment, the data redistribution operation is restartable in that, if the operation fails due to a problem such as a resource error (e.g., not enough hard disk space, memory, log space, etc.), the redistribution application 108 may simply restart the data redistribution process where it left off without needing to restore the database from the beginning. For example, the redistribution application 108 may restart at the batch after the batch that was last successfully stored at the receiving partition 104. In one embodiment, for each data page being processed, the redistribution application stores a relatively small amount of control information in a control file (CF) at both the sending and receiving partitions 102 and 104 during runtime. In one embodiment, the control information includes the page ID and the number of rows changed in each page. The page ID and the number of rows changed in each page provide a reference point that the redistribution application 108 may use to restart the data redistribution process where it left off before the failure. As such, there is no need to store user data in the control file. Storing only the control information in the control file minimizes the amount of information saved on both the sending and receiving partitions 102 and 104, while still being able to recover from an unplanned failure.
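A sketch of the per-consistency-point control file described above, assuming a simple CSV layout with one (page ID, rows changed) record per line; the on-disk format and function names are assumptions, and only the recorded fields come from the description.

```python
import csv
import os

def write_control_file(cf_path: str, cp) -> None:
    # One control file per consistency point: only (page ID, rows changed)
    # pairs are recorded, never the row data itself.
    with open(cf_path, "w", newline="") as f:
        writer = csv.writer(f)
        for rec in cp.control:
            writer.writerow([rec.page_id, rec.rows_changed])
        f.flush()
        os.fsync(f.fileno())  # make the redo/undo reference point durable

def read_restart_point(cf_path: str) -> list:
    """After a failure, the surviving control file identifies exactly which
    pages the interrupted batch covered, so the sender can redo and the
    receiver can undo that batch without replaying the whole operation."""
    with open(cf_path, newline="") as f:
        return [(int(page_id), int(rows)) for page_id, rows in csv.reader(f)]
```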
  • FIG. 3 is a block diagram of a redistribution system 100 in accordance with one embodiment of the present invention. The redistribution system 100 of FIG. 3 is the same as that of FIG. 1, except that the redistribution system 100 of FIG. 3 includes second consistency points 150 and 160, pages 152 and 162, and control information 154 and 164. As FIG. 3 shows, in one embodiment, the redistribution application 108 simultaneously executes two sequences of events. As described in more detail below, the numbered triangles 1-6 indicate how a consistency point is committed (i.e. stored to hard disks). In one embodiment, the numbered circles 1-3 indicate the same sequence of events shown in FIG. 1 above.
  • While the consistency point 114 is being committed, the redistribution application 108 will continue to operate on the second consistency point 150. In one embodiment, when the redistribution application 108 first commits consistency point 114, the control information 118 is stored in a control file on the file system hard disk 112 via the memory cache 120 (numbered triangle 1). The redistribution application 108 then sends a message containing the changes over the network to the receiving partition 104 (numbered triangle 2). Upon receiving the message, the redistribution application 128 at the receiving partition 104 begins writing the control information 138 and then the pages 136 to the file system hard disk 132 at the receiving partition 104 (numbered triangle 3). In one embodiment, the redistribution application 108 does not write pages 116 to the database management hard disk 110 at the sending partition for a particular consistency point until after the redistribution application 128 writes the corresponding control information 138 and pages 136 to the database management hard disk 130 at the receiving partition 104. As such, if an error occurs, the redistribution application 108 may reprocess that consistency point at the sending partition. In one embodiment, after the redistribution application 128 at the receiving partition 104 writes the pages 136 to the file system hard disk 132 at the receiving partition 104, the redistribution application 128 writes the corresponding control information to the database management hard disk 130 (numbered triangle 4). The redistribution application 128 at the receiving partition 104 then sends back a notification to the redistribution application 108 at the sending partition 102 to indicate that the redistribution application 128 has successfully stored the changes on the database management hard disk 130 at the receiving partition 104 (numbered triangle 5). In response to the notification, the redistribution application 108 then writes its pages 116 to the database management hard disk 110, which completes the processing of consistency point 114.
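A receiving-side sketch of the commit ordering marked by triangles 3 through 5: control information is forced to the file system hard disk before the pages, the control information is then recorded on the database management hard disk, and only afterwards is the sending partition notified. The transport, serialization, file layout, and acknowledgement token are assumptions, not the patent's implementation.

```python
import os
import pickle
import socket

def fsync_write(path: str, payload: bytes) -> None:
    # Helper: write the bytes and force them to stable storage before returning.
    with open(path, "wb") as f:
        f.write(payload)
        f.flush()
        os.fsync(f.fileno())

def commit_consistency_point(sock: socket.socket, fs_disk_dir: str, db_disk_path: str) -> None:
    # Circle 3: receive the batch (single recv for brevity; real code would frame messages).
    cp = pickle.loads(sock.recv(1 << 20))

    # Triangle 3: control information first, then the pages, onto the file system
    # hard disk, so the undo information is durable before any page data lands.
    fsync_write(os.path.join(fs_disk_dir, f"cp_{cp.cp_id}.cf"), pickle.dumps(cp.control))
    fsync_write(os.path.join(fs_disk_dir, f"cp_{cp.cp_id}.pages"),
                b"".join(page.data for page in cp.pages))

    # Triangle 4: record the corresponding control information on the database
    # management hard disk.
    with open(db_disk_path, "ab") as db:
        db.write(pickle.dumps(cp.control))
        db.flush()
        os.fsync(db.fileno())

    # Triangle 5: notify the sending partition, which may now write its own
    # cached pages to disk.
    sock.sendall(b"CP_STORED")
```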
  • In one embodiment, Table 1 shows example content in a status table. On either the sending or receiving partition 102 and 104, each consistency point may be in either a “Done” state or a “Not Done” state. Table 1 shows the corresponding recovery actions for all combinations:
  • TABLE 1
    Sending Partition   Receiving Partition   Sending Partition   Receiving Partition
    CP state            CP state              Recovery Action     Recovery Action
    Not Done            Not Done              Redo CP             Undo CP
    Not Done            Done                  Redo CP             Nothing
    Done                Not Done              Error               Error
    Done                Done                  Nothing             Nothing
  • In one embodiment, after a failure, the redistribution applications 108 and 128 only need to perform a redo operation at the sending partition 102 and an undo operation at the receiving partition 104 for any of the above-described processing steps associated with any batches of changes that were not successfully stored on the hard disk of the receiving partition 104 before the failure occurred. The information stored in the control file for each consistency point will then be used to redo the operations on the sending partition and undo the operations on the receiving partition.
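The recovery choice encoded by Table 1 can be read as a small decision function; the sketch below is illustrative only, and the action strings are placeholders rather than an API of any particular database product.

```python
def recovery_actions(sender_done: bool, receiver_done: bool) -> tuple:
    """Map the consistency point's Done/Not Done state on each partition to
    the (sending, receiving) recovery actions of Table 1."""
    if not sender_done and not receiver_done:
        return ("redo CP", "undo CP")
    if not sender_done and receiver_done:
        return ("redo CP", "nothing")
    if sender_done and not receiver_done:
        # The sender only marks a CP done after the receiver has; reaching
        # this state indicates corruption and is reported as an error.
        return ("error", "error")
    return ("nothing", "nothing")
```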
  • According to the system and method disclosed herein, the present invention provides numerous benefits. For example, embodiments of the present invention avoid unnecessary logging by not having to log the actual content of each row being moved, thereby saving storage space. As such, the sending partition stores enough information to redo the consistency point in case of a failure. Also, the receiving partition stores enough information to undo the consistency point in case of a failure. The value of “minimal logging” is that it minimizes the log space requirement so that the system need not reserve inordinate amounts of log space for the operation. At the same time, minimal logging allows the redistribution operation to be restartable. Minimal logging also helps to increase performance because fewer system resources are consumed by logging operations.
  • A method and system for redistributing data has been disclosed. The method includes dividing data that is to be redistributed into batches, where each batch includes pages of changes to memory and corresponding control information. The batches are stored in a cache at a sending partition until the batches are stored in a hard disk at a receiving partition. After the batches are successfully stored in the hard disk at the receiving partition, the batches are stored in a hard disk at the sending partition. As a result, unnecessary logging is avoided and required memory space is minimized.
  • The present invention has been described in accordance with the embodiments shown. One of ordinary skill in the art will readily recognize that there could be variations to the embodiments, and that any variations would be within the spirit and scope of the present invention. For example, embodiments of the present invention may be implemented using hardware, software, a computer-readable medium containing program instructions, or a combination thereof. Software written according to the present invention or results of the present invention may be stored in some form of computer-readable medium such as memory, hard disk, CD-ROM, DVD, or other media for subsequent purposes such as being executed or processed by a processor, being displayed to a user, etc. Also, software written according to the present invention or results of the present invention may be transmitted in a signal over a network. In some embodiments, a computer-readable medium may include a computer-readable signal that may be transmitted over a network. Accordingly, many modifications may be made by one of ordinary skill in the art without departing from the spirit and scope of the appended claims.

Claims (5)

1. A method comprising:
dividing data into batches at a sending partition, wherein the data is to be redistributed to a receiving partition, wherein each batch comprises a plurality of first pages and first control information, wherein the plurality of first pages includes changes to a memory, and wherein the control information is used to restart a distribution process in the event of a failure;
populating a first data structure with the first pages and the first control information;
storing the first data structure in a cache at the sending partition, wherein the changes are not stored to a first hard disk at the sending partition until after the changes are successfully stored at the receiving partition;
sending the changes over the network to the receiving partition, wherein the receiving partition populates a second data structure with second pages and second control information, where the plurality of second pages includes the changes, and wherein the changes are subsequently stored in a second hard disk at the receiving partition;
receiving a notification that the changes have been successfully stored in the second hard disk at the receiving partition; and
storing, in response to the notification, the changes on the first hard disk at the sending partition.
2. The method of claim 1 further comprising removing the control information from the data structure to free up memory space.
3. The method of claim 1 further comprising:
restarting the populating, storing, and sending steps in response to a failure during redistribution of data; and
causing an undo operation at the receiving partition in response to the failure.
4. A system comprising:
a processor; and
a memory for storing an application; and
a first hard disk coupled to the processor, wherein the application is operable to cause the processor to:
divide data into batches at a sending partition, wherein the data is to be redistributed to a receiving partition, wherein each batch comprises a plurality of first pages and first control information, wherein the plurality of first pages includes changes to a memory, and wherein the control information is used to restart a distribution process in the event of a failure;
populate a first data structure with the first pages and the first control information;
store the first data structure in a cache at the sending partition, wherein the changes are not stored to the first hard disk at the sending partition until after the changes are successfully stored at the receiving partition;
send the changes over the network to the receiving partition, wherein the receiving partition populates a second data structure with second pages and second control information, where the plurality of second pages includes the changes, and wherein the changes are subsequently stored in a second hard disk at the receiving partition;
receive a notification that the changes have been successfully stored in the second hard disk at the receiving partition; and
store, in response to the notification, the changes on the first hard disk at the sending partition.
5. The system of claim 4 wherein the application is operable to cause the processor to:
restart the populating, storing, and sending steps in response to a failure during redistribution of data; and
cause an undo operation at the receiving partition in response to the failure.
US11/847,270 2007-08-29 2007-08-29 Data redistribution in shared nothing architecture Abandoned US20090063807A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
US11/847,270 US20090063807A1 (en) 2007-08-29 2007-08-29 Data redistribution in shared nothing architecture
US12/274,741 US9672244B2 (en) 2007-08-29 2008-11-20 Efficient undo-processing during data redistribution

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
US11/847,270 US20090063807A1 (en) 2007-08-29 2007-08-29 Data redistribution in shared nothing architecture

Publications (1)

Publication Number Publication Date
US20090063807A1 true US20090063807A1 (en) 2009-03-05

Family

ID=40409322

Family Applications (2)

Application Number Title Priority Date Filing Date
US11/847,270 Abandoned US20090063807A1 (en) 2007-08-29 2007-08-29 Data redistribution in shared nothing architecture
US12/274,741 Expired - Fee Related US9672244B2 (en) 2007-08-29 2008-11-20 Efficient undo-processing during data redistribution

Family Applications After (1)

Application Number Title Priority Date Filing Date
US12/274,741 Expired - Fee Related US9672244B2 (en) 2007-08-29 2008-11-20 Efficient undo-processing during data redistribution

Country Status (1)

Country Link
US (2) US20090063807A1 (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20100125555A1 (en) * 2007-08-29 2010-05-20 International Business Machines Corporation Efficient undo-processing during data redistribution
US20110295907A1 (en) * 2010-05-26 2011-12-01 Brian Hagenbuch Apparatus and Method for Expanding a Shared-Nothing System
EP2414928A4 (en) * 2009-03-31 2016-06-08 Emc Corp Data redistribution in data replication systems

Families Citing this family (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8332365B2 (en) * 2009-03-31 2012-12-11 Amazon Technologies, Inc. Cloning and recovery of data volumes
US9207984B2 (en) 2009-03-31 2015-12-08 Amazon Technologies, Inc. Monitoring and automatic scaling of data volumes
US8060792B2 (en) 2009-03-31 2011-11-15 Amazon Technologies, Inc. Monitoring and automated recovery of data instances
US8307003B1 (en) 2009-03-31 2012-11-06 Amazon Technologies, Inc. Self-service control environment
US9705888B2 (en) 2009-03-31 2017-07-11 Amazon Technologies, Inc. Managing security groups for data instances
US8713060B2 (en) 2009-03-31 2014-04-29 Amazon Technologies, Inc. Control service for relational data management
US9135283B2 (en) 2009-10-07 2015-09-15 Amazon Technologies, Inc. Self-service configuration for data environment
US8676753B2 (en) 2009-10-26 2014-03-18 Amazon Technologies, Inc. Monitoring of replicated data instances
US8335765B2 (en) * 2009-10-26 2012-12-18 Amazon Technologies, Inc. Provisioning and managing replicated data instances
US9678871B2 (en) 2013-03-28 2017-06-13 Hewlett Packard Enterprise Development Lp Data flush of group table
CN104937565B (en) 2013-03-28 2017-11-17 慧与发展有限责任合伙企业 Address realm transmission from first node to section point
US10780575B2 (en) * 2017-06-22 2020-09-22 Phd, Inc. Robot end effector cuff

Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US4945474A (en) * 1988-04-08 1990-07-31 International Business Machines Corporation Method for restoring a database after I/O error employing write-ahead logging protocols
US5692182A (en) * 1995-10-05 1997-11-25 International Business Machines Corporation Bufferpool coherency for identifying and retrieving versions of workfile data using a producing DBMS and a consuming DBMS
US5692174A (en) * 1995-10-05 1997-11-25 International Business Machines Corporation Query parallelism in a shared data DBMS system
US5892945A (en) * 1996-03-21 1999-04-06 Oracle Corporation Method and apparatus for distributing work granules among processes based on the location of data accessed in the work granules
US5909540A (en) * 1996-11-22 1999-06-01 Mangosoft Corporation System and method for providing highly available data storage using globally addressable memory
US20040030703A1 (en) * 2002-08-12 2004-02-12 International Business Machines Corporation Method, system, and program for merging log entries from multiple recovery log files
US20040047354A1 (en) * 2002-06-07 2004-03-11 Slater Alastair Michael Method of maintaining availability of requested network resources, method of data storage management, method of data storage management in a network, network of resource servers, network, resource management server, content management server, network of video servers, video server, software for controlling the distribution of network resources
US20060155938A1 (en) * 2005-01-12 2006-07-13 Fulcrum Microsystems, Inc. Shared-memory switch fabric architecture
US20060190243A1 (en) * 2005-02-24 2006-08-24 Sharon Barkai Method and apparatus for data management
US7107270B2 (en) * 1998-12-28 2006-09-12 Oracle International Corporation Partitioning ownership of a database among different database servers to control access to the database
US20060294312A1 (en) * 2004-05-27 2006-12-28 Silverbrook Research Pty Ltd Generation sequences

Family Cites Families (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
ATE148241T1 (en) * 1989-06-13 1997-02-15 Ibm Method for removing unconfirmed changes to stored data by a database management system
US5991772A (en) * 1997-10-31 1999-11-23 Oracle Corporation Method and apparatus for restoring a portion of a database
US6374264B1 (en) * 1998-09-04 2002-04-16 Lucent Technologies Inc. Method and apparatus for detecting and recovering from data corruption of a database via read prechecking and deferred maintenance of codewords
WO2000055735A1 (en) * 1999-03-15 2000-09-21 Powerquest Corporation Manipulation of computer volume segments
US20020093527A1 (en) * 2000-06-16 2002-07-18 Sherlock Kieran G. User interface for a security policy system and method
US6959401B2 (en) * 2001-09-04 2005-10-25 Microsoft Corporation Recovery guarantees for software components
US7047441B1 (en) * 2001-09-04 2006-05-16 Microsoft Corporation Recovery guarantees for general multi-tier applications
US8234517B2 (en) * 2003-08-01 2012-07-31 Oracle International Corporation Parallel recovery by non-failed nodes
WO2005024669A1 (en) * 2003-09-04 2005-03-17 Oracle International Corporation Self-managing database architecture
US7707189B2 (en) * 2004-10-05 2010-04-27 Microsoft Corporation Log management system and method
US20070162506A1 (en) * 2006-01-12 2007-07-12 International Business Machines Corporation Method and system for performing a redistribute transparently in a multi-node system
US20090063807A1 (en) * 2007-08-29 2009-03-05 International Business Machines Corporation Data redistribution in shared nothing architecture

Patent Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US4945474A (en) * 1988-04-08 1990-07-31 International Business Machines Corporation Method for restoring a database after I/O error employing write-ahead logging protocols
US5692182A (en) * 1995-10-05 1997-11-25 International Business Machines Corporation Bufferpool coherency for identifying and retrieving versions of workfile data using a producing DBMS and a consuming DBMS
US5692174A (en) * 1995-10-05 1997-11-25 International Business Machines Corporation Query parallelism in a shared data DBMS system
US5892945A (en) * 1996-03-21 1999-04-06 Oracle Corporation Method and apparatus for distributing work granules among processes based on the location of data accessed in the work granules
US6505227B1 (en) * 1996-03-21 2003-01-07 Oracle Corporation Method and apparatus for distributing work granules among processes based on the location of data accessed in the work granules
US5909540A (en) * 1996-11-22 1999-06-01 Mangosoft Corporation System and method for providing highly available data storage using globally addressable memory
US7107270B2 (en) * 1998-12-28 2006-09-12 Oracle International Corporation Partitioning ownership of a database among different database servers to control access to the database
US20040047354A1 (en) * 2002-06-07 2004-03-11 Slater Alastair Michael Method of maintaining availability of requested network resources, method of data storage management, method of data storage management in a network, network of resource servers, network, resource management server, content management server, network of video servers, video server, software for controlling the distribution of network resources
US20040030703A1 (en) * 2002-08-12 2004-02-12 International Business Machines Corporation Method, system, and program for merging log entries from multiple recovery log files
US20060294312A1 (en) * 2004-05-27 2006-12-28 Silverbrook Research Pty Ltd Generation sequences
US20060155938A1 (en) * 2005-01-12 2006-07-13 Fulcrum Microsystems, Inc. Shared-memory switch fabric architecture
US20060190243A1 (en) * 2005-02-24 2006-08-24 Sharon Barkai Method and apparatus for data management

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20100125555A1 (en) * 2007-08-29 2010-05-20 International Business Machines Corporation Efficient undo-processing during data redistribution
US9672244B2 (en) * 2007-08-29 2017-06-06 International Business Machines Corporation Efficient undo-processing during data redistribution
EP2414928A4 (en) * 2009-03-31 2016-06-08 Emc Corp Data redistribution in data replication systems
US20110295907A1 (en) * 2010-05-26 2011-12-01 Brian Hagenbuch Apparatus and Method for Expanding a Shared-Nothing System
US8768973B2 (en) * 2010-05-26 2014-07-01 Pivotal Software, Inc. Apparatus and method for expanding a shared-nothing system
US9323791B2 (en) 2010-05-26 2016-04-26 Pivotal Software, Inc. Apparatus and method for expanding a shared-nothing system

Also Published As

Publication number Publication date
US20100125555A1 (en) 2010-05-20
US9672244B2 (en) 2017-06-06

Similar Documents

Publication Publication Date Title
US20090063807A1 (en) Data redistribution in shared nothing architecture
US9092475B2 (en) Database log parallelization
US11086850B2 (en) Persisting of a low latency in-memory database
EP1915668B1 (en) Database fragment cloning and management
US9779128B2 (en) System and method for massively parallel processing database
US9069704B2 (en) Database log replay parallelization
US9183245B2 (en) Implicit group commit when writing database log entries
CN108509462B (en) Method and device for synchronizing activity transaction table
US9542279B2 (en) Shadow paging based log segment directory
US9146944B2 (en) Systems and methods for supporting transaction recovery based on a strict ordering of two-phase commit calls
CN107919977B (en) Online capacity expansion and online capacity reduction method and device based on Paxos protocol
KR101574451B1 (en) Imparting durability to a transactional memory system
JP6432805B2 (en) REDO logging for partitioned in-memory data sets
US10055445B2 (en) Transaction processing method and apparatus
US20230098190A1 (en) Data processing method, apparatus, device and medium based on distributed storage
US11176004B2 (en) Test continuous log replay
US20230137119A1 (en) Method for replaying log on data node, data node, and system
CN112214649B (en) Distributed transaction solution system of temporal graph database
CN107533474B (en) Transaction processing method and device
CN115617571A (en) Data backup method, device, system, equipment and storage medium
CN109726211B (en) Distributed time sequence database
US20230315713A1 (en) Operation request processing method, apparatus, device, readable storage medium, and system
US20140250326A1 (en) Method and system for load balancing a distributed database providing object-level management and recovery
US8521682B2 (en) Transfer of data from transactional data sources to partitioned databases in restartable environments
EP3377970B1 (en) Multi-version removal manager

Legal Events

Date Code Title Description
AS Assignment

Owner name: INTERNATIONAL BUSINESS MACHINES CORPORATION, NEW YORK

Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:COX, PHILIP SHAWN;LAU, LEO T.M.;SARDAR, ADIL MOHAMMAD;AND OTHERS;REEL/FRAME:019774/0140;SIGNING DATES FROM 20070828 TO 20070829

STCB Information on status: application discontinuation

Free format text: ABANDONED -- FAILURE TO RESPOND TO AN OFFICE ACTION