|Publication number||US20070061379 A1|
|Application number||US 11/221,752|
|Publication date||15 Mar 2007|
|Filing date||9 Sep 2005|
|Priority date||9 Sep 2005|
|Also published as||US20090106323, WO2007028248A1|
|Inventors||Frankie Wong, Xiong Yu, Elaine Wang|
|Original Assignee||Frankie Wong, Xiong Yu, Elaine Wang|
This invention relates generally to the sequencing and processing of transactions within a cluster of replicated databases.
Databases have become a core component of most computer application software. Typically, application software uses one or more databases as repositories of the data (content) that the application requires to function properly. The application's operational efficiency and availability depend greatly on the performance and availability of these databases, which can be measured by two metrics: (1) request response time; and (2) transaction throughput.
There are several techniques for improving application efficiency based on these two metrics: (1) vertical scale-up of the computer hardware supporting the application, achieved by adding to or replacing existing hardware with faster central processing units (CPUs), random access memory (RAM), disk adapters/controllers, and network; and (2) horizontal scale-out (clustering) of the computer hardware supporting the application, which refers to connecting additional computing hardware to the existing configuration over a fast network. Although both approaches can reduce request response time and increase transaction throughput, the scale-out approach can offer higher efficiency at lower cost, thus driving most new implementations toward a clustering architecture.
The clustering of applications can be achieved readily by running the application software on multiple, interconnected application servers that facilitate the execution of the application software and provide hardware redundancy for high availability, with the application software actively processing requests concurrently. However, current database clustering technologies cannot provide the same level of availability and redundancy in a similar active-active configuration. Consequently, database servers are primarily configured as active-standby, meaning that one of the computer systems in the cluster does not process application requests until a failover occurs. An active-standby configuration wastes system resources, extends the window of unavailability and increases the chance of data loss.
To cluster multiple database servers in an active-active configuration, one technical challenge is to resolve update conflicts. An update conflict refers to two or more database servers updating the same record in the databases that they manage. Since data in these databases must be consistent among them in order to scale out for performance and achieve high availability, the conflict must be resolved. Currently there are two different schemes of conflict resolution: (1) time-based resolution; and (2) location-based resolution. However, neither conflict resolution scheme can be enforced without some heuristic decision made through human intervention. It is not possible to determine these heuristic decision rules without a thorough understanding of the application software's business rules and their implications. Consequently, most clustered database configurations adopt the active-standby model, and fail to achieve high performance and availability at the same time. There is a need for a database management system that uses an active-active configuration and substantially reduces the possibility of update conflicts that may occur when two or more databases attempt to update a record at the same time.
The systems and methods disclosed herein provide a system for globally managing transaction requests to one or more database servers and to obviate or mitigate at least some of the above presented disadvantages.
Contrary to current database configurations, there is provided a system and method for receiving and tracking a plurality of transactions and distributing the transactions to at least two replication queues over a network. The system and method comprise a global queue for storing a number of the received transactions in a first predetermined order. The system and method also comprise a sequencer coupled to the global queue for creating a copy of each of the transactions for each of said at least two replication queues and for distributing in a second predetermined order each said copy to each of said at least two replication queues respectively, said copy containing one or more of the received transactions.
One aspect provided is a system for receiving and tracking a plurality of transactions and distributing the transactions to at least two replication queues over a network, the system comprising: a global queue for storing a number of the received transactions in a first predetermined order; and a sequencer coupled to the global queue for creating a copy of each of the transactions for each of said at least two replication queues and for distributing in a second predetermined order each said copy to each of said at least two replication queues respectively, said copy containing one or more of the received transactions.
A further aspect provided is a system for receiving a plurality of transactions from at least one application server, distributing the transactions to at least two replication queues and applying the transactions to a plurality of databases comprising: a director coupled to each of said at least one application server for capturing a plurality of database calls therefrom as the plurality of transactions; and a controller for receiving each of the plurality of transactions, the controller configured for storing the transactions within a global queue in a predetermined order, for generating a copy of each said transaction for each of said at least two replication queues, and for transmitting in the predetermined order each said copy to each of said at least two replication queues respectively.
A still further aspect provided is a method for receiving and tracking a plurality of transactions and distributing the transactions to at least two replication queues over a network, the method comprising: storing a number of the received transactions in a first predetermined order in a global queue; creating a copy of each of the transactions for each of said at least two replication queues; and distributing in a second predetermined order each said copy to each of said at least two replication queues respectively, said copy containing one or more of the received transactions.
A still further aspect provided is a system for receiving and tracking a plurality of transactions and distributing the transactions to at least two replication queues over a network, the system comprising: means for storing a number of the received transactions in a first predetermined order; and means for creating a copy of each of the transactions for each of said at least two replication queues and for distributing in a second predetermined order each said copy to each of said at least two replication queues respectively, said copy containing one or more of the received transactions.
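The global queue and sequencer recited in the aspects above can be pictured with a minimal Python sketch. The names `Sequencer`, `submit` and `distribute` are hypothetical and do not come from the specification; in-memory deques stand in for the global queue and the replication queues, and first-in, first-out order is assumed for both predetermined orders.

```python
import copy
from collections import deque


class Sequencer:
    """Hypothetical sketch: one global queue fanned out to N replication queues."""

    def __init__(self, replication_queue_count):
        self.global_queue = deque()
        self.replication_queues = [deque() for _ in range(replication_queue_count)]

    def submit(self, transaction):
        # Store received transactions in arrival order (assumed "first predetermined order").
        self.global_queue.append(transaction)

    def distribute(self):
        # Drain the global queue in order, giving every replication queue its own copy.
        while self.global_queue:
            transaction = self.global_queue.popleft()
            for queue in self.replication_queues:
                queue.append(copy.deepcopy(transaction))


sequencer = Sequencer(replication_queue_count=2)
for statement in ("INSERT A", "UPDATE B", "DELETE C"):
    sequencer.submit({"sql": statement})
sequencer.distribute()
```

After `distribute()` returns, each replication queue holds an independent copy of every transaction in the same order, so each consumer can drain its queue at its own pace.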
Exemplary embodiments of the invention will now be described in conjunction with the following drawings, by way of example only, in which:
A method and apparatus for sequencing transactions in a database cluster is described for use with computer programs or software applications whose functions are designed primarily to replicate update transactions to one or more databases such that data in these databases are approximately synchronized for read and write access.
Referring to FIGS. 1A and 1B, the transaction replicator 1 of the system 10 receives transaction requests from the application servers 7 and, using a controller 2, provides sequenced and replicated transactions to one or more replication servers 3, which apply the transactions to the databases 5. By providing sequencing of transactions in two or more tiered application architectures, the transaction replicator 1 helps to prevent the transaction requests from interfering with each other and facilitates the integrity of the databases 5. For example, a transaction refers to a single logical operation from a user application 7 and typically includes requests to read, insert, update and delete records within a predetermined database 5.
The transaction replicator 1 maintains the globally sequenced transactions in two different types of queues: the global TX queue 13 and one or more replication queues 15, equal in number to the database server 4 instances. These queues are created in computer memory with spill-over areas on disk, such as the global disk queue 14 and one or more replication disk queues 16. The disk queues serve a number of purposes, including persisting transactions to avoid transaction loss during failure of a component in the cluster, and acting as very large transaction storage (from gigabytes to terabytes) that computer memory cannot reasonably provide (typically less than 64 gigabytes). Further, the indoubt TX queue 17 is used only when indoubt transactions are detected after certain system failures. Transactions found in this queue have an unknown transaction state and require either human intervention or pre-programmed resolution methods to resolve.
For example, in the event of a temporary communication failure that results in a lost response from the global TX sequencer 12 to a transaction ID request, the application resends the request, which is then placed in the resent TX queue 18. Under this circumstance, the same transaction can appear two or more times in the global TX queue 13 under different transaction IDs, and the duplicated transactions are subsequently removed.
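The memory queue with a disk spill-over area described above can be sketched as follows. This is only an illustration under stated assumptions: `SpillQueue` is a hypothetical name, the "disk" is an ordinary in-memory deque standing in for a persistent file, and capacity is counted in entries rather than bytes.

```python
from collections import deque


class SpillQueue:
    """Hypothetical bounded memory queue that overflows to a stand-in disk queue."""

    def __init__(self, memory_capacity):
        self.memory_capacity = memory_capacity
        self.memory = deque()
        self.disk = deque()  # stand-in for a persistent disk queue

    def put(self, transaction):
        # Once anything has spilled, keep appending to disk so order is preserved.
        if self.disk or len(self.memory) >= self.memory_capacity:
            self.disk.append(transaction)
        else:
            self.memory.append(transaction)

    def get(self):
        # Memory always holds the oldest entries; refill it from disk when it drains.
        if not self.memory:
            while self.disk and len(self.memory) < self.memory_capacity:
                self.memory.append(self.disk.popleft())
        return self.memory.popleft() if self.memory else None


queue = SpillQueue(memory_capacity=2)
for tx_id in range(1, 6):
    queue.put(tx_id)
spilled = list(queue.disk)
drained = [queue.get() for _ in range(5)]
```

The design choice worth noting is that new entries go to disk whenever the disk area is non-empty, even if memory has free slots; otherwise a newer transaction could overtake an older spilled one.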
In normal operation, the controller 2 uses the global TX queue 13 to track the status of each of the input transactions and to send the committed transactions for replication in sequence.
Before accepting any new transactions in the global TX queue, the sequencer distributes the committed transactions from the global TX queue 13 to a first replication server 20 and a second (or more) replication server 23 for execution against the databases. As will be discussed, the transfer of the transactions to the replication servers can be triggered when at least one of the following two criteria is met: 1) a predetermined transfer time interval has elapsed; and 2) the total number of transactions within the global TX queue 13 has reached a predetermined threshold. However, each replication server 20, 23 has a respective replication queue 21, 24 and applies the sequenced transactions, obtained from the global queue 13, at its own rate.
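The two transfer criteria can be expressed as a small predicate. The function and parameter names below are hypothetical, and the use of "greater than or equal to" for both comparisons is an assumption; the specification only states that a transfer is triggered when at least one criterion is met.

```python
def should_transfer(queued_count, seconds_since_last_transfer,
                    queue_threshold, transfer_interval):
    """Return True when either hypothetical transfer criterion is met."""
    threshold_met = queued_count >= queue_threshold
    interval_elapsed = seconds_since_last_transfer >= transfer_interval
    return threshold_met or interval_elapsed


# A busy system trips the queue threshold long before the interval elapses.
busy = should_transfer(500, 0.2, queue_threshold=100, transfer_interval=5.0)
# A quiet system relies on the interval to flush a small number of transactions.
quiet = should_transfer(3, 6.0, queue_threshold=100, transfer_interval=5.0)
# Neither criterion is met, so the transactions stay in the global queue for now.
waiting = should_transfer(3, 1.0, queue_threshold=100, transfer_interval=5.0)
```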
For example, when a slower database server is unable to process the transactions at the rate the transactions are distributed by the controller 2, the transactions in the corresponding replication queue are spilled over to the replication disk queues.
The core functions of the controller 2 can be summarized as registering one or more directors 8 and associating them with their respective replication groups; controlling the replication servers' activities; maintaining the global TX queue 13 that holds all the update transactions sent from the directors 8; synchronizing the global TX queue 13 with the backup controller 9 (where applicable); managing all replication groups defined; distributing committed transactions to the replication servers 3; tracking the operational status of each database server 4 within a replication group; providing system status to a monitor 6; and recovering from various system failures.
The registry function of the controller 2 occurs when applications are enabled on a new application server 7 to access databases 5 in a replication group. Here, the director 8 on the new application server contacts the controller 2 and registers itself with the replication group. Advantageously, this provides dynamic provisioning of application servers to scale up system capacity on demand. The registration is performed on the first database call made by an application. Subsequently the director 8 communicates with the controller 2 for transaction and server status tracking.
The replication server control function allows the controller 2 to start the replication servers 3 and monitor their state. For example, when an administrator requests a pause in replication to a specific database 5, the controller instructs the replication server to stop applying transactions until an administrator or an automated process requests that it resume.
The replication group management function allows the controller 2 to manage one or more groups of databases 5 that require transaction synchronization and data consistency among them. The number of replication groups that can be managed and controlled by the controller 2 is dependent upon the processing power of the computer that the controller is operating on and the sum of the transaction rates of all the replication groups.
As an example, the first module 27 captures all JDBC calls 26, determines transaction type and boundary, and analyzes the SQL statements in the transaction. Once a transaction is determined to be an update transaction, the director 8 initiates a sequence of operations to track the progress of the transaction until it ends with a commit or rollback. Both DDL and DML are captured for replication to other databases in the same replication group.
The second module 28 collects a plurality of different statistical elements on transactions and SQL statements for analyzing application execution and performance characteristics. The statistics can be exported as a comma-delimited text file for importing into a spreadsheet.
In addition to intercepting and analyzing transactions and SQL statements, the director's third module 29 manages database connections for the applications 7. In the event that one of the databases 5 should fail, the director 8 reroutes transactions to one or more of the remaining databases. Whenever feasible, the director 8 also attempts to re-execute the transactions to minimize in-flight transaction loss. Accordingly, the director 8 has the ability to instruct the controller 2 as to which database 5 is the primary database for satisfying the request of the respective application 7.
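One simple way to decide whether a captured transaction is an update transaction is to inspect the leading keyword of each statement. The sketch below is an assumption for illustration only: the keyword list and the function name are hypothetical, and a production implementation would need a real SQL parser to handle comments, stored procedure calls and vendor-specific syntax.

```python
# Hypothetical set of leading keywords treated as DML or DDL writes.
UPDATE_KEYWORDS = {"INSERT", "UPDATE", "DELETE", "CREATE", "ALTER", "DROP"}


def is_update_transaction(statements):
    """Treat a transaction as an update if any statement starts with a write keyword."""
    for statement in statements:
        words = statement.strip().split(None, 1)
        if words and words[0].upper() in UPDATE_KEYWORDS:
            return True
    return False


read_only = is_update_transaction(["SELECT * FROM accounts"])
mixed = is_update_transaction(["select 1", "update accounts set balance = 0"])
ddl = is_update_transaction(["  create table audit (id int)"])
```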
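A comma-delimited export of this kind can be produced with Python's standard `csv` module. The column names and sample rows below are invented for illustration; the specification does not list which statistics are collected.

```python
import csv
import io


def export_statistics(rows):
    """Render hypothetical per-statement statistics as comma-delimited text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["statement", "executions", "total_ms"])
    for row in rows:
        writer.writerow([row["statement"], row["executions"], row["total_ms"]])
    return buffer.getvalue()


report = export_statistics([
    {"statement": "UPDATE accounts SET balance = ? WHERE id = ?", "executions": 12, "total_ms": 48},
    {"statement": "SELECT name, city FROM customers", "executions": 30, "total_ms": 21},
])
```

Using a CSV writer rather than joining strings by hand matters here because SQL text often contains commas, which the writer quotes so that a spreadsheet still sees three columns.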
Depending on a database's workload and the relative power settings of the database servers 4 in a replication group, the director 8 routes read transactions to the least busy database server 4 for processing. This also applies when a database server 4 failure has resulted in transaction redirection.
Similarly, if the replication of transactions to a database server 4 becomes too slow for any reason, such that the transactions start to build up and spill over to the replication disk queue 16, the director 8 redirects all the read transactions to the least busy database server 4. Once the disk queue becomes empty, the director 8 again allows read access to that database. Accordingly, the fill/usage status of the replication disk queues in the replication group can be obtained or otherwise received by the director 8 for use in managing the throughput rate of transactions applied to the respective databases 5.
For example, when the director 8 or the replication servers 3 fail to communicate with the database servers 4, they report the failure to the controller 2, which may then redistribute transactions or take other appropriate actions to allow continuous operation of the transaction replicator 1. When one of the database servers 4 cannot be accessed, the controller 2 instructs the replication server 3 to stop applying transactions to it and relays the database lock-down status to a monitor 6. The transactions start to accumulate within the queues until the database server 4 is repaired and the administrator or an automated process issues an instruction, via the monitor 6, to resume replication. The monitor 6 may also provide other predetermined administrative commands (for example: create database alias, update parameters, change workload balancing settings).
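The read-routing behaviour described in the two preceding paragraphs can be sketched as a selection function. The dictionary keys, the division of workload by relative power, and the `None` result when every server is lagging are all assumptions made for this illustration.

```python
def choose_read_server(servers):
    """Pick the least busy server whose replication disk queue is empty.

    Each server is a dict with hypothetical keys: name, active_requests,
    power (relative capacity) and disk_queue_length.
    """
    candidates = [s for s in servers if s["disk_queue_length"] == 0]
    if not candidates:
        return None
    # Normalize workload by relative power so a stronger server takes more reads.
    return min(candidates, key=lambda s: s["active_requests"] / s["power"])["name"]


servers = [
    {"name": "db1", "active_requests": 10, "power": 1.0, "disk_queue_length": 0},
    {"name": "db2", "active_requests": 12, "power": 2.0, "disk_queue_length": 0},
    {"name": "db3", "active_requests": 1, "power": 1.0, "disk_queue_length": 40},
]
chosen = choose_read_server(servers)
```

In this example `db3` has the fewest active requests but is excluded because its disk queue still holds unapplied transactions, and `db2` wins over `db1` once its higher power setting is taken into account.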
For example, the status of the controller 2 is shown, indicating whether the controller is functioning or a failure has occurred. In addition, the status of each of the replication servers 3 within a desired replication group is shown. A detailed description of the transaction rate, the number of transactions within each replication queue 15, and the number of transactions within each replication disk queue 16 is further shown. The monitor 6 further receives data regarding the databases 5 and displays the status of each database 5 and the number of committed transactions.
The administrator can analyze the above information and choose to manually reroute the transactions. For example, when many transactions are seen within the replication disk queue 16 of a particular replication server 3, or when the transaction rate of a replication server 3 is slow, the administrator may send output data in the form of a request 33 to distribute the transactions for a specified amount of time to a different database server within the replication group.
When the controller 2 is started, it initializes itself by reading from configuration and property files the parameters to be used in the current session 101. The global TX Queue 13, indoubt TX queue 17 and resent TX queue 18 shown in
The sequencer 12 examines the incoming transaction to determine whether it is a new transaction or one that has already been recorded in the global TX queue 102. For a new transaction, the sequencer 12 assigns a Transaction ID 103 and records the transaction together with this ID in the global TX queue 13. If the new transaction ID is generated as a result of a lost ID 104, the transaction and the ID are stored in the resent TX queue 109 for use in identifying duplicated transactions. The sequencer 12 checks the usage of the global TX queue 105 to determine whether the maximum number of transactions in memory has already been exceeded. The sequencer 12 stores the transaction ID in the global TX queue 13 if the memory is not full 106. Otherwise, the sequencer 12 stores the transaction ID in the global disk queue 107. The sequencer 12 then returns the ID to the application 108 and is ready to process another request from the application.
When a request from the application or application server 7 comes in with a transaction that has previously obtained a transaction ID and been recorded in the global TX queue 13, the sequencer 12 searches for and retrieves the entry from either the global TX queue 13 or the disk queue 110. If this transaction has been committed to the database 111, the entry's transaction status is set to “committed” 112 by the sequencer 12, indicating that this transaction is ready to be applied to the other databases 200. If the transaction has been rolled back 113, the entry's transaction status is marked “for deletion” 114 and, as will be described, subsequent processing 200 deletes the entry from the global TX queue. If the transaction failed with an indoubt status, the entry's transaction status is set to “indoubt” 115. An alert message is sent to indicate that database recovery may be required 116. Database access is suspended immediately 117 until the indoubt transaction is resolved manually 300 or automatically 400.
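The ID assignment path for a new transaction can be sketched as below. The class and attribute names are hypothetical, dictionaries stand in for the queues, the capacity is counted in entries, and the initial status of "committing" is an assumption drawn from the later description of unprocessed transactions.

```python
import itertools


class IdSequencer:
    """Hypothetical sketch of transaction ID assignment with memory/disk placement."""

    def __init__(self, max_in_memory):
        self.max_in_memory = max_in_memory
        self.next_id = itertools.count(1)
        self.global_queue = {}       # in-memory entries keyed by transaction ID
        self.global_disk_queue = {}  # stand-in for the global disk queue
        self.resent_queue = {}       # IDs issued again after a lost response

    def assign_id(self, transaction, resent=False):
        tx_id = next(self.next_id)
        entry = {"tx": transaction, "status": "committing"}  # assumed initial status
        if resent:
            # Remember re-issued IDs so duplicates can be identified later.
            self.resent_queue[tx_id] = transaction
        if len(self.global_queue) < self.max_in_memory:
            self.global_queue[tx_id] = entry
        else:
            self.global_disk_queue[tx_id] = entry
        return tx_id


seq = IdSequencer(max_in_memory=2)
ids = [seq.assign_id("tx-a"), seq.assign_id("tx-b"), seq.assign_id("tx-a", resent=True)]
```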
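The three outcomes described above map onto a small state update. In this sketch the outcome strings `"commit"`, `"rollback"` and `"indoubt"` and the function name are hypothetical; the status values are the ones named in the text. The return value signals whether database access should be suspended.

```python
def record_outcome(entry, outcome):
    """Update a global-queue entry; return True if database access must be suspended."""
    if outcome == "commit":
        entry["status"] = "committed"      # ready to apply to the other databases
    elif outcome == "rollback":
        entry["status"] = "for deletion"   # dropped by later queue processing
    elif outcome == "indoubt":
        entry["status"] = "indoubt"        # needs manual or automatic resolution
        return True
    else:
        raise ValueError(f"unknown outcome: {outcome!r}")
    return False


committed_entry = {"id": 1, "status": "committing"}
suspend_after_commit = record_outcome(committed_entry, "commit")
indoubt_entry = {"id": 2, "status": "committing"}
suspend_after_indoubt = record_outcome(indoubt_entry, "indoubt")
```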
For a system with sustained workload, the Q Threshold is the sole determining criterion for moving committed transactions to the replication queue 201. For a system with sporadic activities, both the Q Threshold and the transfer interval are used to make the transfer decision 201, 213. Transactions are transferred in batches to reduce communication overhead. When one or both criteria are met, the sequencer 12 prepares a batch of transactions to be moved from the global TX queue 13 to the replication queue 202. If the batch contains transactions, the sequencer 12 removes all the rolled back transactions from it because they are not to be applied to the other databases 204. The remaining transactions in the batch are sent to the replication queue for processing 205. If the batch does not contain any transaction 203, the sequencer 12 searches the global TX queue for any unprocessed transactions (status is committing) 206. Since transactions are executed in the same order in which they occur, unprocessed transactions typically arise when a previous transaction has not completed, thereby delaying the processing of subsequent transactions. A transaction that is being committed and has not yet returned its completion status is called a gating transaction. A transaction that is being committed and returns a status of unknown is called an indoubt transaction. Both types of transactions remain in the state of “committing” and block processing of subsequent committed transactions, resulting in the transaction batch being empty. The difference between a gating transaction and an indoubt transaction is that a gating transaction is transient, meaning that it will eventually become committed, unless there is a system failure that causes it to remain in the “gating state” indefinitely. Therefore, when the sequencer 12 finds unprocessed transactions 207, it must differentiate the two types of “committing” transactions 208.
For a gating transaction, the sequencer 12 sends out an alert 209 and enters the transaction recovery process 300. Otherwise, the sequencer 12 determines if the transaction is resent from the application 210, 211, and removes the resent transaction from the global TX queue 211. A resent transaction is a duplicated transaction in the global TX queue 13 that has not been moved to the replication queue 15. The sequencer 12 then enters a sleep state because there is no transaction to be processed at the time 214. The sleep process is executed in its own thread so that it does not stop process 200 from being executed at any time. It is a second entry point into the global queue size check at 201. When the sleep time is up, the sequencer 12 creates the transaction batch 202 for transfer to the replication queue 203, 204, 205.
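The batch preparation described above, in which a transaction still in the “committing” state blocks everything queued behind it, can be sketched as follows. The function name and the dictionary keys are hypothetical, and the global TX queue is modelled as a plain ordered list.

```python
def prepare_batch(global_queue):
    """Split an ordered list of entries into (batch, remaining).

    Entries are dicts with hypothetical keys "id" and "status".
    """
    batch = []
    index = 0
    for index, entry in enumerate(global_queue):
        if entry["status"] == "committing":
            break  # a gating or indoubt transaction blocks everything behind it
        if entry["status"] == "committed":
            batch.append(entry["id"])
        # entries marked "for deletion" (rolled back) are dropped, not replicated
    else:
        index = len(global_queue)  # no blocker found: the whole queue was consumed
    return batch, global_queue[index:]


queue = [
    {"id": 1, "status": "committed"},
    {"id": 2, "status": "for deletion"},
    {"id": 3, "status": "committed"},
    {"id": 4, "status": "committing"},
    {"id": 5, "status": "committed"},
]
batch, remaining = prepare_batch(queue)
```

Transaction 5 is already committed but stays in the global queue because transaction 4 has not finished; if the blocker sits at the head of the queue, the batch comes back empty, which is the case the text handles by searching for unprocessed transactions.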
If the process is entered because of new transactions 501, the batch of transactions is stored in the replication queue in memory 508, 509, or in the replication disk queue 511 if the memory queue is full. Replication disk queue capacity is determined by the amount of disk space available. If, for example, disk usage is above a predetermined threshold or the disk is full 510, an alert is sent 512 by the sequencer 12 and the database is marked unusable 513 because committed transactions can no longer be queued.
If the process is entered in an attempt to apply transactions in the replication queue to the databases, the replication server first determines whether there is any unprocessed transaction in the replication queue in memory 502. If the memory queue is empty but unprocessed transactions are found in the replication disk queue 503, they are moved from the disk queue to the memory queue in batches for execution 504, 505. Upon successful execution of all the transactions in the batch, they are removed from the replication queue by the replication server and another batch of transactions is processed 501. If there are transactions in the replication disk queue 16, the processing continues until the disk queue is empty, at which time the replication server 3 waits for more transactions from the global TX queue 501. During execution of the transactions in the replication queue 15, an error may occur, in which case the execution is retried; if the maximum number of retries is exceeded 507, an alert is sent 512 and the database is marked unusable 513. However, even though a database is marked unusable, the system continues to serve the application requests. The marked database is inaccessible until the error condition is resolved. The replication server 3 stops when it is instructed to do so by the sequencer during the apparatus shutdown process 118, 119 and 120.
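The retry behaviour of the replication server can be illustrated with a short sketch. Everything here is hypothetical: `execute` is a caller-supplied function that raises on failure, and the return value pairs the transactions that were applied with a flag saying whether the database is still considered usable.

```python
def apply_batch(batch, execute, max_retries):
    """Apply transactions in order; return (applied, usable)."""
    applied = []
    for transaction in batch:
        attempts = 0
        while True:
            try:
                execute(transaction)
                applied.append(transaction)
                break
            except Exception:
                attempts += 1
                if attempts > max_retries:
                    # Retries exhausted: stop and report the database as unusable.
                    return applied, False
    return applied, True


# Number of times each transaction fails before it would succeed (test fixture).
failures = {"tx2": 1, "tx3": 5}


def flaky_execute(transaction):
    if failures.get(transaction, 0) > 0:
        failures[transaction] -= 1
        raise RuntimeError("simulated database error")


result = apply_batch(["tx1", "tx2", "tx3", "tx4"], flaky_execute, max_retries=2)
```

In this run `tx2` succeeds on its first retry, while `tx3` keeps failing past the retry limit, so processing stops there and `tx4` is never attempted; this keeps transactions from being applied out of order.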
It will be evident to those skilled in the art that the system 10 and its corresponding components can take many forms, and that such forms are within the scope of the invention as claimed. For example, the transaction replicator 1 can be configured as a plurality of transaction replicators 1 in a replicator peer-to-peer (P2P) network, in which each database server 4 is assigned or otherwise coupled to at least one principal transaction replicator 1. The distributed nature of the replicator P2P network can increase robustness in case of failure by replicating data over multiple peers (i.e. transaction replicators 1), and by enabling peers to find/store the data of the transactions without relying on a centralized index server. In the latter case, there may be no single point of failure in the system 10 when using the replicator P2P network. For example, the application or application servers 7 can communicate with a selected one of the database servers 4, such that the transaction replicators 1 of the replicator P2P network would communicate with one another for load balancing and/or failure mode purposes. One example would be one application server 7 sending the transaction request to one of the transaction replicators 1, which would then send the transaction request to another of the transaction replicators 1 of the replicator P2P network, which in turn would replicate and then communicate the replicated copies of the transactions to the respective database servers 4.
Further, it is recognized that the applications/application servers 7 could be configured in an application P2P network such that two or more application computers could share their resources such as storage hard drives, CD-ROM drives, and printers. Resources would then be accessible from every computer on the application P2P network. Because P2P computers have their own hard drives that are accessible by all computers, each computer can act as both a client and a server in the application P2P networks (e.g. both as an application 7 and as a database 4). P2P networks are typically used for connecting nodes via largely ad hoc connections. Such P2P networks are useful for many purposes, such as but not limited to sharing content files containing audio, video, data or anything in digital format; realtime data, such as telephony traffic, is also passed using P2P technology. The term “P2P network” can also mean grid computing. A pure P2P file transfer network does not have the notion of clients or servers, but only equal peer nodes that simultaneously function as both “clients” and “servers” to the other nodes on the network. This model of network arrangement differs from the client-server model where communication is usually to and from a central server or controller. It is recognized that there are three major types of P2P network, by way of example only, namely:
In view of the above, the spirit and scope of the appended claims should not be limited to the examples or the description of the preferred versions contained herein.
|U.S. Classification||1/1, 707/E17.007, 707/E17.032, 707/999.201|
|Cooperative Classification||G06F11/2097, G06F17/30575|
|European Classification||G06F17/30S7, G06F17/30C|
|4 May 2006||AS||Assignment||Owner name: AVOKIA INC., CANADA; Free format text: REMOVE 11/211752;ASSIGNORS:WONG, FRANKIE;YU, XIONG;WANG, ELAINE;REEL/FRAME:017583/0030; Effective date: 20051028|
|21 Dec 2009||AS||Assignment||Owner name: OPEN INVENTION NETWORK, LLC, NORTH CAROLINA; Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNOR:AVOKIA, INC.;REEL/FRAME:023679/0208; Effective date: 20090930|