CN104794114A - Data processing method and device - Google Patents


Info

Publication number
CN104794114A
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201410020530.2A
Other languages
Chinese (zh)
Other versions
CN104794114B (en
Inventor
黄晓锋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Alibaba Group Holding Ltd
Original Assignee
Alibaba Group Holding Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Application filed by Alibaba Group Holding Ltd filed Critical Alibaba Group Holding Ltd
Priority to CN201410020530.2A priority Critical patent/CN104794114B/en
Publication of CN104794114A publication Critical patent/CN104794114A/en
Application granted granted Critical
Publication of CN104794114B publication Critical patent/CN104794114B/en

Abstract

The invention discloses a data processing method and device that address a problem in the prior art: when a server restarts and re-reads streaming data, network resources are wasted and the server may fail again. In the method, a distributed server performs first processing on received streaming data, adds its own server identifier to the resulting first processing data, and sends the data to a database for storage. After the distributed server restarts, it reads from the database the first processing data that carries its server identifier. As a result, a distributed server that fails and restarts does not need to read all of the first processing data from the database; it reads only the first processing data carrying its own identifier. This effectively reduces the amount of data the distributed server reads from the database, saves network resources, and avoids the situation in which reading too much data overflows the server's memory and causes it to fail again.

Description

Data processing method and device
Technical field
The present application relates to the field of computer technology, and in particular to a data processing method and device.
Background
Streaming data is an ordered sequence of data items that arrive in large volume, rapidly, and continuously. In general, streaming data can be regarded as a dynamic data set that persists over time and grows without bound. Because the volume of streaming data is huge, distributed servers are usually used to process it.
At present, distributed servers generally process streaming data as follows: a distributed server receives the streaming data sent by a target device, processes it, and then sends both the received streaming data and the result data obtained from processing to a database for storage.
The main kinds of processing applied to streaming data are deduplication and accumulation. For example, to count the daily number of buyers under each commodity category of an e-commerce website, a distributed server receives in real time the new order data (that is, the streaming data) that the website sends for a commodity category, deduplicates the order data that carries the same buyer identifier, adds the number of records remaining after deduplication to the result data, and finally sends the deduplicated data and the result data to the database for storage.
Suppose a distributed server receives the following streaming data: user A purchased commodity 1 under a commodity category, user B purchased commodity 2 under the same category, and user A purchased commodity 2 under the same category. Deduplicating these three records yields user A and user B. Adding these two deduplicated records to the result data gives the number of buyers under this commodity category.
As this shows, when a distributed server deduplicates streaming data, it must check whether the information of a specified type (such as the user identifier above) in the currently received streaming data matches that in previously received streaming data. The distributed server therefore has to keep the streaming data and the result data in its own memory throughout the processing.
In this process, the distributed servers are managed centrally by a common control center. Each distributed server sends heartbeat messages to the common control center at a set interval. If the common control center finds that a distributed server has not sent a heartbeat message within a set time, it determines that the server has failed and makes it restart.
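The heartbeat check described above can be sketched as follows. This is only a minimal illustration, not the patent's implementation: the class name `ControlCenter`, the `timeout` parameter, and the use of explicit timestamps passed in by the caller are all assumptions made for the example.

```python
class ControlCenter:
    """Hypothetical control center that tracks the last heartbeat of each server."""

    def __init__(self, timeout):
        self.timeout = timeout   # allowed gap between heartbeats (assumed to be in seconds)
        self.last_seen = {}      # server identifier -> time of the last heartbeat

    def heartbeat(self, server_id, now):
        """Record a heartbeat message from a server."""
        self.last_seen[server_id] = now

    def failed_servers(self, now):
        """Return the servers whose last heartbeat is older than the timeout."""
        return sorted(s for s, t in self.last_seen.items() if now - t > self.timeout)


center = ControlCenter(timeout=10)
center.heartbeat("001", now=0)
center.heartbeat("002", now=0)
center.heartbeat("002", now=8)
print(center.failed_servers(now=15))  # server 001 has been silent for 15 > 10
```

In a real deployment the servers reported by `failed_servers` would be the ones the control center restarts.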
However, a restart clears all data in the distributed server's memory. Because the server processes subsequent streaming data on the basis of the streaming data and result data it processed earlier, after its memory is cleared it must read that previously processed streaming data and result data back from the database before it can process newly received streaming data. The database, however, stores the processed streaming data and result data sent by multiple distributed servers, and the restarted server cannot tell which of the stored streaming data and result data it sent itself, that is, which data its subsequent processing depends on. The restarted server can therefore only read the full set of streaming data and result data from the database. Because the database holds the streaming data sent by all distributed servers, not just by the one that failed, the amount of data the restarted server reads is very large. This not only wastes network resources but is also likely to overflow the server's memory and make it fail and restart again, trapping it in an endless restart loop in which it cannot process streaming data normally.
For example, suppose three distributed servers, server 1, server 2, and server 3, process the streaming data under three commodity categories, category a, category b, and category c, respectively, and each sends its deduplicated streaming data and the result data obtained from accumulation to the database for storage. When server 1 fails and restarts, it cannot tell which streaming data in the database it sent earlier, so it can only read the full set of streaming data. In other words, server 1 reads not only the streaming data it previously sent to the database but all the streaming data that the three servers sent. This is far more data than server 1 itself stored, so the transfer wastes network resources, and server 1 is also very likely to overflow its memory and fail and restart again.
Summary of the invention
The embodiments of the present application provide a data processing method and device to solve the prior-art problem that reading streaming data from the database after a server fails and restarts wastes network resources and is likely to make the server fail again.
A data processing method provided by an embodiment of the present application comprises:
a distributed server receiving streaming data;
performing first processing on the received streaming data, based on the preceding first processing data that the server holds and that was obtained by performing the first processing on previously received streaming data, to obtain first processing data;
adding the server identifier of the distributed server to the first processing data, and sending the first processing data with the server identifier added to a database for storage; and
after the distributed server restarts, reading from the database, according to its own server identifier, the first processing data that carries the server identifier, and continuing to perform the first processing on subsequently received streaming data based on the first processing data read.
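The four steps of the method can be sketched end to end as follows. This is a simplified illustration under stated assumptions: the shared database is replaced by an in-memory list, the first processing is taken to be deduplication of buyer identifiers, and the names `Database`, `Server`, `receive`, and `restart` are hypothetical.

```python
class Database:
    """In-memory stand-in for the shared database (an assumption for this sketch)."""

    def __init__(self):
        self.rows = []  # each row is (server identifier, buyer)

    def store(self, server_id, buyer):
        self.rows.append((server_id, buyer))

    def read_by_server(self, server_id):
        """Return only the rows tagged with the given server identifier."""
        return [buyer for sid, buyer in self.rows if sid == server_id]


class Server:
    def __init__(self, server_id, db):
        self.server_id = server_id
        self.db = db
        self.seen = set()  # preceding first processing data, held in memory

    def receive(self, buyer):
        # First processing: deduplicate against the preceding data.
        if buyer not in self.seen:
            self.seen.add(buyer)
            # Tag the new first processing data with this server's identifier.
            self.db.store(self.server_id, buyer)

    def restart(self):
        # Memory is lost on restart; reload only the rows carrying our identifier.
        self.seen = set(self.db.read_by_server(self.server_id))


db = Database()
s1, s2 = Server("001", db), Server("002", db)
for buyer in ["A", "B", "A"]:
    s1.receive(buyer)
s2.receive("C")
s1.restart()
s1.receive("A")  # still recognized as a duplicate after the restart
print(sorted(s1.seen))
```

After the restart, server 001 holds only its own buyers and never loads the row written by server 002.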
A data processing device provided by an embodiment of the present application comprises:
a receiving module, configured to receive streaming data;
a first processing module, configured to perform first processing on the received streaming data, based on the preceding first processing data that the device holds and that was obtained by performing the first processing on previously received streaming data, to obtain first processing data;
a first adding module, configured to add the server identifier of the device to the first processing data;
a sending module, configured to send the first processing data with the server identifier added to a database for storage; and
a first reading module, configured to read from the database, after the device restarts and according to the server identifier of the device, the first processing data that carries the server identifier;
wherein the first processing module is further configured to continue performing the first processing on subsequent streaming data received by the receiving module, based on the first processing data read.
The embodiments of the present application provide a data processing method and device. In the method, after a distributed server performs first processing on received streaming data to obtain first processing data, it adds its own server identifier to the first processing data and sends the data to a database for storage. After the distributed server restarts, it reads from the database the first processing data that carries this server identifier and, based on the data read, continues to perform the first processing on subsequently received streaming data. With this method, a distributed server that fails and restarts does not need to read the full set of first processing data from the database; it only needs to read the first processing data that carries its own server identifier. This effectively reduces the amount of data the distributed server reads from the database and so saves network resources, and the distributed server will not overflow its memory and fail again because it read too much data.
Brief description of the drawings
The drawings described here provide a further understanding of the present application and form part of it. The illustrative embodiments of the application and their descriptions are used to explain the application and do not unduly limit it. In the drawings:
Fig. 1 shows the data processing procedure provided by an embodiment of the present application;
Fig. 2 is a schematic structural diagram of the data processing device provided by an embodiment of the present application.
Detailed description
In the prior art, after a distributed server restarts, it cannot tell which streaming data stored in the database its subsequent processing depends on, so it can only read the full set of streaming data from the database into memory. This not only wastes network resources but also easily overflows the server's memory, which in turn makes the server fail and restart again. The embodiments of the present application aim to reduce the amount of data a distributed server must read from the database after restarting and to avoid the repeated restarts caused by excessive data volume. Specifically, after processing streaming data, the distributed server adds its own server identifier to the processed data before sending it to the database for storage. A distributed server that fails and restarts then only needs to read from the database the data carrying its own server identifier. This effectively reduces the amount of data read from the database, saves network resources, and ensures the server does not overflow its memory and fail again because it read too much data.
To make the objectives, technical solutions, and advantages of the present application clearer, the technical solutions are described clearly and completely below with reference to specific embodiments of the application and the corresponding drawings. The described embodiments are evidently only some, not all, of the embodiments of the application. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments in this application without creative effort fall within the scope of protection of the application.
Fig. 1 shows the data processing procedure provided by an embodiment of the present application, which comprises the following steps:
S101: A distributed server receives streaming data.
In the embodiments of the present application, a first processing dimension identifier can be preset for each distributed server in the distributed processing system, with one distributed server corresponding to one first processing dimension identifier. The first processing dimension identifier identifies the processing dimension of the first processing that the distributed server is to perform on streaming data, and it is carried in the streaming data. In other words, a distributed server that corresponds to a given first processing dimension identifier performs, on the streaming data it receives, the first processing that corresponds to that identifier. The distributed server then receives the streaming data that carries the first processing dimension identifier corresponding to it.
Specifically, the first processing dimension identifiers and their correspondence with the distributed servers can be stored in the target device that produces the streaming data. After producing streaming data, the target device adds the matching first processing dimension identifier to the streaming data according to the specific information the data carries, and then sends the tagged streaming data to the corresponding distributed server according to the correspondence between first processing dimension identifiers and distributed servers. The target device can do this in real time, adding the identifier and sending the data as soon as the streaming data is produced, or periodically, tagging and sending the streaming data produced within each period.
Alternatively, a distributor independent of the target devices and the distributed servers can be set up in advance, with the first processing dimension identifiers and their correspondence with the distributed servers stored in the distributor. Each target device then sends the streaming data it produces to the distributor, which adds the first processing dimension identifier and forwards the data to a distributed server. That is, the distributed server receives from the distributor the streaming data that carries its corresponding first processing dimension identifier. Specifically, after receiving streaming data from a target device, the distributor adds the matching first processing dimension identifier according to the specific information the data carries, and then sends the tagged data to the corresponding distributed server according to the correspondence between identifiers and servers. The distributor can tag and forward received streaming data in real time, or periodically tag and forward the streaming data received within each period.
The following description uses the case in which the distributor sends streaming data to the distributed servers. Suppose there are two kinds of first processing: the first deduplicates the buyers who purchased commodities under category a on the current day, and the second deduplicates the buyers who purchased commodities under category b on the current day. The first processing dimension identifier for the first kind can be preset to 011 and that for the second kind to 012, with 011 preset to correspond to server 1 and 012 to server 2. That is, the first processing performed by server 1 is deduplicating the current day's buyers under category a, and the first processing performed by server 2 is deduplicating the current day's buyers under category b. The preset identifiers 011 and 012 and their correspondence with the distributed servers are stored in the distributor.
After the distributor receives streaming data from a target device, if the specific information carried in the data includes the category information of category a and buyer information, the distributor adds the first processing dimension identifier 011 to the data and, according to the correspondence between 011 and server 1, sends the tagged data to server 1. If the specific information includes the category information of category b and buyer information, the distributor adds 012 to the data and, according to the correspondence between 012 and server 2, sends the tagged data to server 2. Server 1 and server 2 accordingly each receive streaming data from the distributor.
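The distributor's tagging and routing step might look like the sketch below. The identifiers 011 and 012 and the server assignments come from the example above; the record layout, the dictionary names, and the `route` function are assumptions made for illustration.

```python
# Hypothetical lookup tables held by the distributor (values from the example above).
CATEGORY_TO_DIMENSION = {"a": "011", "b": "012"}
DIMENSION_TO_SERVER = {"011": "server 1", "012": "server 2"}


def route(record):
    """Attach the first processing dimension identifier and choose the target server."""
    dimension_id = CATEGORY_TO_DIMENSION[record["category"]]
    tagged = dict(record, first_dimension_id=dimension_id)  # copy; input is not modified
    return DIMENSION_TO_SERVER[dimension_id], tagged


target, tagged = route({"buyer": "User A", "category": "a", "commodity": "Commodity 1"})
print(target, tagged["first_dimension_id"])
```

A record for category b would be tagged 012 and routed to server 2 in the same way.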
S102: Based on the preceding first processing data that the server holds, obtained by performing the first processing on previously received streaming data, the server performs the first processing on the received streaming data to obtain first processing data.
In the embodiments of the present application, the first processing includes deduplication, sorting, and the like. That is, the first processing described in this application is any data processing that must continue from the results of earlier first processing.
Continuing the example above, the first processing performed by server 1 is deduplicating the buyers under category a. Suppose server 1 receives the following streaming data from the distributor: user A purchased commodity 1 under category a, user B purchased commodity 2 under category a, and user A purchased commodity 2 under category a. Server 1 deduplicates these three records (the first processing), and the resulting first processing data is user A and user B.
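The dependence of the first processing on its own earlier results can be illustrated with a small deduplication function. The function name `deduplicate` and the extra buyer "User C" in the second batch are assumptions added for the example.

```python
def deduplicate(preceding, incoming):
    """Deduplicate incoming buyers against the preceding first processing data.

    Returns the updated set of known buyers and the buyers that are new,
    in arrival order.
    """
    updated = set(preceding)
    new_buyers = []
    for buyer in incoming:
        if buyer not in updated:
            updated.add(buyer)
            new_buyers.append(buyer)
    return updated, new_buyers


# The three records from the example: A and B remain after deduplication.
seen, first_batch = deduplicate(set(), ["User A", "User B", "User A"])
# A later batch is judged against what was seen before, not in isolation.
seen, second_batch = deduplicate(seen, ["User B", "User C"])
print(first_batch, second_batch)
```

Without the preceding data, "User B" in the second batch would wrongly be counted again, which is why the server must restore that data after a restart.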
S103: The server adds its own server identifier to the first processing data and sends the first processing data with the server identifier added to the database for storage.
In the embodiments of the present application, a server identifier can be set in advance for each distributed server in the distributed processing system, and each distributed server can store its own identifier. After a distributed server performs the first processing on streaming data in step S102 and obtains first processing data, it adds its own server identifier to that data.
Continuing the example, suppose the server identifier of server 1 is 001. After server 1 deduplicates the received streaming data (the first processing) and obtains the first processing data user A and user B, it adds its identifier 001 to them, giving the first processing data 001 + user A and 001 + user B.
Alternatively, the correspondence between distributed servers and server identifiers can be stored in the common control center of the distributed processing system, which sends each distributed server its server identifier when the server starts. For example, the common control center can store the correspondence between each distributed server's Internet Protocol (IP) address and its server identifier (such as 001), or between its Media Access Control (MAC) address and its server identifier, and on startup send the matching identifier according to the server's IP or MAC address. The server identifier can of course also simply be the server's IP address or MAC address, as long as it uniquely identifies the server.
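A lookup of this kind can be sketched in a few lines. The addresses in the table are made up, and combining the two options from the text into one function (a preset identifier where one exists, otherwise the address itself) is a design choice of this sketch rather than something the source prescribes.

```python
# Hypothetical table kept by the control center: IP address -> server identifier.
ID_BY_ADDRESS = {"10.0.0.1": "001", "10.0.0.2": "002"}


def server_id_for(address, table=ID_BY_ADDRESS):
    """Return the preset identifier for an address, or the address itself if none is preset."""
    return table.get(address, address)


print(server_id_for("10.0.0.1"))  # a preset identifier
print(server_id_for("10.0.0.9"))  # no entry, so the address serves as the identifier
```

Either result works as long as it uniquely identifies the server, which is the only requirement the text states.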
S104: After the distributed server restarts, it reads from the database, according to its own server identifier, the first processing data that carries this identifier, and performs the first processing on subsequently received streaming data based on the data read.
In the embodiments of the present application, the first processing data that a distributed server sends to the database carries the server's own identifier. If the server fails and restarts, it can therefore tell from its own identifier that the first processing data in the database carrying this identifier is exactly the data its subsequent processing depends on. It needs to read only that data from the database and can then perform the first processing on subsequently received streaming data based on it, without reading the full data set.
Continuing the example, after server 1 fails and restarts, it can read from the database the first processing data carrying 001, according to its own server identifier 001, and then continue the first processing on subsequently received streaming data based on the data read. Specifically, if the identifier 001 is stored in server 1 in advance, then after restarting, server 1 uses the stored identifier to fetch the corresponding first processing data from the database. If the identifier 001 is stored in the common control center in advance, then after restarting, server 1 first obtains the identifier from the common control center and then uses it to fetch the corresponding first processing data from the database.
As the example shows, after server 1 fails and restarts it needs to read from the database only the first processing data carrying its server identifier 001, and does not need to read the data that other distributed servers sent to the database. This effectively reduces the amount of data read from the database, saves the network resources consumed in transferring data from the database to the server, and effectively prevents the server from failing and restarting again because reading too much data overflowed its memory.
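If the database were relational, the difference between a full read and a read filtered by server identifier could be expressed as below. This sketch uses Python's built-in `sqlite3` module with an in-memory database; the table name, column names, and the rows for servers 002 and 003 are assumptions, since the source does not specify a schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE first_processing_data (server_id TEXT, user_id TEXT)")
# An index on server_id lets the filtered read avoid scanning every row.
conn.execute("CREATE INDEX idx_server ON first_processing_data (server_id)")
conn.executemany(
    "INSERT INTO first_processing_data VALUES (?, ?)",
    [("001", "User A"), ("001", "User B"), ("002", "User C"), ("003", "User D")],
)


def read_all(conn):
    """Prior-art behavior: read every row, whichever server wrote it."""
    return conn.execute("SELECT server_id, user_id FROM first_processing_data").fetchall()


def read_for_server(conn, server_id):
    """Read only the rows that carry the given server identifier."""
    rows = conn.execute(
        "SELECT user_id FROM first_processing_data WHERE server_id = ? ORDER BY user_id",
        (server_id,),
    )
    return [row[0] for row in rows]


print(len(read_all(conn)), read_for_server(conn, "001"))
```

Here the restarted server 001 transfers two rows instead of four; with many servers and large tables the saving grows accordingly.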
In the embodiments of the present application, after performing the first processing on received streaming data to obtain first processing data, the server can go on to perform second processing on the first processing data to obtain second processing data, and send the second processing data to the database for storage as well. The second processing includes accumulation, averaging, finding the maximum, finding the minimum, and the like. That is, the second processing described in this application is any data processing that continues from both the results of the first processing and the results of earlier second processing.
As with the first processing dimension identifier, a second processing dimension identifier can also be preset for each distributed server. The second processing dimension identifier identifies the processing dimension of the second processing. It can be added in the same ways as the first processing dimension identifier: the preset second processing dimension identifiers and their correspondence with the distributed servers are stored in the target device, which adds the identifier when it produces streaming data; or they are stored in the distributor, which adds the identifier to the streaming data sent by the target device. In the embodiments of the present application, the first processing data that a distributed server obtains by performing the first processing on streaming data still carries the second processing dimension identifier that the streaming data carried.
Alternatively, each second processing dimension identifier can be stored directly in its corresponding distributed server, according to the preset correspondence between identifiers and servers. After the distributed server performs the first processing on streaming data and obtains first processing data, it adds the second processing dimension identifier it stores to the first processing data.
In the embodiments of the present application, both the first and the second processing dimension identifiers can be set using Message-Digest Algorithm 5 (MD5). For example, suppose the first processing is deduplicating the buyers who purchased commodities under category a on the current day, and the current day is December 9, 2013. MD5 can then be applied to "December 9, 2013", "category a", and "deduplication" to set the first processing dimension identifier of this first processing to 011.
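One way to derive such an identifier with MD5 is sketched below using Python's standard `hashlib` module. How the source combines the three inputs is not stated, so joining them with "|" is an assumption, as is the function name `dimension_id`. Note also that MD5 yields a 128-bit digest (32 hexadecimal characters), so a short value like 011 in the text is best read as a placeholder for the real digest.

```python
import hashlib


def dimension_id(date, category, operation):
    """Derive a processing dimension identifier from the three describing fields."""
    key = "|".join((date, category, operation))  # separator is an assumption
    return hashlib.md5(key.encode("utf-8")).hexdigest()


first_id = dimension_id("2013-12-09", "category a", "deduplication")
print(first_id)
```

Because the digest is deterministic, the target device, the distributor, and the servers can each compute the same identifier independently from the same three fields.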
Because the first processing data obtained by the distributed server carries the second processing dimension identifier corresponding to the second processing, after the server performs the second processing (for example, accumulation) on the first processing data and obtains second processing data, it can add the same second processing dimension identifier to the second processing data and send the tagged second processing data to the database for storage.
Continuing the example, suppose there are two kinds of second processing: the first accumulates the deduplicated (first-processed) buyers who purchased commodities under category a on the current day, and the second accumulates the deduplicated buyers who purchased commodities under category b on the current day. The second processing dimension identifier for the first kind can be preset to 021 and that for the second kind to 022, with 021 preset to correspond to server 1 and 022 to server 2. That is, the second processing performed by server 1 is accumulating the deduplicated buyers who purchased commodities under category a on the current day, and the second processing performed by server 2 is accumulating the deduplicated buyers who purchased commodities under category b on the current day.
As this shows, the second processing dimension identifier in effect identifies what data a distributed server is ultimately to produce. In the example above, the data server 1 ultimately needs is the number of buyers who purchased commodities under category a on the current day, and that data is identified as 021.
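Accumulation keyed by the second processing dimension identifier can be sketched as follows. The function name `accumulate` and the starting total of 5 are assumptions chosen only to make the example concrete.

```python
def accumulate(totals, dimension_id, new_buyers):
    """Add the number of newly deduplicated buyers to the running total of a dimension."""
    updated = dict(totals)  # copy so the caller's totals are left unchanged
    updated[dimension_id] = updated.get(dimension_id, 0) + len(new_buyers)
    return updated


# Assume 5 buyers were already counted under identifier 021, then 2 new ones arrive.
totals = accumulate({"021": 5}, "021", ["User A", "User B"])
print(totals)
```

The running total is itself state that the second processing depends on, which is why the second processing data is also stored in the database under its identifier.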
A distributed server can of course add other information to the first processing data and/or the second processing data. For example, it can add to the first processing data the processing target identifier corresponding to the first processing. If the first processing is deduplication and the deduplication target is the buyer identifier, that is, records with the same buyer identifier count as a single record, then the processing target identifier corresponding to the first processing is the buyer identifier.
Suppose the distributor adds both the first and the second processing dimension identifiers to the streaming data. The preset first processing dimension identifier for server 1 is 011, meaning that the first processing server 1 performs is deduplicating the buyers who purchased commodities under category a on the current day. The preset second processing dimension identifier for server 1 is 021, meaning that the second processing server 1 performs is accumulating the deduplicated buyers who purchased commodities under category a on the current day. Suppose the streaming data received by the distributor is as shown in Table 1.
Buyer user ID | Commodity category | Commodity identifier
User A        | Category a         | Commodity 1
User B        | Category a         | Commodity 2
User A        | Category a         | Commodity 2
Table 1
Because the distributor determines that the received streaming data shown in Table 1 contains buyer user IDs and category a, it adds the first processing dimension identifier 011 and the second processing dimension identifier 021 to the streaming data shown in Table 1; the resulting streaming data is shown in Table 2.
Table 2
After obtaining the streaming data shown in Table 2, the distributor sends it to server 1 according to the correspondence between the first processing dimension identifier 011, or the second processing dimension identifier 021, and server 1.
After server 1 receives the streaming data carrying the first processing dimension identifier and the second processing dimension identifier, it deduplicates the received streaming data (the first processing) to obtain deduplicated data (the first processed data), and then adds its own server identifier 001 to the deduplicated data. The resulting deduplicated data is shown in Table 3.
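The tagging and routing performed by the distributor can be sketched as follows. This is only an illustrative sketch under stated assumptions, not the implementation of the present application: the names `DIMENSION_TO_SERVER`, `CATEGORY_TO_DIMENSIONS` and `tag_and_route`, and the dictionary layout of a record, are hypothetical.

```python
# Preset correspondence between processing dimension identifiers and servers,
# using the identifiers from the example (the table layout is an assumption).
DIMENSION_TO_SERVER = {"011": "server 1", "021": "server 1", "022": "server 2"}

# Hypothetical rule: which identifiers are attached to records of a category.
CATEGORY_TO_DIMENSIONS = {"category a": ("011", "021")}


def tag_and_route(record):
    """Attach both dimension identifiers and choose the target server."""
    first_dim, second_dim = CATEGORY_TO_DIMENSIONS[record["category"]]
    tagged = dict(record, first_dim=first_dim, second_dim=second_dim)
    # Either identifier can be used for routing; the first one is used here.
    return DIMENSION_TO_SERVER[first_dim], tagged


table_1 = [
    {"buyer": "User A", "category": "category a", "item": "Commodity 1"},
    {"buyer": "User B", "category": "category a", "item": "Commodity 2"},
    {"buyer": "User A", "category": "category a", "item": "Commodity 2"},
]
routed = [tag_and_route(record) for record in table_1]
```

In this sketch every record of Table 1 is routed to server 1, because identifiers 011 and 021 both map to that server.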
Server identifier | First processing dimension identifier | Second processing dimension identifier | User ID
001               | 011                                   | 021                                    | User A
001               | 011                                   | 021                                    | User B
Table 3
Based on the deduplicated data shown in Table 3, server 1 then accumulates the deduplicated data (the second processing). The resulting cumulative data (the second processed data) is 2+N, where N is the original number of buyers under category a, so 2+N means that 2 is added to the original number of buyers under category a. Then, according to the second processing dimension identifier 021 carried in the deduplicated data shown in Table 3, server 1 also adds identifier 021 to the cumulative data; the resulting cumulative data is shown in Table 4.
Second processing dimension identifier | Accumulated amount
021                                    | 2+N
Table 4
After obtaining the deduplicated data shown in Table 3 and the cumulative data shown in Table 4, server 1 sends both to the database for storage.
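The two processing steps of server 1 and the storage of their results can be sketched as follows. The sketch is illustrative only: the function names, the in-memory dictionary standing in for the database, and the value chosen for N are assumptions, since the application leaves N unspecified.

```python
SERVER_ID = "001"  # server identifier of server 1 in the example


def deduplicate(records, seen_buyers):
    """First processing: keep one row per buyer that has not been seen yet."""
    rows = []
    for rec in records:
        if rec["buyer"] in seen_buyers:
            continue
        seen_buyers.add(rec["buyer"])
        # The server identifier is added to every deduplicated row (Table 3).
        rows.append({
            "server_id": SERVER_ID,
            "first_dim": rec["first_dim"],
            "second_dim": rec["second_dim"],
            "buyer": rec["buyer"],
        })
    return rows


def accumulate(dedup_rows, second_dim, previous_total):
    """Second processing: add the newly counted buyers to the running total."""
    return {"second_dim": second_dim, "total": previous_total + len(dedup_rows)}


incoming = [
    {"buyer": buyer, "first_dim": "011", "second_dim": "021"}
    for buyer in ("User A", "User B", "User A")
]
N = 10  # assumed earlier buyer count, chosen only for illustration
dedup_rows = deduplicate(incoming, set())
cumulative = accumulate(dedup_rows, "021", N)

# Stand-in for the database: both results are stored for later recovery.
database = {"first": list(dedup_rows), "second": {"021": cumulative}}
```

With the assumed N of 10, the stored cumulative amount is 2+N, that is, 12.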
Subsequently, if the distributed server fails and restarts, then after reading from the database the first processed data carrying its own server identifier, it can read from the database the second processed data carrying the second processing dimension identifier that is carried in the first processed data it has read. Based on the first processed data read, it continues the first processing on subsequently received streaming data; then, based on the first subsequent processed data obtained by that first processing and on the second processed data read, it continues the second processing on the first subsequent processed data.
Continuing the example, after server 1 has sent the deduplicated data shown in Table 3 and the cumulative data shown in Table 4 to the database for storage, if it fails and restarts, it can read the deduplicated data shown in Table 3 from the database according to its own server identifier 001, and then, according to the second processing dimension identifier 021 carried in that deduplicated data, read from the database the cumulative data carrying identifier 021, namely the data shown in Table 4. Server 1 can then continue deduplicating subsequently received streaming data based on the deduplicated data read, and continue accumulating the resulting subsequent deduplicated data based on that data and the cumulative data read.
With the above method, when the distributed server reads first processed data from the database, it can read only the first processed data carrying its own server identifier, which reduces the amount of first processed data read. When it reads second processed data, it reads only the second processed data carrying the same second processing dimension identifier as the first processed data it has read, which likewise reduces the amount of second processed data read. The amount of data read from the database is thus further reduced, fewer network resources are spent on transmission, and the server is further protected from overflowing its memory and failing again because it read too much data.
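The recovery sequence can be sketched as follows. This is a minimal sketch under assumptions: the dictionary `database`, the functions `recover` and `process`, the rows belonging to a second server, and the stored totals are all hypothetical and chosen only to show that rows of other servers are not read.

```python
def recover(database, server_id):
    """Rebuild the in-memory state of one server after a restart."""
    # Read only the first processed data carrying this server's identifier.
    first_rows = [r for r in database["first"] if r["server_id"] == server_id]
    seen_buyers = {r["buyer"] for r in first_rows}
    # Read only the second processed data whose dimension identifier
    # is carried in the first processed data just read.
    wanted_dims = {r["second_dim"] for r in first_rows}
    totals = {dim: database["second"][dim]["total"] for dim in wanted_dims}
    return seen_buyers, totals


def process(buyers, second_dim, seen_buyers, totals):
    """Continue deduplication and accumulation on subsequent streaming data."""
    for buyer in buyers:
        if buyer not in seen_buyers:
            seen_buyers.add(buyer)
            totals[second_dim] += 1


database = {
    "first": [
        {"server_id": "001", "second_dim": "021", "buyer": "User A"},
        {"server_id": "001", "second_dim": "021", "buyer": "User B"},
        {"server_id": "002", "second_dim": "022", "buyer": "User X"},
    ],
    "second": {"021": {"total": 12}, "022": {"total": 7}},
}
seen, totals = recover(database, "001")
process(["User A", "User C"], "021", seen, totals)
```

After recovery, User A is recognised as a duplicate and only User C increases the cumulative amount for identifier 021.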
Further, to speed up fault recovery of the distributed server, in the embodiments of the present application the restarted distributed server may, once it has read from the database the first processed data carrying its own server identifier, directly continue the first processing on subsequently received streaming data, without waiting until it has read the second processed data from the database. That is, after the restart, the distributed server can continue the first processing on subsequently received streaming data as soon as it has read the first processed data carrying its own server identifier, and, while performing the first processing on that streaming data, read from the database the second processed data carrying the second processing dimension identifier carried in the first processed data read.
In addition, in practical scenarios the data processed by a distributed server generally belongs to a certain time period. For example, suppose that what server 1 in the example above needs to obtain is the number of buyers under category a on December 9, 2013. If, after restarting, server 1 read first processed data from the database according to its server identifier 001 alone, it would also read out the first processed data that it sent to the database earlier (on December 8, 2013, December 7, 2013, and so on). Therefore, to further reduce the amount of data read from the database after a restart, in the embodiments of the present application, after the distributed server performs the first processing on the received streaming data to obtain first processed data, it adds to the first processed data not only its own server identifier but also receiving time information indicating when the streaming data corresponding to the first processed data was received, and then sends the first processed data carrying the server identifier and the receiving time information to the database for storage. Correspondingly, when the restarted distributed server reads from the database the first processed data carrying its own server identifier, it can determine, according to the pre-divided time periods, the time period to which the current time belongs, and read from the database the first processed data that carries the server identifier and whose receiving time information falls within that time period.
Continuing the example, suppose server 1 receives the streaming data shown in Table 2 from the distributor at 9 a.m. on December 9, 2013. Server 1 can deduplicate the streaming data shown in Table 2 (the first processing) to obtain deduplicated data (the first processed data), and then add its own server identifier 001 and the receiving time information of that streaming data to the deduplicated data, obtaining the deduplicated data shown in Table 5.
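One way to overlap the two steps is sketched below. The application does not say how the overlap is achieved; the use of a background thread through `concurrent.futures.ThreadPoolExecutor`, and the names `read_second` and `resume`, are assumptions made for this illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def read_second(database, second_dim):
    """Stand-in for the (possibly slow) database read of cumulative data."""
    return database["second"][second_dim]["total"]


def resume(database, seen_buyers, second_dim, subsequent_buyers):
    with ThreadPoolExecutor(max_workers=1) as pool:
        # Start reading the second processed data in the background ...
        pending = pool.submit(read_second, database, second_dim)
        # ... and continue deduplication right away, without waiting for it.
        new_buyers = []
        for buyer in subsequent_buyers:
            if buyer not in seen_buyers:
                seen_buyers.add(buyer)
                new_buyers.append(buyer)
        # The cumulative value is needed only when accumulation starts.
        total = pending.result() + len(new_buyers)
    return new_buyers, total


database = {"second": {"021": {"total": 12}}}
new_buyers, total = resume(
    database, {"User A", "User B"}, "021", ["User B", "User C", "User C"]
)
```

The design point is that deduplication depends only on the first processed data, so it does not have to wait for the read of the cumulative data.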
Table 5
Next, server 1 accumulates the deduplicated data shown in Table 5 (the second processing) to obtain the cumulative data shown in Table 4 (the second processed data), and sends the deduplicated data shown in Table 5 and the cumulative data shown in Table 4 to the database for storage.
Suppose the pre-divided time periods are one per day, that is, 0:00 to 24:00 of each day is one time period. After server 1 restarts, if the current time is 10 a.m. on December 9, 2013, server 1 determines that the time period to which the current time belongs is 0:00 to 24:00 on December 9, 2013. It therefore reads from the database the first processed data that carries server identifier 001 and whose receiving time falls within that time period. The first processed data read is the deduplicated data shown in Table 5; the deduplicated data that server 1 sent to the database on December 8, 2013 or earlier is not read.
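The time-period filter can be sketched as follows, assuming one time period per day as in the example. The names `period_of` and `read_first`, the field `received_at`, and the rows for December 8 and for server 002 are hypothetical.

```python
from datetime import datetime


def period_of(moment):
    """One period per day: the period is identified by the calendar date."""
    return moment.date()


def read_first(rows, server_id, now):
    """Return rows of this server whose receiving time is in today's period."""
    current = period_of(now)
    return [
        r for r in rows
        if r["server_id"] == server_id and period_of(r["received_at"]) == current
    ]


stored = [
    {"server_id": "001", "buyer": "User A", "received_at": datetime(2013, 12, 9, 9, 0)},
    {"server_id": "001", "buyer": "User B", "received_at": datetime(2013, 12, 9, 9, 0)},
    {"server_id": "001", "buyer": "User Z", "received_at": datetime(2013, 12, 8, 15, 0)},
    {"server_id": "002", "buyer": "User X", "received_at": datetime(2013, 12, 9, 9, 0)},
]
recovered = read_first(stored, "001", datetime(2013, 12, 9, 10, 0))
```

Only the two rows of server 001 received on December 9, 2013 are returned; the row from December 8 and the row of the other server are skipped.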
Further, after the distributed server restarts, before reading from the database the second processed data carrying the second processing dimension identifier carried in the first processed data read, it can check whether its own memory already holds second processed data carrying that identifier. If so, there is no need to read the second processed data carrying that identifier from the database; if not, it reads that data from the database. The reason is as follows. If the distributed server has just restarted, its memory certainly does not hold second processed data carrying the identifier. After the first processing of the subsequently received streaming data yields the first subsequent processed data, the second processing must be performed on the first subsequent processed data based on that data and the second processed data carrying the identifier, so the second processed data must be read from the database. Once read, that second processed data is present in the memory of the distributed server, so when the server receives further streaming data it no longer needs to read from the database.
In addition, before reading from the database the second processed data carrying the second processing dimension identifier, the distributed server may also determine whether the time period to which the receiving time of the subsequently received streaming data belongs is the same as the time period to which the receiving time of the streaming data received before the restart belongs. If they are the same, it reads the second processed data carrying the identifier from the database; if they differ, there is no need to read it. The reason is as follows. If the receiving time of the streaming data received after the restart has crossed a boundary between time periods, that is, the time period of the streaming data received after the restart differs from the time period of the streaming data last received before the restart, the distributed server does not need to perform the second processing on the basis of the earlier second processed data and can simply perform the second processing afresh. In that case there is no need to read the second processed data from the database. For example, suppose the distributed server restarts at 23:59 on December 9, 2013, and the subsequent streaming data it receives after the restart arrives at 0:01 on December 10, 2013. Because the time periods are one per day and the data to be obtained is the number of buyers per day, the accumulation performed on the subsequently received streaming data clearly does not need to build on the cumulative data of December 9. Therefore, even though the memory of the distributed server then holds no second processed data carrying the second processing dimension identifier, the server does not need to read it from the database.
Of course, before reading from the database the second processed data carrying the second processing dimension identifier, the distributed server may perform both checks together: whether its own memory holds second processed data carrying the identifier, and whether the time period to which the receiving time of the subsequently received streaming data belongs is the same as that of the streaming data received before the restart. When the distributed server determines that its memory holds no second processed data carrying the identifier and that the two time periods are the same, it reads the second processed data carrying the identifier from the database. When it determines that its memory does hold second processed data carrying the identifier, or that the two time periods differ, there is no need to read the second processed data carrying the identifier from the database.
In addition, when the database stores the first processed data sent by distributed servers, it can store first processed data that carries the same server identifier and whose receiving times belong to the same time period at several contiguous storage locations. Specifically, using "server identifier + time period to which the receiving time belongs" as the key, first processed data with the same key can be stored at several contiguous storage locations. When a distributed server reads first processed data from the database, it can then read all the data at the storage locations whose key is "server identifier of this distributed server + time period to which the current time belongs".
In the embodiments of the present application, to further save the network resources consumed in transmitting data between the database and the distributed server, when the distributed server reads first processed data from the database, the database can compress the first processed data to be transmitted, so that the distributed server reads from the database first processed data that carries its server identifier and has been compressed by the database. Correspondingly, when the distributed server continues the first processing on subsequently received streaming data based on the first processed data read, it can decompress the first processed data read and perform the first processing on the subsequently received streaming data based on the decompressed first processed data.
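The combined check can be sketched as a single predicate. The function name `needs_database_read`, the dictionary used as the server's memory, and the concrete times other than those of the example are assumptions; one time period per day is assumed as in the example.

```python
from datetime import datetime


def needs_database_read(memory, second_dim, subsequent_time, last_time_before_restart):
    """Decide whether the second processed data must be read from the database."""
    # Already held in memory: no read is needed.
    if second_dim in memory:
        return False
    # A period boundary was crossed: accumulation starts afresh, no read needed.
    if subsequent_time.date() != last_time_before_restart.date():
        return False
    # Not in memory and still the same period: read it from the database.
    return True


same_day = needs_database_read(
    {}, "021", datetime(2013, 12, 9, 10, 0), datetime(2013, 12, 9, 9, 0)
)
crossed_midnight = needs_database_read(
    {}, "021", datetime(2013, 12, 10, 0, 1), datetime(2013, 12, 9, 23, 58)
)
already_loaded = needs_database_read(
    {"021": 12}, "021", datetime(2013, 12, 9, 10, 0), datetime(2013, 12, 9, 9, 0)
)
```

Only the first call reports that a database read is required; the other two cases correspond to the two conditions under which the read can be skipped.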
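The keying scheme can be sketched with an in-memory mapping. This is an illustration only: the mapping `store`, the helper names, and the textual key format are assumptions, and a real database would map each key to contiguous storage locations rather than to a Python list.

```python
from collections import defaultdict
from datetime import datetime

# Each key maps to the rows stored together for one server and one period.
store = defaultdict(list)


def make_key(server_id, moment):
    """Key = server identifier + time period (one period per day assumed)."""
    return f"{server_id}+{moment:%Y-%m-%d}"


def put(row):
    store[make_key(row["server_id"], row["received_at"])].append(row)


def get(server_id, now):
    """Read everything stored under this server's key for the current period."""
    return store.get(make_key(server_id, now), [])


put({"server_id": "001", "buyer": "User A", "received_at": datetime(2013, 12, 9, 9, 0)})
put({"server_id": "001", "buyer": "User B", "received_at": datetime(2013, 12, 9, 9, 0)})
put({"server_id": "001", "buyer": "User Z", "received_at": datetime(2013, 12, 8, 15, 0)})

today_rows = get("001", datetime(2013, 12, 9, 10, 0))
```

Because the key already combines server identifier and time period, recovery becomes a single lookup instead of a scan with two filter conditions.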
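The compression step can be sketched as follows. The application does not name a compression algorithm or a serialisation format; the use of `zlib` and JSON here, and the function names, are assumptions made for the illustration.

```python
import json
import zlib


def compress_rows(rows):
    """Database side: serialise and compress first processed data before sending."""
    return zlib.compress(json.dumps(rows).encode("utf-8"))


def decompress_rows(blob):
    """Server side: decompress and restore the first processed data."""
    return json.loads(zlib.decompress(blob).decode("utf-8"))


rows = [
    {"server_id": "001", "first_dim": "011", "second_dim": "021", "buyer": "User A"},
    {"server_id": "001", "first_dim": "011", "second_dim": "021", "buyer": "User B"},
]
blob = compress_rows(rows)
restored = decompress_rows(blob)
```

The server then continues the first processing on the restored rows exactly as it would on uncompressed data.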
The above is the data processing method provided by the embodiments of the present application. Based on the same idea, the present application also provides a data processing device, as shown in Fig. 2.
Fig. 2 is a schematic structural diagram of the data processing device provided by an embodiment of the present application, which specifically comprises:
A receiving module 201, configured to receive streaming data;
A first processing module 202, configured to perform first processing on the received streaming data, based on first preceding processed data that the device itself has saved and that was obtained by performing the first processing on previously received streaming data, to obtain first processed data;
A first adding module 203, configured to add the server identifier of the device to the first processed data;
A sending module 204, configured to send the first processed data to which the server identifier has been added to a database for storage;
A first reading module 205, configured to read from the database, after the device restarts and according to the server identifier of the device, the first processed data carrying the server identifier;
The first processing module 202 is further configured to continue the first processing on the subsequent streaming data received by the receiving module 201, based on the first processed data read.
The receiving module 201 is specifically configured to receive streaming data carrying a first processing dimension identifier corresponding to the device, wherein the first processing dimension identifier is the processing dimension identifier corresponding to the first processing that the device is to perform on the received streaming data, and the correspondence between the first processing dimension identifier and the device is preset.
The device further comprises:
A second processing module 206, configured to perform second processing on the first processed data to obtain second processed data, after the first processing module 202 has performed the first processing on the received streaming data to obtain the first processed data;
A second adding module 207, configured to add the second processing dimension identifier to the second processed data according to the second processing dimension identifier carried in the first processed data, wherein the second processing dimension identifier is the processing dimension identifier corresponding to the second processing;
The sending module 204 is further configured to send the second processed data to which the second processing dimension identifier has been added to the database for storage;
A second reading module 208, configured to read from the database, after the first processing module 202 has continued the first processing on the subsequently received streaming data based on the first processed data read, the second processed data carrying the second processing dimension identifier, according to the second processing dimension identifier carried in the first processed data that the first reading module 205 read from the database;
The second processing module 206 is further configured to perform the second processing on the first subsequent processed data, based on the first subsequent processed data obtained by the first processing module 202 performing the first processing on the subsequently received streaming data and on the second processed data read by the second reading module 208.
The first adding module 203 is further configured to add receiving time information to the first processed data, the receiving time information being the time at which the receiving module 201 received the streaming data corresponding to the first processed data;
The first reading module 205 is specifically configured to determine, according to the pre-divided time periods, the time period to which the current time belongs, and to read from the database the first processed data that carries the server identifier and whose receiving time information falls within the time period to which the current time belongs.
The second reading module 208 is specifically configured to: read from the database the second processed data carrying the second processing dimension identifier when the distributed server determines that its own memory holds no second processed data carrying the second processing dimension identifier; or read from the database the second processed data carrying the second processing dimension identifier when the time period to which the receiving time of the subsequent streaming data received by the distributed server belongs is the same as the time period to which the receiving time of the streaming data received before the restart belongs; or read from the database the second processed data carrying the second processing dimension identifier when the distributed server determines both that its own memory holds no second processed data carrying the second processing dimension identifier and that the time period to which the receiving time of the subsequent streaming data belongs is the same as the time period to which the receiving time of the streaming data received before the restart belongs.
Specifically, the above data processing device may be located in a server.
The embodiments of the present application provide a data processing method and device. In the method, after the distributed server performs first processing on received streaming data to obtain first processed data, it adds its own server identifier to the first processed data and sends the data to a database for storage. After the distributed server restarts, it reads from the database the first processed data carrying that server identifier and, based on the first processed data read, continues the first processing on subsequently received streaming data. With this method, a distributed server that has failed and restarted does not need to read the full set of first processed data from the database; it only needs to read the first processed data carrying its own server identifier. The amount of data the distributed server reads from the database is therefore effectively reduced, network resources are saved, and the distributed server is kept from overflowing its own memory and failing again because it read too much data.
In a typical configuration, a computing device comprises one or more processors (CPUs), input/output interfaces, network interfaces and memory.
Memory may include volatile memory in a computer-readable medium, in the form of random access memory (RAM), and/or non-volatile memory such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
Computer-readable media include permanent and non-permanent, removable and non-removable media, and may store information by any method or technology. The information may be computer-readable instructions, data structures, program modules or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technologies, compact disc read-only memory (CD-ROM), digital versatile disc (DVD) or other optical storage, magnetic cassettes, magnetic tape or disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media (transitory media), such as modulated data signals and carrier waves.
It should also be noted that the terms "comprise", "include" or any other variant thereof are intended to cover non-exclusive inclusion, so that a process, method, commodity or device comprising a series of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, commodity or device. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the existence of additional identical elements in the process, method, commodity or device comprising that element.
Those skilled in the art will understand that embodiments of the present application may be provided as a method, a system or a computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Moreover, the present application may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, optical storage and the like) containing computer-usable program code.
The above are merely embodiments of the present application and are not intended to limit the present application. Various modifications and variations of the present application are possible to those skilled in the art. Any modification, equivalent replacement, improvement and the like made within the spirit and principle of the present application shall fall within the scope of the claims of the present application.

Claims (10)

1. A data processing method, characterized by comprising:
a distributed server receiving streaming data; and
performing first processing on the received streaming data, based on first preceding processed data that it has saved and that was obtained by performing the first processing on previously received streaming data, to obtain first processed data;
adding a server identifier of the distributed server to the first processed data, and sending the first processed data to which the server identifier has been added to a database for storage;
after the distributed server restarts, reading from the database, according to its own server identifier, the first processed data carrying the server identifier, and continuing the first processing on subsequently received streaming data based on the first processed data read.
2. The method of claim 1, characterized in that the distributed server receiving streaming data specifically comprises:
the distributed server receiving streaming data carrying a first processing dimension identifier corresponding to the distributed server, wherein the first processing dimension identifier is the processing dimension identifier corresponding to the first processing that the distributed server is to perform on the received streaming data, and the correspondence between the first processing dimension identifier and the distributed server is preset.
3. The method of claim 1, characterized in that, after the first processing is performed on the received streaming data to obtain the first processed data, the method further comprises:
performing second processing on the first processed data to obtain second processed data; and
adding a second processing dimension identifier to the second processed data according to the second processing dimension identifier carried in the first processed data, and sending the second processed data to which the second processing dimension identifier has been added to the database for storage, wherein the second processing dimension identifier is the processing dimension identifier corresponding to the second processing;
after the distributed server restarts and after the first processing is continued on the subsequently received streaming data based on the first processed data read, the method further comprises:
reading from the database the second processed data carrying the second processing dimension identifier, according to the second processing dimension identifier carried in the first processed data read from the database;
continuing the second processing on first subsequent processed data, based on the first subsequent processed data obtained by performing the first processing on the subsequently received streaming data and on the second processed data read.
4. The method of claim 3, characterized in that, after the first processing is performed on the received streaming data to obtain the first processed data, the method further comprises:
adding receiving time information to the first processed data, the receiving time information being the time at which the server received the streaming data corresponding to the first processed data;
and reading from the database the first processed data carrying the server identifier specifically comprises:
determining, according to the pre-divided time periods, the time period to which the current time belongs, and reading from the database the first processed data that carries the server identifier and whose receiving time information falls within the time period to which the current time belongs.
5. The method of claim 4, characterized in that, when the distributed server determines that its own memory holds no second processed data carrying the second processing dimension identifier, the second processed data carrying the second processing dimension identifier is read from the database; or
when the time period to which the receiving time of the subsequent streaming data received by the distributed server belongs is the same as the time period to which the receiving time of the streaming data received before the restart belongs, the second processed data carrying the second processing dimension identifier is read from the database; or
when the distributed server determines both that its own memory holds no second processed data carrying the second processing dimension identifier and that the time period to which the receiving time of the subsequent streaming data belongs is the same as the time period to which the receiving time of the streaming data received before the restart belongs, the second processed data carrying the second processing dimension identifier is read from the database.
6. A data processing device, characterized by comprising:
a receiving module, configured to receive streaming data;
a first processing module, configured to perform first processing on the received streaming data, based on first preceding processed data that the device itself has saved and that was obtained by performing the first processing on previously received streaming data, to obtain first processed data;
a first adding module, configured to add the server identifier of the device to the first processed data;
a sending module, configured to send the first processed data to which the server identifier has been added to a database for storage;
a first reading module, configured to read from the database, after the device restarts and according to the server identifier of the device, the first processed data carrying the server identifier;
the first processing module being further configured to continue the first processing on the subsequent streaming data received by the receiving module, based on the first processed data read.
7. The device of claim 6, characterized in that the receiving module is specifically configured to receive streaming data carrying a first processing dimension identifier corresponding to the device, wherein the first processing dimension identifier is the processing dimension identifier corresponding to the first processing that the device is to perform on the received streaming data, and the correspondence between the first processing dimension identifier and the device is preset.
8. The device as claimed in claim 6, characterized in that the device further comprises:
A second processing module, configured to perform second processing on the first process data to obtain second process data, after the first processing module has performed the first processing on the received streaming data and obtained the first process data;
A second adding module, configured to add a second process dimension mark to the second process data according to the second process dimension mark carried in the first process data, wherein the second process dimension mark is the process dimension mark corresponding to the second processing;
The sending module is further configured to send the second process data, with the second process dimension mark added, to the database for storage;
A second read module, configured to read from the database the second process data carrying the second process dimension mark, after the first processing module has continued the first processing on the received subsequent streaming data based on the first process data that was read, and according to the second process dimension mark carried in the first process data that the first read module read from the database;
The second processing module is further configured to perform the second processing on first subsequent process data, based on the first subsequent process data that the first processing module obtains by performing the first processing on the received subsequent streaming data and on the second process data read by the second read module.
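A rough sketch of the two-stage recovery in claim 8 follows. The concrete processing steps are assumptions chosen to keep the example small: the first processing is a per-user event count, the second processing tracks the highest per-user count seen under a second process dimension mark, and the `Database` class and the mark `region:east` are hypothetical.

```python
class Database:
    """In-memory stand-in for the shared database (hypothetical)."""

    def __init__(self):
        self.first_rows = []   # first process data, tagged with server id
        self.second_rows = {}  # second process dimension mark -> second process data

    def read_first(self, server_id):
        return [r for r in self.first_rows if r["server_id"] == server_id]


class Device:
    def __init__(self, server_id, db):
        self.server_id = server_id
        self.db = db
        self.first_state = {}    # user -> event count (first process data)
        self.second_state = {}   # mark -> highest per-user count (second process data)
        self.marks_to_restore = set()

    def receive(self, user, second_mark):
        # First processing: per-user running count.
        count = self.first_state.get(user, 0) + 1
        self.first_state[user] = count
        self.db.first_rows.append({"server_id": self.server_id, "user": user,
                                   "count": count, "second_mark": second_mark})
        # After a restart, read stored second process data by its dimension mark.
        if second_mark in self.marks_to_restore:
            self.marks_to_restore.discard(second_mark)
            if second_mark in self.db.second_rows:
                self.second_state[second_mark] = self.db.second_rows[second_mark]
        # Second processing on the first process data: highest count per mark.
        best = max(self.second_state.get(second_mark, 0), count)
        self.second_state[second_mark] = best
        self.db.second_rows[second_mark] = best
        return best

    def restart(self):
        self.first_state.clear()
        self.second_state.clear()
        for row in self.db.read_first(self.server_id):
            self.first_state[row["user"]] = row["count"]
            # Remember which marks the recovered first process data carries.
            self.marks_to_restore.add(row["second_mark"])


db = Database()
device = Device("server-A", db)
device.receive("u1", "region:east")
device.receive("u1", "region:east")
device.receive("u2", "region:east")
device.restart()
after_restart = device.receive("u2", "region:east")
```

The second process data is fetched lazily, only for marks that the recovered first process data actually carries, rather than loading every stored second-stage result at restart.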
9. The device as claimed in claim 8, characterized in that the first adding module is further configured to add reception time information to the first process data, the reception time information being the time at which the receiver module received the streaming data corresponding to the first process data;
The first read module is specifically configured to determine, according to time periods divided in advance, the time period to which the current time belongs, and to read from the database the first process data that carries the server identification and whose reception time information falls within the time period to which the current time belongs.
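The time-period filter of claim 9 might look like the following sketch. The one-hour period length, the row layout, and the function names `period_of` and `read_after_restart` are all assumptions; the claim only requires that the periods be divided in advance.

```python
from datetime import datetime, timedelta

# Assumption: time periods are divided in advance as fixed one-hour windows.
PERIOD = timedelta(hours=1)


def period_of(moment):
    """Return (start, end) of the pre-divided period containing `moment`."""
    start = moment.replace(minute=0, second=0, microsecond=0)
    return start, start + PERIOD


def read_after_restart(rows, server_id, now):
    """Select rows with this server identification received in the current period."""
    start, end = period_of(now)
    return [
        row for row in rows
        if row["server_id"] == server_id and start <= row["received_at"] < end
    ]


rows = [
    {"server_id": "server-A", "received_at": datetime(2014, 1, 16, 9, 50), "total": 3},
    {"server_id": "server-A", "received_at": datetime(2014, 1, 16, 10, 5), "total": 7},
    {"server_id": "server-B", "received_at": datetime(2014, 1, 16, 10, 20), "total": 100},
]
recovered = read_after_restart(rows, "server-A", now=datetime(2014, 1, 16, 10, 30))
```

Rows from an earlier period are skipped even when they carry the right server identification, which narrows the read further than the server filter alone.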
10. The device as claimed in claim 9, characterized in that the second read module is specifically configured to: read from the database the second process data carrying the second process dimension mark when the distributed server determines that its own memory holds no second process data carrying the second process dimension mark; or, read from the database the second process data carrying the second process dimension mark when the time period containing the reception time at which the distributed server received the subsequent streaming data is the same as the time period containing the reception time of the streaming data received before the restart; or, read from the database the second process data carrying the second process dimension mark when the distributed server determines that its own memory holds no second process data carrying the second process dimension mark and the time period containing the reception time of the subsequent streaming data is the same as the time period containing the reception time of the streaming data received before the restart.
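The three alternative trigger conditions in claim 10 can be expressed as a single predicate. This is a sketch under stated assumptions: the rule names `missing`, `same_period`, and `both` are labels invented here for the three alternatives, periods are assumed to be one-hour windows, and `memory` is assumed to be a mapping keyed by second process dimension mark.

```python
from datetime import datetime


def period_start(moment):
    # Assumption: pre-divided periods are one-hour windows.
    return moment.replace(minute=0, second=0, microsecond=0)


def should_read_second(memory, mark, subsequent_at, before_restart_at, rule):
    """Decide whether second process data for `mark` must be read from the database."""
    # Condition 1: memory holds no second process data carrying this mark.
    missing = mark not in memory
    # Condition 2: data received after the restart falls in the same period
    # as the data received before the restart.
    same_period = period_start(subsequent_at) == period_start(before_restart_at)
    if rule == "missing":
        return missing
    if rule == "same_period":
        return same_period
    if rule == "both":
        return missing and same_period
    raise ValueError(f"unknown rule: {rule!r}")


memory = {}
after = datetime(2014, 1, 16, 10, 30)
same_hour = datetime(2014, 1, 16, 10, 5)
earlier_hour = datetime(2014, 1, 16, 9, 50)
```

Under the combined rule, a restart that straddles a period boundary does not trigger a read, since results from the earlier period are presumably no longer needed for the new period.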
CN201410020530.2A 2014-01-16 2014-01-16 A kind of data processing method and device Active CN104794114B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410020530.2A CN104794114B (en) 2014-01-16 2014-01-16 A kind of data processing method and device

Publications (2)

Publication Number Publication Date
CN104794114A true CN104794114A (en) 2015-07-22
CN104794114B CN104794114B (en) 2018-04-03

Family

ID=53558911

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410020530.2A Active CN104794114B (en) 2014-01-16 2014-01-16 Data processing method and device

Country Status (1)

Country Link
CN (1) CN104794114B (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5832119A (en) * 1993-11-18 1998-11-03 Digimarc Corporation Methods for controlling systems using control signals embedded in empirical data
US5832119C1 (en) * 1993-11-18 2002-03-05 Digimarc Corp Methods for controlling systems using control signals embedded in empirical data
CN103095758A (en) * 2011-11-01 2013-05-08 北大方正集团有限公司 Distributed file system and file data processing method thereof
CN103188282A (en) * 2011-12-27 2013-07-03 腾讯科技(北京)有限公司 Method, system and equipment for file synchronization
CN103237075A (en) * 2013-04-24 2013-08-07 广州市动景计算机科技有限公司 Data synchronizing method, device and system

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2017118318A1 (en) * 2016-01-06 2017-07-13 阿里巴巴集团控股有限公司 Data storage and service processing method and device
TWI686703B (en) * 2016-01-06 2020-03-01 香港商阿里巴巴集團服務有限公司 Method and device for data storage and business processing
WO2017219858A1 (en) * 2016-06-20 2017-12-28 阿里巴巴集团控股有限公司 Streaming data distributed processing method and device
TWI662426B (en) * 2016-06-20 2019-06-11 香港商阿里巴巴集團服務有限公司 Method and device for distributed stream data processing
US11036562B2 (en) 2016-06-20 2021-06-15 Advanced New Technologies Co., Ltd. Streaming data distributed processing method and device
CN106339448A (en) * 2016-08-24 2017-01-18 苏州新科兰德科技有限公司 Spark-based streaming data processing method and device
WO2018210062A1 (en) * 2017-05-18 2018-11-22 北京京东尚科信息技术有限公司 Method and device for determining inventory of items by server system
CN108108126A (en) * 2017-12-15 2018-06-01 北京奇艺世纪科技有限公司 A kind of data processing method, device and equipment
CN108108126B (en) * 2017-12-15 2021-05-28 北京奇艺世纪科技有限公司 Data processing method, device and equipment
CN108121514A (en) * 2017-12-29 2018-06-05 北京奇虎科技有限公司 Metamessage update method, device, computing device and computer storage media
CN109885729A (en) * 2019-02-20 2019-06-14 北京奇艺世纪科技有限公司 A kind of method, apparatus and system showing data
CN110297578A (en) * 2019-06-27 2019-10-01 北京金山安全软件有限公司 Method and device for processing partial data in mass data in batch and electronic equipment

Also Published As

Publication number Publication date
CN104794114B (en) 2018-04-03

Similar Documents

Publication Publication Date Title
CN104794114A (en) Data processing method and device
JP6669682B2 (en) Cloud server scheduling method and apparatus
CN106550003B (en) Load balancing control method, device and system
CN110995513B (en) Data sending and receiving method in Internet of things system, internet of things equipment and platform
TWI622933B (en) Client update method and device
CN109669776B (en) Detection task processing method, device and system
CN106789377B (en) Service parameter updating method of network element cluster
EP3489825A1 (en) Method, apparatus and computer readable storage medium for processing service
CN107276970B (en) Unbinding and binding method and device
CN106815254A (en) A kind of data processing method and device
CN112954045B (en) Data transmission method, device, medium and electronic equipment in node
CN112650575B (en) Resource scheduling method, device and cloud service system
CN111949389B (en) Slurm-based information acquisition method and device, server and computer-readable storage medium
CN104850556A (en) Method and device for data processing
CN104793982A (en) Method and device for establishing virtual machine
CN104484167A (en) Task processing method and device
KR102114532B1 (en) Information operation
CN106789301B (en) Method and device for generating running log of WEB gateway
CN113312553A (en) Method and device for determining user label
CN110166318B (en) Data statistical method and device
CN104579753A (en) User data statistical method, access layer server and statistical server
CN110753006A (en) Data processing method and device and electronic equipment
CN111352803A (en) Service data processing method, device, equipment and storage medium
CN109034894B (en) Advertisement page browsing amount statistical method and device, electronic equipment and storage medium
CN110990039A (en) Application program updating method, device and system and computer readable storage medium

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
EXSB Decision made by sipo to initiate substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant