CN102622378A - Method and device for detecting events from text flow - Google Patents

Method and device for detecting events from text flow

Publication number
CN102622378A
Authority
CN
China
Prior art keywords
text
real
time
characteristic
flow
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201110035163XA
Other languages
Chinese (zh)
Inventor
高婷婷
陈冬梁
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Oak Pacific Interactive Technology Development Co Ltd
Original Assignee
Beijing Oak Pacific Interactive Technology Development Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Oak Pacific Interactive Technology Development Co Ltd

Abstract

The invention discloses a method and device for detecting events from a text flow. The method according to the invention can comprise the following steps: performing real-time preprocessing on the text flow to obtain the feature vector of each text in the text flow; performing real-time online clustering on each preprocessed text based on the feature vector of that text; and identifying events based on the result of the real-time online clustering. The invention provides a technical solution for identifying events from a text flow by processing the text flow in real time. The solution offers high flexibility, good real-time performance, and a fast response time. It requires no manual intervention during processing, and it is an intelligent, adaptive solution that is especially suitable for text flows generated on the Internet.

Description

Method and device for detecting events from a text flow
Technical field
The present invention relates to the technical field of information mining, and more particularly to a method and device for detecting events from a text flow.
Background art
The arrival of the Web 2.0 era has gradually changed the information propagation model of the Web 1.0 era, and the role of the user has also changed and been repositioned. With Web 2.0 technology, propagating information has become very cheap and very efficient, and users enjoy greater freedom to obtain, propagate, and share information on the Internet. For example, social networks, resource-sharing networks, communities, forums, blogs, and microblogs all give users ways to publish information and content. As a result, more and more content on the Internet is created by users, that is, user-generated content (UGC).
The massive generation and propagation of UGC have gradually made short-text computing, Web text information extraction, text sentiment analysis, and similar problems hot topics in text mining research. At the same time, this user-created Web content contains a large amount of important information that can be mined.
In the prior art, information mining is mainly based on offline clustering of text content. In such solutions, all of the text content to be processed is read into computer memory at once while offline, and an existing clustering algorithm is then applied to the texts. This approach therefore has poor real-time performance. In addition, the feature vectors obtained by processing the texts are all kept in memory for later processing, which occupies a large amount of memory and wastes resources. Furthermore, the clustering results are not processed any further, so the readability of the clusters is very low.
In today's Internet, however, the amount of UGC is very large and it arrives continuously, thereby forming a text flow. Because the texts in the text flow are huge in volume and arrive continuously, the offline clustering technique described above cannot be used to process the text flow. In addition, processing such a text flow demands good real-time performance: all processing usually has to complete within a very short response time, which offline clustering cannot achieve. Moreover, because the amount of text in the text flow is huge, the demands on storage such as memory are high, which leads to high cost. For these reasons, the clustering approach of the prior art is not applicable to processing text flows.
Accordingly, there is a need in the art for a technical solution that processes a text flow in order to mine useful information from it.
Summary of the invention
In view of this, the present invention provides a method and device for detecting events from a text flow, so as to overcome or at least partially eliminate the defects of the prior art.
The network is currently one of the fastest channels for propagating information. Usually, once an event occurs, it spreads widely over the network almost immediately. UGC contains a great deal of such information, so being able to detect events from UGC in a timely manner would be very useful. Based on this idea, the present invention proposes a new technical solution.
According to one aspect of the present invention, a method for detecting events from a text flow is provided. The method comprises: performing real-time preprocessing on the text flow to obtain the feature vector of each text in the text flow; performing real-time online clustering on each preprocessed text based on the feature vector of that text; and identifying events based on the result of the real-time online clustering.
In one embodiment of the present invention, performing real-time preprocessing on the text flow may comprise: performing a word segmentation operation on each text in the text flow to obtain the feature words of each text, thereby forming a feature vocabulary that contains the feature words of each text; and calculating the feature vector of each text based on the feature vocabulary and the feature words of that text.
In another embodiment of the present invention, performing real-time online clustering on each preprocessed text that has entered memory, based on the feature vector of that text, comprises: calculating the similarity value between each preprocessed text and the existing clusters; classifying each text based on the calculated similarity values; and adjusting the center of the updated cluster for use when the next text is clustered in real time.
In a further embodiment of the present invention, performing real-time preprocessing on the text flow may further comprise: extracting the feature values of the feature vector of each text in the text flow and the corresponding feature value positions, so that only the feature values and the corresponding feature value positions are stored.
In another embodiment of the present invention, identifying events based on the result of the real-time online clustering comprises: identifying large clusters formed by the clustering as concentrated events; and identifying small clusters or isolated points formed by the clustering as new events.
In yet another embodiment of the present invention, the method may further comprise: determining the meaning represented by each cluster in the result of the real-time online clustering.
According to a second aspect of the present invention, a device for detecting events from a text flow is provided. The device comprises: a preprocessing unit configured to perform real-time preprocessing on the text flow to obtain the feature vector of each text in the text flow; an online clustering unit configured to perform real-time online clustering on each preprocessed text based on the feature vector of that text; and an event identification unit configured to identify events based on the result of the real-time online clustering.
According to the present invention, a technical solution is provided that identifies events from a text flow by processing the text flow in real time; it offers high flexibility, good real-time performance, and a fast response time. Moreover, the solution requires no human intervention during processing; it is an intelligent, adaptive solution that is especially suitable for text flows generated on the Internet.
Description of drawings
The above and other features of the present invention will become more apparent from the detailed description of the embodiments shown in the accompanying drawings, in which the same reference numerals denote the same or similar parts. In the drawings:
Fig. 1 shows a flowchart of a method for detecting events from a text flow according to an embodiment of the present invention.
Fig. 2 shows a flowchart of a method for performing real-time online clustering on a text according to an embodiment of the present invention.
Fig. 3 shows a block diagram of a device for detecting events from a text flow according to an embodiment of the present invention.
Fig. 4 schematically shows a block diagram of a computer device in which embodiments of the present invention can be implemented.
Detailed description of the embodiments
Hereinafter, the method and device for identifying events from a text flow provided by the present invention will be described in detail through embodiments with reference to the accompanying drawings.
First, as shown in Fig. 1, real-time preprocessing is performed on the text flow in step 101 to obtain the feature vector of each text in the text flow.
UGC from user devices is sent continuously to a web server as a text flow, and the web server receives it. The text flow can then be read into memory and preprocessed to obtain the feature vector of each text in the text flow.
First, a word segmentation operation is performed on the texts in the text flow one by one to obtain the feature words of each text, thereby forming a feature vocabulary that contains those feature words. In practice, network users may enter text using various character sets, for example Traditional Chinese or nonstandard scripts such as "Martian script". Therefore, text conversion can preferably be performed on the texts in the text flow first, so that subsequent operations are based on the same script. For example, Martian script, Traditional Chinese, and other nonstandard forms can be converted into Simplified Chinese.
It should be noted, however, that this operation is not mandatory, for two reasons. On the one hand, users who write in anything other than Simplified Chinese are usually few, so most texts in the text flow need no such conversion. On the other hand, the amount of UGC on the network is so large that ignoring the UGC written in scripts other than Simplified Chinese usually has no significant effect on the event detection result.
In regions where, for example, Traditional Chinese is predominant, however, operations can instead be based on Traditional Chinese, and minority scripts such as Simplified Chinese can be converted into Traditional Chinese for processing.
Then, a segmentation operation can be performed to divide each sentence into independent, meaningful words or phrases. Segmentation can be implemented with several techniques, for example string-matching-based segmentation, understanding-based segmentation, and statistics-based segmentation. These techniques are known in the prior art and are not described further here. Preferably, a double-array trie can be used in the segmentation operation, which yields higher efficiency and saves memory.
Then, symbols or words with no practical meaning, such as stop words, can be removed to obtain the meaningful words of the text. Stop words can be removed based on a predefined stop-word list; for example, any word in the text that appears in the stop-word list is removed. The stop-word list can be a predefined table and can be updated continually.
Through the foregoing operations, meaningless symbols, words, and the like are removed from the text, leaving the words that have concrete meaning, namely the feature words, which may also be called feature items. The resulting feature words can be stored in the feature vocabulary for later use. The feature vocabulary can be stored, for example, as a one-dimensional array in which each array element stores a different feature word. Those skilled in the art will appreciate, however, that the feature vocabulary can be stored in any suitable manner; the invention is not limited in this respect.
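The preprocessing steps above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the regular-expression split is only a stand-in for a real word segmenter (it would not segment Chinese text), and the names `STOP_WORDS`, `segment`, `extract_feature_words`, and `update_vocabulary` are hypothetical.

```python
import re

# Hypothetical stop-word list; a real system would load a larger table and update it.
STOP_WORDS = {"the", "a", "of", "is", "in", "and", "to"}

def segment(text):
    """Stand-in for word segmentation: lowercase, then split on non-word characters."""
    return [tok for tok in re.split(r"\W+", text.lower()) if tok]

def extract_feature_words(text, stop_words=STOP_WORDS):
    """Segment the text and drop stop words, keeping the feature words."""
    return [tok for tok in segment(text) if tok not in stop_words]

def update_vocabulary(vocabulary, index, feature_words):
    """Append unseen feature words to the one-dimensional vocabulary array.

    `index` maps each word to its position so lookups stay O(1).
    """
    for word in feature_words:
        if word not in index:
            index[word] = len(vocabulary)
            vocabulary.append(word)

vocabulary, index = [], {}
words = extract_feature_words("The earthquake is in the city, and the city shakes!")
update_vocabulary(vocabulary, index, words)
print(words)       # feature words of this text, duplicates kept
print(vocabulary)  # distinct feature words seen so far
```

Keeping a word-to-position index next to the array is a design choice of this sketch; the source only requires that each array element store a different feature word.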
Subsequently, for each text in the text flow, the feature vector v of the text can be calculated based on the feature vocabulary and the feature words of that text. The feature vector can be constructed in many ways. For illustration, two simple and feasible examples are given below.
One way is to build the following vector:
$$ v = (w_1, w_2, \ldots, w_k, \ldots, w_n) \qquad \text{(Formula 1)} $$
where the vector v has n elements, n being the number of feature words in the current feature vocabulary, and $w_k$ denotes the k-th element of the vector, whose value is the number of times the k-th feature word of the feature vocabulary occurs in the text. If that feature word does not occur in the text, $w_k$ is 0; otherwise $w_k$ is the number of times it occurs.
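As a small illustration of Formula 1, the sketch below counts how often each vocabulary word occurs in one text. The function name `count_vector` and the sample vocabulary are assumptions made for the example.

```python
from collections import Counter

def count_vector(feature_words, vocabulary):
    """Formula 1: element k is the number of occurrences of the k-th vocabulary word."""
    counts = Counter(feature_words)
    return [counts[word] for word in vocabulary]  # Counter yields 0 for absent words

vocabulary = ["earthquake", "city", "shakes", "rescue"]
v = count_vector(["earthquake", "city", "city", "shakes"], vocabulary)
print(v)
```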
The feature vector given above is only one exemplary embodiment. The invention is not limited to it; the feature vector of a text can be calculated in any other suitable way. For example, the following feature vector can be constructed:
$$ v = (w_1, w_2, \ldots, w_k, \ldots, w_n) \qquad \text{(Formula 1')} $$
where, as in Formula 1, the vector has n elements, n being the number of feature words in the current feature vocabulary; $w_k$ denotes the k-th element of the vector, with $w_k = tf_k \cdot idf_k$. Here $tf_k$ is the term frequency, that is, the frequency with which the k-th feature word of the feature vocabulary occurs in the text, and $idf_k$ is the inverse document frequency of the k-th feature word over the set of preprocessed texts currently in memory.
The inverse document frequency $idf_k$ can be calculated, for example, by the following formula:
$$ idf_k = [\log_2 m - \log_2 d_k] + 1 \qquad \text{(Formula 2)} $$
where m is the total number of texts in the set of preprocessed texts currently in memory, and $d_k$ is the number of texts in that set that contain the k-th feature item.
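Formulas 1' and 2 can be sketched as below. Two assumptions are made here and are not stated in the source: the term frequency is taken as the raw occurrence count, and the square brackets in Formula 2 are read as plain grouping. The names `idf`, `tfidf_vector`, and `doc_freq` are hypothetical.

```python
import math
from collections import Counter

def idf(m, d_k):
    """Formula 2: idf_k = [log2(m) - log2(d_k)] + 1, with brackets read as grouping."""
    return (math.log2(m) - math.log2(d_k)) + 1

def tfidf_vector(feature_words, vocabulary, doc_freq, m):
    """Formula 1': element k is tf_k * idf_k (tf_k assumed to be the raw count)."""
    counts = Counter(feature_words)
    return [
        counts[word] * idf(m, doc_freq[word]) if counts[word] else 0.0
        for word in vocabulary
    ]

vocabulary = ["earthquake", "city", "shakes"]
doc_freq = {"earthquake": 2, "city": 8, "shakes": 4}  # d_k for each feature word
m = 8                                                 # texts currently in memory
v = tfidf_vector(["earthquake", "city", "city"], vocabulary, doc_freq, m)
print(v)
```

Here "earthquake" appears in only 2 of 8 texts, so it gets a higher weight per occurrence than "city", which appears in all 8.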
In this way, the feature vector v of each text in the text flow is obtained and used to characterize that text.
It should be noted that memory capacity is limited. To ensure effective use of resources and computation speed, the system can usually be set to keep in memory only the information for a predetermined number of texts, or only for texts generated within the last several hours. This reduces the memory capacity required by the continuously arriving text flow and also ensures faster computation and better real-time performance.
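One way to bound what stays in memory, by count and optionally by age, is sketched below. The class name `RecentTexts` and its parameters are hypothetical; the source only states that a predetermined number of texts, or the texts of the last several hours, are retained.

```python
import time
from collections import deque

class RecentTexts:
    """Keeps at most `max_count` entries and, optionally, only entries newer than `max_age_seconds`."""

    def __init__(self, max_count=10000, max_age_seconds=None):
        self.max_age = max_age_seconds
        self.items = deque(maxlen=max_count)  # the oldest entry is dropped when full

    def add(self, vector, timestamp=None):
        ts = time.time() if timestamp is None else timestamp
        self.items.append((ts, vector))
        self._expire(ts)

    def _expire(self, now):
        if self.max_age is None:
            return
        # Entries are in arrival order, so expired ones sit at the left end.
        while self.items and now - self.items[0][0] > self.max_age:
            self.items.popleft()

by_count = RecentTexts(max_count=2)
for t in (0, 1, 2):
    by_count.add({1: 1.0}, timestamp=t)

by_age = RecentTexts(max_count=10, max_age_seconds=60)
for t in (0, 30, 100):
    by_age.add({1: 1.0}, timestamp=t)
print(len(by_count.items), len(by_age.items))
```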
In addition, the feature vectors of the texts can preferably be processed further to save storage space. The vector space of texts generated on the Internet is highly sparse. For a single piece of UGC, only a few dimensions of the high-dimensional vector are nonzero; in other words, only a few dimensions carry the features of the vector and all other entries are 0. Storing the whole vector therefore takes a large amount of space and wastes resources.
For this reason, the inventors have devised a new storage scheme. According to an illustrative embodiment of the present invention, the feature vector is processed further: the positions of the nonzero values (that is, the feature values) in the feature vector are determined and the feature values are extracted, so that only the feature value positions and the feature values need to be stored in memory.
With this scheme, only the feature value positions and the corresponding feature values are stored to represent the feature vector of a text. In this sense, the storage method may be called feature-value storage.
For clarity, an example of feature-value storage follows. For the feature vector $v_i = (0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0.9, 0, 0, 2.3)$, the feature value positions 13 and 16 and the corresponding feature values 0.9 and 2.3 are obtained. The vector can then be stored as $v_i$ = 13:0.9, 16:2.3.
As the description above makes clear, feature-value storage greatly reduces the amount of information to be stored, saves a significant amount of memory, and in turn improves real-time performance.
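The feature-value storage described above corresponds to a sparse position-to-value mapping. The sketch below reproduces the example from the text using 1-based positions; the helper names `to_sparse` and `to_dense` are assumptions for illustration.

```python
def to_sparse(vector):
    """Keep only the nonzero feature values, keyed by 1-based position."""
    return {pos: val for pos, val in enumerate(vector, start=1) if val != 0}

def to_dense(sparse, n):
    """Rebuild the full n-element vector from the stored positions and values."""
    return [sparse.get(pos, 0) for pos in range(1, n + 1)]

v_i = [0] * 12 + [0.9, 0, 0, 2.3]   # the 16-element example vector
stored = to_sparse(v_i)
print(stored)
```

Only two position/value pairs are stored instead of sixteen numbers, and the dense vector can be recovered whenever needed.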
Next, as shown in Fig. 1, real-time online clustering can be performed on the preprocessed text flow in step 102. The clustering can use the K-means algorithm or any other suitable algorithm.
Real-time online clustering is described in further detail with reference to Fig. 2, which shows a flowchart of the real-time online clustering operation performed on a text.
First, in step 201, it is determined whether the received text $D_i$ is the first processed text. If it is, no cluster exists yet. In that case, a new cluster containing this preprocessed text, which has been read into memory, is created in step 202; that is, an initial cluster is established for the text, and the text $D_i$ is taken as the center point of that initial cluster.
One situation that requires this operation is the initial state, that is, when the method is executed for the first time. At that point no preprocessed text exists other than the first one, so a new cluster is created.
Another situation arises when the method is restarted. In one embodiment of the present invention, after running for a specific period (for example 6 hours, 12 hours, 24 hours, or any other suitable time), the previous processing is ended and a new run is started. When operation restarts, a new cluster must likewise be created and an initial cluster center established. One purpose of restarting the clustering operation is to prevent previously formed clusters from adversely affecting the classification of the subsequent text flow. Texts belonging to the same cluster are known to emerge in a burst concentrated in time; after a certain period, the texts that appear mainly relate to other clusters, just as each day brings different news. If earlier clusters remained in memory for a long time, they would reduce the prominence of later clusters and impair the real-time performance of later clustering operations.
On the other hand, if step 201 determines that the text is not the first processed text, the similarity between the text and each existing cluster is calculated in step 203. The similarity can be calculated, for example, by the following formula:
$$ \mathrm{Sim}(D_i, C_j) = \cos(v_i, v_j) = \frac{\sum_{k=1}^{n} w_{ik} w_{jk}}{\sqrt{\sum_{k=1}^{n} w_{ik}^2 \, \sum_{k=1}^{n} w_{jk}^2}} \qquad \text{(Formula 3)} $$
where $D_i$ denotes the text being clustered, $C_j$ denotes the j-th of the existing clusters, $v_i$ and $v_j$ denote the feature vector of text $D_i$ and the feature vector of the center point of cluster $C_j$ respectively, $w_{ik}$ denotes the k-th element of $v_i$, and $w_{jk}$ denotes the k-th element of $v_j$.
This similarity value is the cosine of the angle between the two vectors and lies between 0 and 1. The larger the value, the smaller the angle between the two vectors and the closer the text is to the center point of the cluster; a value of 1 indicates that the two are identical in direction.
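Formula 3 can be evaluated directly on vectors held in feature-value (sparse) form, since only positions present in both vectors contribute to the numerator. The sketch below assumes vectors are dictionaries mapping position to value; the function name is hypothetical.

```python
import math

def cosine_similarity(v_i, v_j):
    """Formula 3 on sparse vectors given as {position: value} dictionaries."""
    dot = sum(w * v_j[k] for k, w in v_i.items() if k in v_j)
    norm_i = math.sqrt(sum(w * w for w in v_i.values()))
    norm_j = math.sqrt(sum(w * w for w in v_j.values()))
    if norm_i == 0 or norm_j == 0:
        return 0.0  # convention of this sketch for an empty vector
    return dot / (norm_i * norm_j)

same = cosine_similarity({1: 3.0, 2: 4.0}, {1: 3.0, 2: 4.0})
disjoint = cosine_similarity({1: 1.0}, {2: 1.0})
partial = cosine_similarity({1: 1.0, 2: 1.0}, {1: 1.0})
print(same, disjoint, partial)
```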
Then, the text can be classified based on the calculated similarity value.
As shown in Fig. 2, step 204 determines whether the similarity value is greater than or equal to a predetermined similarity threshold, so that the text can be classified based on that threshold. If step 204 determines that the calculated similarity value is greater than or equal to the threshold, the text is assigned to the corresponding cluster $C_j$.
Then, in step 205, the center of the updated cluster is calculated after the text has been incorporated, and the newly calculated center is taken as the center of all texts in the cluster. The similarity values of subsequent texts are then calculated against the new cluster center. The cluster center can be calculated, for example, by the K-means method.
In one embodiment of the present invention, all points in the cluster are summed and averaged; that is, the corresponding elements of the feature vectors of all texts are added and averaged, and the text point represented by the resulting vector is taken as the center point of the updated cluster. In another embodiment of the present invention, the distance from each point in the cluster to the former cluster center is used as a weight, a weighted average over all points is computed, and the text point represented by the resulting vector is taken as the center point of the updated cluster.
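The two center-update variants can be sketched on sparse vectors as follows. The function names are hypothetical, and `weighted_center` takes the weights as an argument; how the weights are derived from distances to the former center is left open here, as the source does not give a formula.

```python
from collections import defaultdict

def mean_center(members):
    """Element-wise average of all member vectors ({position: value} dictionaries)."""
    total = defaultdict(float)
    for vec in members:
        for pos, val in vec.items():
            total[pos] += val
    n = len(members)
    return {pos: val / n for pos, val in total.items()}

def weighted_center(members, weights):
    """Weighted element-wise average; weights[i] applies to members[i]."""
    total = defaultdict(float)
    for vec, w in zip(members, weights):
        for pos, val in vec.items():
            total[pos] += w * val
    weight_sum = sum(weights)
    return {pos: val / weight_sum for pos, val in total.items()}

plain = mean_center([{1: 2.0, 3: 4.0}, {1: 4.0}])
weighted = weighted_center([{1: 2.0}, {1: 4.0}], [1.0, 3.0])
print(plain, weighted)
```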
It should be noted that the center point is also represented by a vector similar to a text feature vector; like a feature vector, it has only a few nonzero dimensions in the high-dimensional space. It can therefore also be stored using feature-value storage, which saves memory.
If step 204 determines that all calculated similarity values are below the predetermined threshold, the flow proceeds to step 202, where a new cluster is created for the text.
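Steps 201 to 205 can be put together in a single-pass clustering sketch like the one below. It is an illustration under assumptions, not the patented implementation: the class name `OnlineClusterer` and the threshold value 0.5 are hypothetical, and when several clusters reach the threshold the sketch assigns the text to the most similar one, a choice the source does not spell out.

```python
import math

def cosine(a, b):
    """Cosine similarity (Formula 3) on {position: value} dictionaries."""
    dot = sum(w * b[k] for k, w in a.items() if k in b)
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

class OnlineClusterer:
    def __init__(self, threshold=0.5):
        self.threshold = threshold  # predetermined similarity threshold (assumed value)
        self.clusters = []          # each: {"center": dict, "sum": dict, "size": int}

    def add(self, vec):
        """Assign one preprocessed text and return the index of its cluster."""
        best, best_sim = None, -1.0
        for idx, cluster in enumerate(self.clusters):       # step 203
            sim = cosine(vec, cluster["center"])
            if sim > best_sim:
                best, best_sim = idx, sim
        if best is None or best_sim < self.threshold:       # steps 201/204 -> 202
            self.clusters.append({"center": dict(vec), "sum": dict(vec), "size": 1})
            return len(self.clusters) - 1
        cluster = self.clusters[best]                        # step 204: join cluster
        for pos, val in vec.items():
            cluster["sum"][pos] = cluster["sum"].get(pos, 0.0) + val
        cluster["size"] += 1
        # Step 205: the new center is the element-wise mean of all members.
        cluster["center"] = {p: s / cluster["size"] for p, s in cluster["sum"].items()}
        return best

clusterer = OnlineClusterer(threshold=0.5)
first = clusterer.add({1: 1.0, 2: 1.0})
second = clusterer.add({1: 1.0, 2: 0.8})
third = clusterer.add({5: 2.0})
print(first, second, third, len(clusterer.clusters))
```

Keeping a running sum per cluster lets the mean center be updated without revisiting earlier texts, which fits the one-text-at-a-time processing described above.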
It should be noted that the calculation of the similarity value from the angle between two vectors has been described in detail above. The calculation of similarity is not limited to this, however; any other suitable measure of the similarity between a preprocessed text and an existing cluster can be used. For illustration, several alternative embodiments are given below.
The first alternative way of calculating the similarity value is Formula 3':
$$ \mathrm{Sim}(D_i, C_j) = F'(v_i, v_j) = \sum_{k=1}^{n} w_{ik} w_{jk} \qquad \text{(Formula 3')} $$
This formula calculates similarity from the vector inner product, where $F'(\cdot)$ is the function giving the inner product of two vectors, and the parameters $D_i$, $C_j$, $v_i$, $v_j$, $w_{ik}$, $w_{jk}$ have the same meanings as in Formula 3.
The second alternative way of calculating the similarity value is Formula 3'':
$$ \mathrm{Sim}(D_i, C_j) = F''(v_i, v_j) = \frac{2 \sum_{k=1}^{n} w_{ik} w_{jk}}{\sum_{k=1}^{n} w_{ik}^2 + \sum_{k=1}^{n} w_{jk}^2} \qquad \text{(Formula 3'')} $$
This formula calculates similarity from the Dice coefficient, where $F''(\cdot)$ is the function giving the Dice coefficient of two vectors, and the parameters $D_i$, $C_j$, $v_i$, $v_j$, $w_{ik}$, $w_{jk}$ have the same meanings as in Formulas 3 and 3'.
The third alternative way of calculating the similarity value is Formula 3''':
$$ \mathrm{Sim}(D_i, C_j) = F'''(v_i, v_j) = \frac{\sum_{k=1}^{n} w_{ik} w_{jk}}{\sum_{k=1}^{n} w_{ik}^2 + \sum_{k=1}^{n} w_{jk}^2 - \sum_{k=1}^{n} w_{ik} w_{jk}} \qquad \text{(Formula 3''')} $$
This formula calculates similarity from the Jaccard coefficient, where $F'''(\cdot)$ is the function giving the Jaccard coefficient of two vectors, and the parameters $D_i$, $C_j$, $v_i$, $v_j$, $w_{ik}$, $w_{jk}$ have the same meanings as in Formulas 3, 3', and 3''.
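The three alternative measures can be written compactly for sparse vectors, as in the sketch below. The function names are assumptions; returning 0.0 for an all-zero denominator is a convention of this sketch only.

```python
def inner_product(a, b):
    """Formula 3': sum of w_ik * w_jk over positions present in both vectors."""
    return sum(w * b[k] for k, w in a.items() if k in b)

def _squared_sum(a):
    return sum(w * w for w in a.values())

def dice(a, b):
    """Formula 3'': 2 * inner product / (sum of squares of a + sum of squares of b)."""
    denom = _squared_sum(a) + _squared_sum(b)
    return 2 * inner_product(a, b) / denom if denom else 0.0

def jaccard(a, b):
    """Formula 3''': inner product / (sum of squares of a + of b - inner product)."""
    dot = inner_product(a, b)
    denom = _squared_sum(a) + _squared_sum(b) - dot
    return dot / denom if denom else 0.0

a = {1: 1.0}
b = {1: 1.0, 2: 1.0}
print(inner_product(a, b), dice(a, b), jaccard(a, b))
```

Note that the inner product is unbounded while the Dice and Jaccard values reach 1 only for identical vectors, which is one reason a threshold tuned for one measure may not suit another.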
It should be noted that different ways of calculating the similarity value may require different similarity thresholds, and the criterion for deciding from the threshold whether a text belongs to a given cluster may also differ slightly. Based on the teaching here, however, those skilled in the art can implement the technical solution of the present invention using their existing knowledge, so no further details are given.
In this way, real-time online clustering can be performed on the text flow to obtain the corresponding clustering result.
Then, referring again to Fig. 1, events are identified based on the clustering result in step 103.
In one embodiment of the present invention, the clustering result can be obtained each time the system has run for a period of time, for example 5 minutes, 10 minutes, half an hour, 1 hour, or any other suitable time, and events can be identified from it. For example, large clusters in the clustering result can be identified as concentrated events, and smaller clusters or isolated points as new events. Concentrated events can also be processed further with recognition techniques, so that they are identified as hot events or objectionable information.
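The size-based identification step might look like the sketch below. The cut-off `large_size` is a hypothetical parameter; the source does not state how large a cluster must be to count as a concentrated event.

```python
def identify_events(cluster_sizes, large_size=50):
    """Split clusters into concentrated events (large) and new events (small or isolated)."""
    concentrated, new = [], []
    for cluster_id, size in cluster_sizes.items():
        if size >= large_size:
            concentrated.append(cluster_id)
        else:
            new.append(cluster_id)
    return concentrated, new

sizes = {"c1": 120, "c2": 3, "c3": 1}   # cluster id -> number of texts
concentrated, new = identify_events(sizes, large_size=50)
print(concentrated, new)
```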
In addition, the clustering result can preferably be processed further, for example to obtain the meaning represented by each cluster. According to one embodiment of the present invention, for each cluster, the k points (that is, k texts) closest to the center of the cluster are obtained first, and the keywords of the cluster are extracted from the semantics represented by these k points. Synonym processing can then be performed based on a thesaurus.
The identified events, together with the keywords and synonyms obtained by further processing the clustering result, can be presented in a suitable manner, for example so that network security administrators or other administrators of the web server can take further action.
In addition, according to another embodiment of the present invention, events can also be identified from the clustering result in real time, and the identification result can be presented to the network security administrators of the web server in real time or whenever it changes significantly.
The invention has been described in detail above with respect to a single text in the text flow. It should be pointed out, however, that in practice the text flow is continuous, and the steps of the invention operate like the stages of a pipeline: every stage is active at all times, and as soon as a stage finishes processing one text and passes its result to the next stage, it immediately begins processing the following text.
It can be seen that the technical solution provided by the invention differs entirely from the offline clustering of the prior art. First, it is a solution designed for text flows: it has high real-time performance and great scalability and is suited to processing any amount of text. Second, the method requires no manual participation at any point, needs no manually entered parameters, and needs no prior training. In addition, according to certain embodiments of the invention, it can provide clustering results that are interpretable and highly usable.
The present invention thus realizes a technical solution that identifies events from a text flow by processing the text flow in real time; it offers high flexibility, good real-time performance, and a fast response time. Moreover, the solution requires no human intervention during processing; it is an intelligent, adaptive solution that is particularly suitable for text flows generated on the Internet.
Next, will be used for from the equipment of text flow identification incident with reference to what figure 3 described according to a further aspect in the invention to be provided.
As shown in Figure 3, the equipment 300 that is used for the incident that detects from text flow can comprise pretreatment unit 310, and configuration is used for text flow is carried out real-time pre-service, to obtain the proper vector of each text in the text flow; Online clustering apparatus 320, configuration are used for to pretreated each text of process, based on the online cluster of proper vector executive real-time of said each text; And event recognition device 330, configuration is used for coming the identification incident based on said real-time online clustering result.
In an embodiment of the invention; Said pretreatment unit 310 further comprises: text is cut speech device 312; Configuration is used for each text of said text flow carried out cuts the speech operation obtaining the characteristic speech of said each text, thereby forms the characteristic vocabulary of the characteristic speech that comprises each text; And vector calculation device 314, configuration is used for the characteristic speech based on said characteristic vocabulary and said each text, calculates the proper vector of said each text.
In according to another embodiment of the present invention, said online clustering apparatus 320 further comprises: similarity calculation element 322, configuration are used to calculate through the similarity value of pretreated each text with existing type bunch; And text classification device 324, configuration is used for the similarity value based on said calculating, and said each text is sorted out; And center adjusting gear 326, configuration is used to adjust new type bunch center, uses when next text is carried out real-time cluster.
In a further embodiment of the present invention, said preprocessing unit 310 further comprises a vector processing unit 316, configured to extract the feature values and the corresponding feature value positions from the feature vector of each text in said text flow, so that only said feature values and said corresponding feature value positions are stored.
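The storage scheme described above amounts to a sparse representation of the feature vector. The sketch below assumes that the feature values worth storing are the non-zero entries; the function names `to_sparse` and `to_dense` are hypothetical.

```python
def to_sparse(vector):
    """Keep only the non-zero feature values and their positions."""
    positions = [index for index, value in enumerate(vector) if value != 0]
    values = [vector[index] for index in positions]
    return values, positions


def to_dense(values, positions, length):
    """Rebuild the full feature vector from the stored values and positions."""
    vector = [0] * length
    for value, index in zip(values, positions):
        vector[index] = value
    return vector


values, positions = to_sparse([0, 2, 0, 0, 1])
print(values, positions)                # [2, 1] [1, 4]
print(to_dense(values, positions, 5))   # [0, 2, 0, 0, 1]
```

Since a short text typically uses only a few words of a large feature vocabulary, storing value and position pairs in this way takes far less space than storing the full vector.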
In another embodiment of the present invention, said event recognition unit 330 is further configured to: identify large clusters formed by the clustering as concentrated events; and identify small clusters or isolated points formed by the clustering as new events.
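The recognition rule above can be sketched as a simple size test on the clusters. The description does not state how large a cluster must be to count as large, so the `large_threshold` parameter and its default value are assumptions made purely for illustration, as is the function name `identify_events`.

```python
def identify_events(cluster_sizes, large_threshold=100):
    """Label each cluster by size.

    cluster_sizes maps a cluster id to the number of texts it holds.
    large_threshold is a hypothetical cut-off, not a value from the source.
    """
    events = {}
    for cluster_id, size in cluster_sizes.items():
        if size >= large_threshold:
            events[cluster_id] = "concentrated event"
        else:
            # Small clusters and isolated points (size 1) are treated alike.
            events[cluster_id] = "new event"
    return events


sizes = {"c0": 250, "c1": 3, "c2": 1}
print(identify_events(sizes))
# {'c0': 'concentrated event', 'c1': 'new event', 'c2': 'new event'}
```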
In a further embodiment of the present invention, the device 300 may further comprise a meaning determination unit 340, configured to determine the meaning represented by each cluster in said result of the real-time online clustering. For example, keywords may be extracted and synonym processing may be performed.
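One possible reading of keyword extraction with synonym processing is sketched below. It assumes that synonyms are merged through a lookup table before counting and that the most frequent words of a cluster serve as its keywords; the `synonyms` table, the `top_k` parameter, and the function name `cluster_keywords` are hypothetical.

```python
from collections import Counter


def cluster_keywords(cluster_texts, synonyms=None, top_k=3):
    """Return the most frequent feature words of one cluster.

    cluster_texts is a list of texts, each given as a list of feature words.
    synonyms maps a word to the canonical form it should be counted under.
    """
    synonyms = synonyms or {}
    counts = Counter(
        synonyms.get(word, word)
        for words in cluster_texts
        for word in words
    )
    return [word for word, _ in counts.most_common(top_k)]


texts = [["quake", "city"], ["earthquake", "rescue", "city"], ["quake"]]
print(cluster_keywords(texts, synonyms={"earthquake": "quake"}, top_k=2))
# ['quake', 'city']
```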
For the specific operation of each unit in the device described above, reference may be made to the detailed description of the method of the present invention given in connection with Fig. 1 and Fig. 2.
Below, a computer device in which the present invention can be implemented will be described with reference to Fig. 4. Fig. 4 schematically shows a block diagram of a computer device capable of implementing embodiments of the present invention.
The computer system shown in Fig. 4 comprises a CPU (central processing unit) 401, a RAM (random access memory) 402, a ROM (read-only memory) 403, a system bus 404, a hard disk controller 405, a keyboard controller 406, a serial interface controller 407, a parallel interface controller 408, a display controller 409, a hard disk 410, a keyboard 411, a serial peripheral device 412, a parallel peripheral device 413 and a display 414. Of these components, the CPU 401, the RAM 402, the ROM 403, the hard disk controller 405, the keyboard controller 406, the serial interface controller 407, the parallel interface controller 408 and the display controller 409 are connected to the system bus 404. The hard disk 410 is connected to the hard disk controller 405; the keyboard 411 is connected to the keyboard controller 406; the serial peripheral device 412 is connected to the serial interface controller 407; the parallel peripheral device 413 is connected to the parallel interface controller 408; and the display 414 is connected to the display controller 409.
The block diagram shown in Fig. 4 is given for purposes of illustration only and does not limit the present invention. In some cases, devices may be added or removed as needed.
In addition, embodiments of the present invention may be implemented in software, in hardware, or in a combination of software and hardware. The hardware portion may be implemented using dedicated logic; the software portion may be stored in a memory and executed by a suitable instruction execution system, such as a microprocessor or specially designed hardware. Those of ordinary skill in the art will appreciate that the methods and systems described above may be implemented using computer-executable instructions and/or processor control code, for example code provided on a carrier medium such as a magnetic disk, CD or DVD-ROM, on a programmable memory such as a read-only memory (firmware), or on a data carrier such as an optical or electronic signal carrier. The system of this embodiment and its components may be implemented by hardware circuitry such as very-large-scale integrated circuits or gate arrays, semiconductors such as logic chips and transistors, or programmable hardware devices such as field-programmable gate arrays and programmable logic devices; by software executed by various types of processors; or by a combination of such hardware circuitry and software, for example firmware.
Although the present invention has been described with reference to the embodiments presently considered, it should be understood that the invention is not limited to the disclosed embodiments. On the contrary, the invention is intended to cover the various modifications and equivalent arrangements included within the spirit and scope of the appended claims. The scope of the following claims is to be accorded the broadest interpretation so as to encompass all such modifications and equivalent structures and functions.

Claims (12)

1. A method for detecting events from a text flow, comprising:
Performing real-time preprocessing on the text flow to obtain the feature vector of each text in the text flow;
Performing, for each preprocessed text, real-time online clustering based on the feature vector of said each text; and
Identifying events based on the result of said real-time online clustering.
2. The method according to claim 1, wherein said performing real-time preprocessing on the text flow comprises:
Performing a word segmentation operation on each text in said text flow to obtain the feature words of said each text, thereby forming a feature vocabulary that contains the feature words of said each text; and
Calculating the feature vector of said each text based on said feature vocabulary and the feature words of said each text.
3. The method according to claim 2, wherein said performing, for each preprocessed text, real-time online clustering based on the feature vector of said each text comprises:
Calculating the similarity value between said each preprocessed text and the existing clusters;
Classifying said each text based on said calculated similarity value; and
Adjusting the center of the new cluster for use when the next text is clustered in real time.
4. The method according to claim 2, wherein said performing real-time preprocessing on the text flow further comprises:
Extracting the feature values and the corresponding feature value positions from the feature vector of each text in said text flow, so that only said feature values and said corresponding feature value positions are stored.
5. The method according to claim 2, wherein said identifying events based on the result of the real-time online clustering comprises:
Identifying large clusters formed by the clustering as concentrated events; and
Identifying small clusters or isolated points formed by the clustering as new events.
6. The method according to claim 1, further comprising:
Determining the meaning represented by each cluster in said result of the real-time online clustering.
7. A device for detecting events from a text flow, comprising:
A preprocessing unit, configured to perform real-time preprocessing on the text flow to obtain the feature vector of each text in the text flow;
An online clustering unit, configured to perform, for each preprocessed text, real-time online clustering based on the feature vector of said each text; and
An event recognition unit, configured to identify events based on the result of said real-time online clustering.
8. The device according to claim 7, wherein said preprocessing unit comprises:
A word segmentation unit, configured to perform a word segmentation operation on each text in said text flow to obtain the feature words of said each text, thereby forming a feature vocabulary that contains the feature words of said each text; and
A vector calculation unit, configured to calculate the feature vector of said each text based on said feature vocabulary and the feature words of said each text.
9. The device according to claim 8, wherein said online clustering unit comprises:
A similarity calculation unit, configured to calculate the similarity value between said each preprocessed text and the existing clusters;
A text classification unit, configured to classify said each text based on said calculated similarity value; and
A center adjustment unit, configured to adjust the center of the new cluster for use when the next text is clustered in real time.
10. The device according to claim 8, further comprising a vector processing unit configured to:
Extract the feature values and the corresponding feature value positions from the feature vector of each text in said text flow, so that only said feature values and said corresponding feature value positions are stored.
11. The device according to claim 8, wherein said event recognition unit is configured to:
Identify large clusters formed by the clustering as concentrated events; and
Identify small clusters or isolated points formed by the clustering as new events.
12. The device according to claim 7, further comprising:
A meaning determination unit, configured to determine the meaning represented by each cluster in said result of the real-time online clustering.
CN201110035163XA 2011-01-30 2011-01-30 Method and device for detecting events from text flow Pending CN102622378A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201110035163XA CN102622378A (en) 2011-01-30 2011-01-30 Method and device for detecting events from text flow

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201110035163XA CN102622378A (en) 2011-01-30 2011-01-30 Method and device for detecting events from text flow

Publications (1)

Publication Number Publication Date
CN102622378A true CN102622378A (en) 2012-08-01

Family

ID=46562301

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201110035163XA Pending CN102622378A (en) 2011-01-30 2011-01-30 Method and device for detecting events from text flow

Country Status (1)

Country Link
CN (1) CN102622378A (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5442778A (en) * 1991-11-12 1995-08-15 Xerox Corporation Scatter-gather: a cluster-based method and apparatus for browsing large document collections
CN1822000A (en) * 2006-02-14 2006-08-23 北大方正集团有限公司 Method for automatic detecting news event
CN101187919A (en) * 2006-11-16 2008-05-28 北大方正集团有限公司 Method and system for abstracting batch single document for document set
CN101059805A (en) * 2007-03-29 2007-10-24 复旦大学 Network flow and delaminated knowledge library based dynamic file clustering method

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103116605A (en) * 2013-01-17 2013-05-22 上海交通大学 Method and system of microblog hot events real-time detection based on detection subnet
CN106886613A (en) * 2017-05-03 2017-06-23 成都云数未来信息科学有限公司 Parallelized text clustering method
CN106886613B (en) * 2017-05-03 2020-06-26 成都云数未来信息科学有限公司 Parallelized text clustering method
CN107315647A (en) * 2017-06-26 2017-11-03 广州视源电子科技股份有限公司 Outlier detection method and system
CN107609103A (en) * 2017-09-12 2018-01-19 电子科技大学 Twitter-based event detection method
WO2020107835A1 (en) * 2018-11-26 2020-06-04 平安科技(深圳)有限公司 Sample data processing method and device

Similar Documents

Publication Publication Date Title
CN103514183B (en) Information search method and system based on interactive document clustering
CN104376406B Enterprise innovation resource management and analysis method based on big data
CN103336766B (en) Short text garbage identification and modeling method and device
CN102073730B (en) Method for constructing topic web crawler system
CN107992596A Text clustering method, device, server and storage medium
US20220318275A1 (en) Search method, electronic device and storage medium
Wang Stock market forecasting with financial micro-blog based on sentiment and time series analysis
CN103324666A (en) Topic tracing method and device based on micro-blog data
CN107357777B (en) Method and device for extracting label information
CN102262765A (en) Method and device for publishing commodity information
CN111783468A (en) Text processing method, device, equipment and medium
CN110334268B (en) Block chain project hot word generation method and device
CN110334209A (en) File classification method, device, medium and electronic equipment
CN102622378A (en) Method and device for detecting events from text flow
Le et al. Aspect analysis for opinion mining of Vietnamese text
CN111309910A (en) Text information mining method and device
CN102073654A (en) Methods and equipment for generating and maintaining web content extraction template
CN109597995A Document representation method based on BM25-weighted word vectors
CN106569989A (en) De-weighting method and apparatus for short text
CN110309293A (en) Text recommended method and device
CN104794209A (en) Chinese microblog sentiment classification method and system based on Markov logic network
Song Sentiment analysis of Japanese text and vocabulary learning based on natural language processing and SVM
Liu et al. Internet news headlines classification method based on the n-gram language model
CN107656916A Anti-cheating method for massive documents based on the Simhash algorithm
Yajian et al. A short text classification algorithm based on semantic extension

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
C02 Deemed withdrawal of patent application after publication (patent law 2001)
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20120801