CN102426598A - Method for clustering Chinese texts for safety management of network content

Info

Publication number: CN102426598A
Application number: CN2011103501200A
Authority: CN
Inventors: 杨更
Original assignee: JUNGONGSIBO INFORMATION TECHNOLOGY INDUSTRY Co Ltd
Current assignee: JUNGONGSIBO INFORMATION TECHNOLOGY INDUSTRY Co Ltd
Priority date: 2011-11-08
Filing date: 2011-11-08
Publication date: 2012-04-25

Abstract

The invention relates to a new method for clustering Chinese texts based on network content analysis. Following the idea of density-based clustering, the number of clusters and the initial cluster centers are determined automatically; at the same time, the convergence criterion for the number of clusters is optimized and the complexity of the clustering algorithm is reduced. The number of clusters and the initial centers can thus be determined over the whole sample base, which ensures that the clustering is comprehensive, avoids excessive influence of human factors on the clustering result, and yields higher clustering accuracy and efficiency.

Description

A method for clustering Chinese texts for network content safety management

Technical field

The present invention relates to a method for clustering Chinese texts for network content safety management.

Background

Applications of network content safety management focus on research into text classification and text clustering. The purpose of both technologies is to group large-scale text data objects into a number of categories. Text clustering is an unsupervised machine learning method: its implementation does not require human involvement such as predefined document categories or manual category labeling. It is the main technical solution for effectively organizing, summarizing, and navigating massive amounts of text information, and it has become a research topic in the field of massive text information fusion. It provides significant technical support and practical value for important information content safety management applications such as the monitoring of online public opinion and the study and assessment of trends.

Traditional clustering methods can be divided into five main types: partitioning methods, hierarchical methods, density-based methods, grid-based methods, and model-based methods. The main representative algorithms of current text clustering are described below and their strengths and weaknesses are analyzed; on this basis, an improved algorithm is proposed.

A partitioning method starts from an initial grouping supplied manually and then changes the grouping by repeated iteration until a convergence criterion is met. Such an algorithm iterates quickly and can handle large amounts of data effectively, but it does not solve the problem of choosing the initial cluster centers, and the number of clusters cannot be determined accurately. It cannot find clusters of arbitrary shape, and the choice of initial cluster centers strongly affects the clustering result.

A hierarchical method decomposes the given data set level by level until a convergence criterion is met. This kind of clustering is relatively simple and its algorithmic complexity is relatively low, but it often runs into difficulty in choosing merge or split points; if these are not chosen well, the clustering result may be of low quality. In addition, the algorithm needs to examine and evaluate a large number of objects or clusters, so it is not suited to clustering massive data.

A density-based method adds a point to a nearby cluster as long as the density of points in its region exceeds a certain threshold. This filters out isolated "noise" points and finds clusters of any shape, but the method is very sensitive to user-defined parameters: different values of eps (the neighborhood radius) and MinPts (the minimum number of objects) strongly affect the final result and can lead to widely differing clusterings.

A grid-based method divides the data space into a grid structure with a finite number of cells, and all clustering operations are carried out on this grid structure (that is, the quantized space). Processing is fast: the processing time is independent of the number of data objects and depends only on the number of cells in each dimension of the quantized space. The clustering quality depends on the granularity of the lowest level of the grid structure. If the granularity is fine, the processing cost increases significantly; if it is too coarse, the quality of the text clustering analysis is reduced.

A model-based method attempts to optimize the fit between the given data and some mathematical model: it assumes a model for each cluster and seeks the best fit of the data to that model. In practice it converges quickly but may not reach the global optimum. For parameter optimization problems of certain given forms, convergence can be guaranteed. Its computational complexity depends linearly on d (the number of input features), n (the number of objects), and t (the number of iterations).

Summary of the invention

The purpose of the invention is to provide a method for clustering Chinese texts for network content safety management that ensures the clustering is comprehensive, avoids excessive influence of human factors on the clustering result, and at the same time achieves relatively high clustering accuracy and efficiency.

For ease of explanation, two definitions are given first:

Definition 1: The distance between two vectors is the Euclidean distance:

$$d(X, Y) = \sqrt{\sum_{k=1}^{p} (x_{ik} - y_{ik})^2}$$

where X = (x_{i1}, x_{i2}, ..., x_{ip}) and Y = (y_{i1}, y_{i2}, ..., y_{ip}) are two p-dimensional text vectors.
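As a minimal sketch of Definition 1, the Euclidean distance between two text vectors can be computed as follows. The function name `euclidean_distance` and the use of plain Python sequences for the vectors are assumptions made for illustration; the patent does not prescribe an implementation.

```python
import math


def euclidean_distance(x, y):
    """Euclidean distance between two p-dimensional vectors (Definition 1).

    `x` and `y` are sequences of numbers of equal length p.
    The function name is hypothetical, chosen for this illustration.
    """
    if len(x) != len(y):
        raise ValueError("both vectors must have the same dimension p")
    # Square root of the sum of squared coordinate differences.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))
```

For example, `euclidean_distance([0, 0], [3, 4])` evaluates to `5.0`.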

Definition 2: The mean distance between the samples is

$$R = \frac{1}{C_n^2} \sum_{1 \le i < j \le n} d(x_i, x_j)$$

where n is the total number of samples, $C_n^2 = n(n-1)/2$ is the number of ways of choosing two points from n points, and $d(x_i, x_j)$ is the distance between data objects x_i and x_j.
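Definition 2 can be sketched in a few lines: sum the distance over every unordered pair of samples and divide by the number of pairs. The function names `mean_pairwise_distance` and `separation_threshold` are assumptions for illustration; the threshold Φ = 2*R is the quantity the method's step 3 derives from the mean distance R.

```python
import math
from itertools import combinations


def mean_pairwise_distance(vectors):
    """Mean Euclidean distance R over all unordered pairs of samples."""
    n = len(vectors)
    if n < 2:
        raise ValueError("at least two samples are needed")
    # Sum d(x_i, x_j) over all pairs i < j, then divide by C(n, 2).
    total = sum(math.dist(a, b) for a, b in combinations(vectors, 2))
    return total / math.comb(n, 2)


def separation_threshold(vectors):
    """Threshold Phi = 2 * R used when picking initial centers."""
    return 2 * mean_pairwise_distance(vectors)
```

This version recomputes distances directly from the vectors; a real system would more likely reuse the distance matrix built in step 2.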

The method of the present invention for clustering Chinese texts for network content safety management comprises the following specific steps:

1. Take each document di in the document set D={d1, d2, ..., dn} as a cluster Ci={di} with a single member; these clusters form the clustering C={c1, c2, ..., cn} of D;

2. Using Definition 1, compute the Euclidean distance between every pair of clusters to form the distance matrix between the text vectors;

3. From the distance matrix, use Definition 2 to compute the mean distance between all clusters, denoted R, and set Φ=2*R;

4. For each cluster Ci={di}, draw a ball centered on it with radius R; the number of points falling within the ball is its density. Compute the density of each point;

5. Sort the points by density and denote the cluster of maximum density as C1;

6. Take cluster C1 as the first cluster center. Find a point whose distance from it is greater than Φ, that is, |C2-C1|>Φ, and take it as the 2nd cluster center; find a 3rd point with |C3-C1|>Φ and take it as the 3rd cluster center. Continue in this way until the whole document set D={d1, d2, ..., dn} has been searched. This yields k cluster centers, which determines both the number k and the centers Z1, Z2, ..., Zk;

7. Use the resulting k and the k cluster centers Z1, Z2, ..., Zk as the initial centers of the k-means algorithm, and iterate the k-means algorithm until the k cluster centers no longer change, which yields k clusters.

Combining the traditional k-means method with the improved selection of initial cluster centers in this way, each text vector is assigned according to its similarity to (distance from) the cluster centers, forming k mutually disjoint clusters in which similar vectors are grouped into the same class.
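The seven steps can be sketched as follows. This is an illustrative reading, not the patent's reference implementation, and it rests on several assumptions: the function names are hypothetical; a point's density counts the other points within distance R (the point itself is excluded and the boundary is included); candidates are visited in descending density order; and a candidate is accepted as a new center only if it lies farther than Φ from every center already accepted. The source writes the condition only relative to C1, so the last point is an interpretation.

```python
import math
from itertools import combinations


def select_initial_centers(vectors):
    """Steps 1-6: derive k and the initial centers from point density."""
    n = len(vectors)
    if n < 2:
        raise ValueError("at least two documents are needed")
    # Step 2: pairwise Euclidean distance matrix.
    dist = [[math.dist(a, b) for b in vectors] for a in vectors]
    # Step 3: mean pairwise distance R and threshold phi = 2 * R.
    r = sum(dist[i][j] for i, j in combinations(range(n), 2)) / math.comb(n, 2)
    phi = 2 * r
    # Step 4: density = number of other points inside the ball of radius R
    # (assumption: the point itself is not counted).
    density = [sum(1 for j in range(n) if j != i and dist[i][j] <= r)
               for i in range(n)]
    # Step 5: visit points from highest to lowest density (stable sort).
    order = sorted(range(n), key=lambda i: -density[i])
    # Step 6 (assumed reading): accept a point as a center only if it is
    # farther than phi from every center accepted so far.
    chosen = []
    for i in order:
        if all(dist[i][c] > phi for c in chosen):
            chosen.append(i)
    return [list(vectors[c]) for c in chosen]


def k_means(vectors, initial_centers, max_iter=100):
    """Step 7: plain k-means started from the given centers."""
    centers = [list(c) for c in initial_centers]
    labels = []
    for _ in range(max_iter):
        # Assign every vector to its nearest center.
        labels = [min(range(len(centers)),
                      key=lambda k: math.dist(v, centers[k]))
                  for v in vectors]
        # Recompute each center as the mean of its members.
        new_centers = []
        for k, old in enumerate(centers):
            members = [v for v, lab in zip(vectors, labels) if lab == k]
            if members:
                new_centers.append([sum(col) / len(members)
                                    for col in zip(*members)])
            else:
                new_centers.append(old)  # keep an empty cluster's center
        if new_centers == centers:  # centers no longer change
            break
        centers = new_centers
    return labels, centers
```

On a toy set of six points at x = 0..5 and two points at x = 100 and 101 (all with y = 0), `select_initial_centers` returns two seeds, one from each group, and `k_means` then converges to centers at x = 2.5 and x = 100.5. The `max_iter` cap is an added safeguard and is not part of the described method.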

By determining the number of clusters and the initial cluster centers automatically, following the idea of density-based clustering, while optimizing the convergence criterion for the number of clusters and reducing the complexity of the clustering algorithm, the method of the present invention can determine the number of clusters and the initial centers over the whole sample base. This ensures that the clustering is comprehensive and avoids excessive influence of human factors on the clustering result. The method also iterates quickly and handles large data sets effectively; in tests on clustering massive data, both precision and recall improved.

Description of the drawings

Fig. 1 is a structural diagram of an embodiment of the invention.

Embodiment

Based on the above scheme, a computer system using this clustering method is designed. The computer system comprises:

A text collection and input device, which inputs text information into the system and numbers the texts;

A text library, which stores the words, features, and vectorization results of the texts;

A text word segmentation device, which represents the sentences of a text as words;

A text feature extraction and vectorization device, which further vectorizes the text represented as words in preparation for clustering;

A text clustering device, which clusters the vectorized texts and finally puts the results in one-to-one correspondence with the text library to generate the text clustering result.

The text collection and input device is connected to the text word segmentation device; the text word segmentation device is connected to the text library and to the text feature extraction and vectorization device; and the text library and the text feature extraction and vectorization device are connected to the text clustering device. Clustering comprises the following steps:

1. The text word segmentation device processes the target texts uniformly and segments the text, which is expressed as sentences, into words;

2. The text feature extraction and vectorization device performs feature selection using the TF-IDF weighted term frequency algorithm commonly used in information retrieval research, adds a feature vector based on BNN, and vectorizes the documents using the general VSM (vector space model) approach;

3. The text clustering device clusters the text set using the method described in this patent and finally outputs the clustering result.
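The TF-IDF vectorization in step 2 can be sketched as below for documents that have already been segmented into words. The patent does not state which TF-IDF variant it uses, so the weighting here (term frequency normalized by document length, multiplied by the natural log of N/df) is an assumption, as is the function name `tfidf_vectors`. The BNN-based feature vector is not defined in the text and is therefore omitted.

```python
import math
from collections import Counter


def tfidf_vectors(tokenized_docs):
    """Turn segmented documents into TF-IDF vectors in a shared vocabulary.

    `tokenized_docs` is a list of word lists, i.e. the output of the word
    segmentation step. Returns (vocabulary, vectors).
    """
    n_docs = len(tokenized_docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter()
    for doc in tokenized_docs:
        df.update(set(doc))
    vocab = sorted(df)  # fixed term order defines the vector dimensions
    vectors = []
    for doc in tokenized_docs:
        counts = Counter(doc)
        total = len(doc) or 1  # avoid division by zero for empty documents
        vectors.append([
            (counts[term] / total) * math.log(n_docs / df[term])
            for term in vocab
        ])
    return vocab, vectors
```

The resulting vectors all share the same dimension, so they can be passed directly to a distance-based clustering routine. A term that occurs in every document receives weight 0 under this variant.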

Fig. 1 is a structural diagram of a preferred embodiment of the invention. Text obtained from user input or collected automatically from the Internet is fed into text input module 10. Module 20 performs word segmentation on the output of module 10 and passes the result to module 30. Module 30 performs text feature extraction and vectorization, stores the result in module 40, and also passes the result to module 50 as input. Module 50 performs the text clustering and finally delivers the result to user 70.

Claims

1. A method for clustering Chinese texts for network content safety management, characterized in that it comprises the following steps:

(1) Take each document di in the document set D={d1, d2, ..., dn} as a cluster Ci={di} with a single member; these clusters form the clustering C={c1, c2, ..., cn} of D;

(2) Compute the Euclidean distance between every pair of clusters in the following manner to form the distance matrix between the text vectors.

The distance between two vectors is the Euclidean distance:

$$d(X, Y) = \sqrt{\sum_{k=1}^{p} (x_{ik} - y_{ik})^2}$$

where X = (x_{i1}, x_{i2}, ..., x_{ip}) and Y = (y_{i1}, y_{i2}, ..., y_{ip}) are two p-dimensional text vectors;

(3) From the distance matrix, compute the mean distance between all clusters in the following manner, denote it R, and set Φ=2*R.

The mean distance between the samples is

$$R = \frac{1}{C_n^2} \sum_{1 \le i < j \le n} d(x_i, x_j)$$

where n is the total number of samples, $C_n^2 = n(n-1)/2$ is the number of ways of choosing two points from n points, and $d(x_i, x_j)$ is the distance between data objects x_i and x_j;

(4) For each cluster Ci={di}, draw a ball centered on it with radius R; the number of points falling within the ball is its density. Compute the density of each point;

(5) Sort the points by density and denote the cluster of maximum density as C1;

(6) Take cluster C1 as the first cluster center. Find a point whose distance from it is greater than Φ, that is, |C2-C1|>Φ, and take it as the 2nd cluster center; find a 3rd point with |C3-C1|>Φ and take it as the 3rd cluster center. Continue in this way until the whole document set D={d1, d2, ..., dn} has been searched. This yields k cluster centers, which determines both the number k and the centers Z1, Z2, ..., Zk;

(7) Use the resulting k and the k cluster centers Z1, Z2, ..., Zk as the initial centers of the k-means algorithm, and iterate the k-means algorithm until the k cluster centers no longer change, which yields k clusters.