CN102426598A - Method for clustering Chinese texts for safety management of network content - Google Patents

Info

Publication number
CN102426598A
Authority
CN
China
Prior art keywords
cluster
point
clustering
distance
text
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN2011103501200A
Other languages
Chinese (zh)
Inventor
杨更
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
JUNGONGSIBO INFORMATION TECHNOLOGY INDUSTRY Co Ltd
Original Assignee
JUNGONGSIBO INFORMATION TECHNOLOGY INDUSTRY Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by JUNGONGSIBO INFORMATION TECHNOLOGY INDUSTRY Co Ltd
Priority to CN2011103501200A
Publication of CN102426598A
Legal status: Pending

Abstract

The invention relates to a new method for clustering Chinese texts based on network content analysis. Following the idea of density-based clustering, the number of clusters and the initial cluster centers are determined automatically; at the same time, the convergence criterion for the number of clusters is optimized and the complexity of the clustering algorithm is reduced. The number of clusters and the initial centers can therefore be determined over the whole sample set, which ensures that the clustering is comprehensive, avoids excessive influence of human factors on the clustering result, and yields higher clustering accuracy and efficiency.

Description

Method for clustering Chinese texts for safety management of network content
Technical field
The present invention relates to a method for clustering Chinese texts for safety management of network content.
Background art
Applications in network content safety management focus on research into text classification and text clustering. Both techniques aim to group large-scale text data objects into a number of categories. Text clustering is an unsupervised machine learning method: its implementation needs little human involvement such as predefined document categories or manual category labelling. It is the main technical solution for effectively organizing, summarizing, and navigating massive amounts of text, and it has become a research topic in massive text information fusion. It offers significant technical support and practical value in important areas of information content safety management, such as the supervision of online public opinion and trend assessment.
Traditional clustering methods fall into five broad categories: partitioning methods, hierarchical methods, density-based methods, grid-based methods, and model-based methods. The main representative algorithms currently used for text clustering are described below, their strengths and weaknesses are analyzed, and an improved algorithm is proposed on that basis.
A partitioning method starts from a manually supplied initial grouping and then changes the grouping iteratively until a convergence criterion is met. The algorithm iterates quickly and handles large data sets effectively, but it does not solve the problem of choosing the initial cluster centers, and the number of clusters cannot be determined accurately. It cannot find clusters of arbitrary shape, and the choice of initial cluster centers strongly affects the clustering result.
A hierarchical method performs a hierarchical decomposition of the given data set until a convergence criterion is met. This kind of clustering is relatively simple, but it often runs into difficulty when selecting merge or split points. The algorithmic complexity is relatively small, but a poor choice of merge or split points may lead to low-quality clusters. The algorithm also has to examine and evaluate a large number of objects or clusters, so it is not suited to clustering massive data.
A density-based method adds a point to a nearby cluster whenever the density of points in its region exceeds a threshold. This filters out isolated "noise" points and finds clusters of arbitrary shape, but the method is very sensitive to user-defined parameters: different values of eps (the neighborhood radius) and MinPts (the minimum number of objects) strongly affect the final result and can produce widely differing clusterings.
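The sensitivity to eps and MinPts described above can be illustrated with a small Python sketch. It only counts which points qualify as dense ("core") points under two parameter settings; it is not a full density-based clustering algorithm. The function name `core_points`, the sample coordinates, and the convention that a point counts itself as one of its own neighbors are assumptions made for illustration.

```python
import math

def core_points(points, eps, min_pts):
    """Return indices of points whose eps-neighborhood holds at least min_pts points.

    Assumption: the point itself is counted as one of its neighbors.
    """
    cores = []
    for i, p in enumerate(points):
        neighbours = sum(1 for q in points if math.dist(p, q) <= eps)
        if neighbours >= min_pts:
            cores.append(i)
    return cores

# Hypothetical 2-D sample: a tight group, a pair, and an isolated point.
points = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (9, 9)]

tight = core_points(points, eps=1.0, min_pts=3)  # small neighborhood radius
loose = core_points(points, eps=6.0, min_pts=3)  # large neighborhood radius
print(tight, loose)
```

With the small radius only one point is dense enough to seed a cluster, while with the large radius every point qualifies, so the same data would be grouped very differently.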
A grid-based method divides the data space into a grid structure with a finite number of cells, and all clustering operations are carried out on this grid (that is, the quantized space). Processing is fast: the processing time is independent of the number of data objects and depends only on the number of cells in each dimension of the quantized space. Clustering quality depends on the granularity of the lowest grid level. If the granularity is fine, the processing cost rises significantly; if it is too coarse, the quality of the text clustering analysis drops.
A model-based method tries to optimize the fit between the given data and some mathematical model: it assumes a model for each cluster and seeks the best fit of the data to that model. In practice it converges quickly but may not reach the global optimum. For certain forms of the optimization function, convergence is guaranteed. Its computational complexity depends linearly on d (the number of input features), n (the number of objects), and t (the number of iterations).
Summary of the invention
The object of the invention is to provide a method for clustering Chinese texts for safety management of network content that ensures comprehensive clustering, avoids excessive influence of human factors on the clustering result, and at the same time achieves relatively high clustering accuracy and efficiency.
For ease of explanation, two definitions are given first:
Definition 1: the distance between two vectors is the Euclidean distance:
$$d(X, Y) = \sqrt{\sum_{k=1}^{p} (x_{ik} - y_{ik})^2}$$
where X=(xi1, xi2, ..., xip) and Y=(yi1, yi2, ..., yip) are two p-dimensional text vectors.
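As an illustration of Definition 1, the following minimal Python sketch computes the Euclidean distance between two text vectors and builds the symmetric matrix of pairwise distances. The function names `euclidean_distance` and `distance_matrix` are illustrative and do not come from the patent.

```python
import math

def euclidean_distance(x, y):
    """Euclidean distance between two p-dimensional vectors (Definition 1)."""
    if len(x) != len(y):
        raise ValueError("both vectors must have the same dimension p")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def distance_matrix(vectors):
    """Symmetric n x n matrix of pairwise Euclidean distances."""
    n = len(vectors)
    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            # The distance is symmetric, so each pair is computed once.
            dist[i][j] = dist[j][i] = euclidean_distance(vectors[i], vectors[j])
    return dist

print(euclidean_distance((0, 0), (3, 4)))
```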
Definition 2: the mean distance between samples:
$$R = \frac{1}{C_n^2} \sum_{1 \le i < j \le n} d(x_i, x_j)$$
where n is the total number of samples, $C_n^2 = n(n-1)/2$ is the number of ways of choosing two points from n points, and $d(x_i, x_j)$ is the distance between data objects $x_i$ and $x_j$.
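A short Python sketch of Definition 2 follows: it averages the Euclidean distance over all unordered pairs of samples. The function name `mean_pairwise_distance` and the sample coordinates are hypothetical; the threshold Φ = 2*R is the one used in the steps of the method.

```python
import math
from itertools import combinations

def mean_pairwise_distance(vectors):
    """Mean Euclidean distance over all C(n, 2) unordered pairs (Definition 2)."""
    n = len(vectors)
    if n < 2:
        raise ValueError("at least two samples are needed")
    pair_count = n * (n - 1) // 2  # number of ways to choose 2 points from n
    total = sum(math.dist(a, b) for a, b in combinations(vectors, 2))
    return total / pair_count

# Hypothetical samples; the three pairwise distances are 5, 10 and 5.
samples = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]
R = mean_pairwise_distance(samples)
phi = 2 * R  # separation threshold derived from R
print(R, phi)
```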
The method of the invention for clustering Chinese texts for safety management of network content comprises the following steps:
1. Treat each document di in the document set D={d1, d2, ..., dn} as a single-member cluster Ci={di}; these clusters form the clustering C={c1, c2, ..., cn} of D;
2. Using Definition 1, compute the Euclidean distance between every pair of clusters to form the distance matrix of the text vectors;
3. From the distance matrix, use Definition 2 to compute the mean distance between all clusters, denoted R, and set Φ=2*R;
4. Centered on each cluster Ci={di}, draw a ball of radius R; the number of points falling inside the ball is the density. Compute the density of every point;
5. Sort the points by sample density and take the cluster with the highest density as C1;
6. Take cluster C1 as the first cluster center. Find a point whose distance is greater than Φ, that is, |C2-C1|>Φ, and take it as the 2nd cluster center; find a third point with |C3-C1|>Φ and take it as the 3rd cluster center; continue in this way until the whole document set D={d1, d2, ..., dn} has been searched. k cluster centers are found in this manner, which determines the number k and the centers Z1, Z2, ..., Zk;
7. Use the resulting K and the K cluster centers Z1, Z2, ..., Zk as the initial centers of the K-means algorithm, and iterate K-means until the K cluster centers no longer change, which yields K clusters.
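Steps 2 to 6 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patent's reference implementation: the density of a point excludes the point itself, the ball boundary is inclusive, candidates are visited in descending order of density, and a candidate becomes a new center only if it lies farther than Φ from every center chosen so far (the text states the condition relative to C1 only). The function name `select_initial_centers` and the sample data are hypothetical.

```python
import math
from itertools import combinations

def select_initial_centers(vectors):
    """Return (center indices, R, phi) following steps 2 to 6 of the method."""
    n = len(vectors)
    if n < 2:
        raise ValueError("at least two documents are needed")
    # Step 2: pairwise Euclidean distance matrix.
    dist = [[math.dist(a, b) for b in vectors] for a in vectors]
    # Step 3: mean pairwise distance R and threshold phi = 2 * R.
    R = sum(dist[i][j] for i, j in combinations(range(n), 2)) / (n * (n - 1) // 2)
    phi = 2 * R
    # Step 4: density = number of other points inside the ball of radius R.
    density = [sum(1 for j in range(n) if j != i and dist[i][j] <= R)
               for i in range(n)]
    # Step 5: visit points from highest to lowest density (stable for ties).
    order = sorted(range(n), key=lambda i: -density[i])
    # Step 6 (assumption): accept a point only if it is farther than phi
    # from every center already selected.
    centers = []
    for i in order:
        if all(dist[i][c] > phi for c in centers):
            centers.append(i)
    return centers, R, phi

# Hypothetical data: a dense group near the origin and a small distant group.
docs = [(0, 0), (1, 0), (0, 1), (1, 1), (0.5, 0.5), (0.5, 0),
        (30, 30), (31, 30)]
centers, R, phi = select_initial_centers(docs)
print(centers, round(R, 2), round(phi, 2))
```

The number of selected indices gives k, and the corresponding vectors serve as the initial centers Z1, ..., Zk for the K-means iteration in step 7.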
Combining the traditional K-means method with this improved selection of initial cluster centers assigns each text vector according to its distance from (that is, its similarity to) the cluster centers, forming K mutually disjoint clusters in which similar vectors are gathered in the same class.
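The K-means iteration of step 7 can be sketched in Python as below, starting from centers supplied by the caller and stopping when the centers no longer change. The function name `kmeans_from_centers`, the `max_iter` safety cap, and the choice to leave the center of an empty cluster unchanged are assumptions for illustration and are not specified in the patent.

```python
import math

def kmeans_from_centers(vectors, initial_centers, max_iter=100):
    """Plain K-means seeded with given centers; returns (labels, centers)."""
    centers = [tuple(float(v) for v in c) for c in initial_centers]
    labels = []
    for _ in range(max_iter):
        # Assignment: each vector joins the cluster of its nearest center.
        labels = [min(range(len(centers)), key=lambda k: math.dist(v, centers[k]))
                  for v in vectors]
        # Update: each center moves to the mean of its members.
        new_centers = []
        for k in range(len(centers)):
            members = [v for v, lab in zip(vectors, labels) if lab == k]
            if members:
                dim = len(members[0])
                new_centers.append(tuple(sum(m[d] for m in members) / len(members)
                                         for d in range(dim)))
            else:
                new_centers.append(centers[k])  # assumption: keep an empty cluster's center
        if new_centers == centers:  # centers no longer change
            break
        centers = new_centers
    return labels, centers

# Hypothetical vectors and seed centers.
vectors = [(0, 0), (0, 2), (10, 0), (10, 2)]
labels, centers = kmeans_from_centers(vectors, [(0, 0), (10, 0)])
print(labels, centers)
```

Because the clusters are defined by nearest-center assignment, every vector belongs to exactly one cluster, which matches the mutually disjoint clusters described above.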
The method of the invention determines the number of clusters and the initial cluster centers automatically, following the idea of density-based clustering, while optimizing the convergence criterion for the number of clusters and reducing the complexity of the clustering algorithm. The number of clusters and the initial centers can thus be determined over the whole sample set, which ensures comprehensive clustering and avoids excessive influence of human factors on the clustering result. The method also iterates quickly and handles large data sets effectively; in tests on clustering massive data, both precision and recall improve.
Description of the drawings
Fig. 1 is a structural diagram of an embodiment of the invention.
Detailed description of the embodiments
Based on the above scheme, a computer system that uses this clustering method is designed. The computer system comprises:
A text collection and input device, which feeds text information into the system and numbers the texts;
A text library, which stores the words, features, and vectorization results of the texts;
A text word-segmentation device, which represents the sentences of a text as words;
A text feature extraction and vectorization device, which further vectorizes the texts represented as words in preparation for clustering;
A text clustering device, which clusters the vectorized texts, finally maps the result one-to-one back to the texts in the text library, and generates the text clustering result.
The text collection and input device is connected to the text word-segmentation device; the text word-segmentation device is connected to the text library and to the text feature extraction and vectorization device; the text library and the text feature extraction and vectorization device are connected to the text clustering device. Clustering comprises the following steps:
1. The text word-segmentation device processes the target texts uniformly and segments the sentence-level text into words;
2. The text feature extraction and vectorization device performs feature selection with TF-IDF, the weighted term-frequency algorithm commonly used in information retrieval research, adds feature vectors based on BNN, and vectorizes the documents using the general VSM (vector space model) approach;
3. The text clustering device clusters the text set using the method described in this patent and finally outputs the clustering result;
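The TF-IDF vectorization in step 2 can be sketched in Python as follows. The sketch assumes the documents have already been segmented into words by the word-segmentation device, and it uses one common TF-IDF variant (term frequency normalized by document length, inverse document frequency as log(N/df)); the patent does not state which variant is used, and the BNN-based features are not covered here. The function name `tfidf_vectors` and the sample tokens are hypothetical.

```python
import math
from collections import Counter

def tfidf_vectors(tokenized_docs):
    """Map pre-segmented documents to TF-IDF vectors over a shared vocabulary."""
    vocab = sorted({term for doc in tokenized_docs for term in doc})
    n_docs = len(tokenized_docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in tokenized_docs for term in set(doc))
    vectors = []
    for doc in tokenized_docs:
        counts = Counter(doc)
        total = len(doc) or 1  # avoid division by zero for empty documents
        vectors.append([(counts[t] / total) * math.log(n_docs / df[t])
                        for t in vocab])
    return vocab, vectors

# Hypothetical output of a word segmenter for two short texts.
docs = [["网络", "安全"], ["网络", "舆情"]]
vocab, vectors = tfidf_vectors(docs)
print(vocab)
print(vectors)
```

A term that occurs in every document receives weight zero under this variant, so the resulting vectors emphasize the words that distinguish one text from another before they are passed to the clustering step.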
Referring to Fig. 1, which is a structural diagram of a preferred embodiment of the invention: text entered by the user or collected automatically from the internet is fed into module 10; module 20 performs word segmentation on the input of module 10 and passes the result to module 30; module 30 performs text feature extraction and vectorization, stores the result in module 40, and also passes the result to module 50 as input. Module 50 performs the text clustering and finally submits the result to the user 70.

Claims (1)

1. A method for clustering Chinese texts for safety management of network content, characterized in that it comprises the following steps:
(1) treating each document di in the document set D={d1, d2, ..., dn} as a single-member cluster Ci={di}, these clusters forming the clustering C={c1, c2, ..., cn} of D;
(2) computing the Euclidean distance between every pair of clusters as follows to form the distance matrix of the text vectors,
the distance between two vectors being the Euclidean distance:
$$d(X, Y) = \sqrt{\sum_{k=1}^{p} (x_{ik} - y_{ik})^2}$$
where X=(xi1, xi2, ..., xip) and Y=(yi1, yi2, ..., yip) are two p-dimensional text vectors;
(3) from the distance matrix, computing the mean distance between all clusters as follows, denoted R, and setting Φ=2*R,
the mean distance between samples being
$$R = \frac{1}{C_n^2} \sum_{1 \le i < j \le n} d(x_i, x_j)$$
where n is the total number of samples, $C_n^2 = n(n-1)/2$ is the number of ways of choosing two points from n points, and $d(x_i, x_j)$ is the distance between data objects $x_i$ and $x_j$;
(4) centered on each cluster Ci={di}, drawing a ball of radius R, the number of points falling inside the ball being the density, and computing the density of every point;
(5) sorting the points by sample density and taking the cluster with the highest density as C1;
(6) taking cluster C1 as the first cluster center; finding a point whose distance is greater than Φ, that is, |C2-C1|>Φ, and taking it as the 2nd cluster center; finding a third point with |C3-C1|>Φ and taking it as the 3rd cluster center; continuing in this way until the whole document set D={d1, d2, ..., dn} has been searched, k cluster centers being found in this manner, which determines the number k and the centers Z1, Z2, ..., Zk;
(7) using the resulting K and the K cluster centers Z1, Z2, ..., Zk as the initial centers of the K-means algorithm, and iterating K-means until the K cluster centers no longer change, which yields K clusters.
CN2011103501200A 2011-11-08 2011-11-08 Method for clustering Chinese texts for safety management of network content Pending CN102426598A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN2011103501200A CN102426598A (en) 2011-11-08 2011-11-08 Method for clustering Chinese texts for safety management of network content

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN2011103501200A CN102426598A (en) 2011-11-08 2011-11-08 Method for clustering Chinese texts for safety management of network content

Publications (1)

Publication Number Publication Date
CN102426598A true CN102426598A (en) 2012-04-25

Family

ID=45960578

Family Applications (1)

Application Number Title Priority Date Filing Date
CN2011103501200A Pending CN102426598A (en) 2011-11-08 2011-11-08 Method for clustering Chinese texts for safety management of network content

Country Status (1)

Country Link
CN (1) CN102426598A (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101923712A (en) * 2010-08-03 2010-12-22 苏州大学 Particle swarm optimization-based gene chip image segmenting method of K-means clustering algorithm
US20110055212A1 (en) * 2009-09-01 2011-03-03 Cheng-Fa Tsai Density-based data clustering method

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
杨鑫华: "基于密度半径自适应选择的K-均值聚类算法", 《大连交通大学学报》 *

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102916426A (en) * 2012-09-20 2013-02-06 中国电力科学研究院 Method for grouping small-interference steady generator sets based on data clustering, and system thereof
CN102916426B (en) * 2012-09-20 2015-01-21 中国电力科学研究院 Method for grouping small-interference steady generator sets based on data clustering, and system thereof
CN104636498A (en) * 2015-03-08 2015-05-20 河南理工大学 Three-dimensional fuzzy clustering method based on information bottleneck theory
CN107292193A (en) * 2017-05-25 2017-10-24 北京北信源软件股份有限公司 A kind of method and system for realizing leakage prevention
CN109389172A (en) * 2018-10-11 2019-02-26 中南大学 A kind of radio-signal data clustering method based on printenv grid
CN109389172B (en) * 2018-10-11 2022-05-20 中南大学 Radio signal data clustering method based on non-parameter grid
CN110244186A (en) * 2019-07-08 2019-09-17 国网天津市电力公司 A kind of cable fault prediction and alarm method based on Algorithm of Outliers Detection
CN110244186B (en) * 2019-07-08 2020-09-01 国网天津市电力公司 Cable fault prediction alarm method based on isolated point detection algorithm

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C02 Deemed withdrawal of patent application after publication (patent law 2001)
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20120425