CN102523241B - Method and device for classifying network traffic on line based on decision tree high-speed parallel processing

Publication number
CN102523241B
Authority
CN
China
Prior art keywords
flow
decision tree
module
classification
packet
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201210006268.7A
Other languages
Chinese (zh)
Other versions
CN102523241A (en
Inventor
顾仁涛
许艳红
纪越峰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing University of Posts and Telecommunications
Original Assignee
Beijing University of Posts and Telecommunications

Abstract

The invention relates to a method and a device for online classification of network traffic based on high-speed parallel decision tree processing. The method comprises the following steps: collecting real traffic data in advance, separating it into flows and classifying the flows manually; extracting packet features from the resulting set of Transmission Control Protocol (TCP) flows; building a decision tree classification model; converting the data structure of the decision tree; separating the packets to be classified into flows and checking whether each flow is already classified; tagging the current packet; extracting the packet features of the TCP flow to be classified; and searching the decision tree. The device comprises a decision tree construction module, a structure conversion module, a classification result processing module, a medium access control (MAC) layer processing module, a packet polling management module, a flow separation and judgment module, a traffic information extraction and tagging module, and a decision tree search module. The method and the device have low algorithmic complexity, high processing speed, and high classification accuracy and stability. They support online classification and can be used in equipment and systems that require online traffic classification in a high-speed backbone network.

Description

Method and device for online classification of network traffic based on high-speed parallel decision tree processing
Technical field
The present invention relates to a method and a device for online classification of network traffic, and in particular to a method and a device that classify TCP flows online using a high-speed parallel decision tree processing strategy. It belongs to the field of communication technology.
Background art
Network technology is developing ever faster, and network applications are becoming more numerous and more complex. These applications not only consume a growing share of network resources but also pose a serious threat to QoS and network security. Against this background, providing Internet users with a safe, reliable and efficient environment, and detecting and avoiding abnormal network traffic, are major problems in network management. To address them, network researchers have proposed strategies such as traffic scheduling and capacity planning to improve the operating efficiency of networks. However, whether the goal is to expand the capacity of an existing network or to perform QoS scheduling, the applications present in the traffic (such as P2P, Web, IM and video) must first be classified and identified accurately. Accurate traffic classification is also very important in fields such as network security, traffic-based charging and application trend analysis. With the rapid spread of wired broadband and 3G/4G, traffic classification, as an effective tool for fine-grained network management, has broad application prospects.
Traditional traffic classification relies mainly on transport-layer port information. In recent years, however, as Internet bandwidth has kept increasing and application-layer protocols have become more complex and diverse, the correlation between network applications and ports has weakened. Port masquerading and dynamic ports make port-based methods unable to keep up with the development of technologies and applications, so new theories and techniques are urgently needed to uncover the intrinsic characteristics of network applications. To cope with the huge volume of Internet traffic data and the dynamically changing nature of applications, applying machine learning to traffic classification has become an emerging research focus in network measurement. Examples include the naive Bayes algorithm, improved Bayesian algorithms, decision tree algorithms, the KNN algorithm, support vector machines, neural networks and various clustering algorithms. Machine-learning-based traffic classification does not rely on transport-layer port numbers or on payload parsing to identify network applications. Instead, it uses statistical features that a flow exhibits during transmission, such as packet length and inter-packet time. The approach is therefore unaffected by port masquerading, dynamic ports, payload encryption and even network address translation, and it improves markedly on the earlier methods in both classification performance and flexibility.
However, current research on traffic classification still falls far short of what services require. The main reason is that most techniques classify offline and cannot classify in real time, which limits the use of traffic classification in high-speed backbone networks.
To meet the needs of current and future high-speed backbone networks, a traffic classification technique should satisfy the following requirements: 1) high classification accuracy, without relying on ports or payload as the main identifying features; 2) low algorithmic complexity, with a design that can be parallelized and is easy to implement in hardware (such as an FPGA, CPLD or ASIC), so that network traffic can be classified online at high speed; 3) good classification stability, so that the technique remains applicable in complex and changing network environments.
Summary of the invention
The invention provides a method and a device that classify TCP flows online using a high-speed parallel decision tree processing strategy. They classify network traffic in real time at high speed, with good stability and high accuracy.
To achieve the above object, the present invention adopts the following technical solution:
A method for classifying TCP flows online using a high-speed parallel decision tree processing strategy, characterized in that it comprises the following steps:
Step 1, collecting, separating and manually classifying real traffic data in advance: collect a data set of real network traffic, use the five-tuple to separate the data set into individual TCP flows, and manually classify the set of TCP flows so that each TCP flow corresponds to one protocol type.
Step 2, extracting packet features from the TCP flow set collected in advance: extract features of the packets in each TCP flow and build a preliminary feature sequence following the order of the packets within the flow; then apply a feature selection algorithm to the preliminary features, keeping the packet features that best reflect the traffic class, to form the final feature sequence.
Step 3, building the decision tree classification model: build a decision tree from the final feature sequences formed in step 2 using a decision tree algorithm.
Step 4, the decision tree of setting up in step 3 is carried out data structure conversion and stores hardware device into (as FPGA, CPLD, ASIC etc.) memory device (as RAM, ROM, FLASH etc.) in: by decision-making traversal of tree, extract on the one hand the middle node point value of decision tree, each middle node point value to same attribute carries out sequence from small to large, then each middle node point value of all properties is carried out to coding from small to large in order, extract on the other hand the fringe node value of decision tree, edge nodal value is equally also encoded, the coding of fringe node value is a scope, depend on the encoded radio that arrives each intermediate node of experiencing of this fringe node.Middle node point value and coding thereof and fringe node value and coding thereof are stored in respectively in the memory device (as RAM, ROM, FLASH etc.) of two separation.Wherein, before traffic classification device is used for network traffics online classification, sets up decision tree and decision tree is carried out to data structure conversion in the mode of processed offline.
Step 5, separating the packets to be classified into flows and checking their classification: divide the packets into flows by five-tuple and look up the flow information table to obtain classification information. The flow information table records the five-tuple of each flow and the class of that flow. It only needs to hold records of classified flows, not of unclassified ones, so a flow with no record can immediately be judged unclassified, which saves lookup time.
Step 6, tagging the current packet and extracting the packet features of the TCP flow to be classified: use the classification information obtained in step 5 to tag every packet that passes through. If the flow to which a packet belongs has been classified, attach the corresponding class label; if not, attach a default label according to a predefined rule, then decide whether packet features need to be extracted from this packet and process it accordingly. The packet features extracted here correspond to the final feature sequence adopted in step 2. They are extracted in packet arrival order to build the feature sequence of the flow to be classified. Packets are ordered by their arrival time at the observation point, and the first request packet of the three-way handshake, the SYN packet, is taken as the first packet of the flow. The feature sequence of the flow to be classified is stored in a parameter table; each record of the parameter table contains the five-tuple, the packet feature values and a flag indicating whether the parameters are complete. Classified packets that carry the correct label and unclassified packets that carry the default label and require parameter extraction must be clock-synchronized, and pipeline stages of suitable depth are inserted so that the FIFOs on the packet transmission path do not overflow.
Step 7, searching the decision tree: use the feature sequence of the flow to be classified obtained in step 6 to search the two memory devices (such as RAM, ROM or FLASH) obtained in step 4, determine the class label of the TCP flow, and update the flow information table. The search uses a parallel processing strategy and completes in only two clock cycles. In the first clock cycle, all internal node values of all attributes are compared in parallel to determine the internal node codes to which the flow belongs, and these codes are merged into a single data word. In the second clock cycle, this result is compared in parallel with the codes of all leaf nodes, which determines the class of the flow.
An online traffic classification device based on a hardware device (such as an FPGA, CPLD or ASIC), for carrying out the above method of classifying TCP flows online using a high-speed parallel decision tree processing strategy, comprises an online part and an offline part. The offline part comprises a decision tree construction module, a decision tree structure conversion module and a classification result processing module. The online part comprises, connected in sequence, MAC layer processing module one, a packet polling management module, a flow separation and judgment module, a traffic information extraction and tagging module, a decision tree search module and MAC layer processing module two. The decision tree structure conversion module of the offline part is connected to the decision tree search module of the online part, and the classification result processing module of the offline part is connected indirectly to the decision tree search module of the online part through a flow information table. The online part uses pipelined processing.
The decision tree construction module builds the decision tree model from real network traffic collected in advance.
The decision tree structure conversion module converts the structure of the decision tree model into another data structure that is easy to implement in hardware and stores it in a memory device (such as RAM, ROM or FLASH) of the hardware device (such as an FPGA, CPLD or ASIC). It traverses the decision tree. On the one hand, it extracts the internal node values of the tree, sorts the internal node values of each attribute in ascending order, and then encodes the internal node values of all attributes in that ascending order. On the other hand, it extracts the leaf node values of the tree and encodes them as well; the code of a leaf node value is a range that depends on the codes of the internal nodes traversed on the path to that leaf. The internal node values with their codes, and the leaf node values with their codes, are stored in two separate memory devices (such as RAM, ROM or FLASH).
The MAC layer processing modules can be implemented with one of the many off-the-shelf IP cores now available.
The packet polling management module reads packets from N packet buffer queues. It accesses the input queues by polling and moves on to the next queue only after a complete packet has been read from the current queue.
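The polling behaviour of this module can be pictured with a small software sketch. It is only an illustration under stated assumptions: the function name `poll_queues` is hypothetical, each queue element stands for one complete packet, and the loop stops once every queue is empty, which a hardware module would not do.

```python
from collections import deque

def poll_queues(queues):
    """Yield (queue_index, packet) by visiting the N input queues in turn.

    Each deque element stands for one complete packet, so a queue is left
    only after a whole packet has been read from it.
    """
    n = len(queues)
    idx = 0
    idle = 0  # number of consecutive empty queues seen
    while idle < n:
        if queues[idx]:
            yield idx, queues[idx].popleft()
            idle = 0
        else:
            idle += 1
        idx = (idx + 1) % n  # move on to the next queue

# Three buffer queues; the second one is empty.
queues = [deque(["a1", "a2"]), deque(), deque(["c1"])]
order = list(poll_queues(queues))
```

Empty queues are skipped, so a busy queue never blocks the others for longer than one packet.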
The flow separation and judgment module divides packets into flows by five-tuple, determines whether each flow has been classified and, if so, what its class label is, and maintains the flow information table. The flow information table records the five-tuple of each flow and the class of that flow. Only the decision tree search module updates the flow information table; no other module may write to it.
The traffic information extraction and tagging module extracts flow features from unclassified packets and tags all packets. It maintains a parameter table that records, for each flow, the five-tuple, the feature values and a flag indicating whether the feature values are complete. The parameters of flows whose feature values are complete are sent to the decision tree search module.
The decision tree search module uses the feature sequence of the flow to be classified, sent by the traffic information extraction and tagging module, to search the two memory devices (such as RAM, ROM or FLASH) produced by the decision tree structure conversion module, determines the class label of the TCP flow, and updates the flow information table. The search uses a parallel processing strategy and completes in only two clock cycles. In the first clock cycle, all internal node values of all attributes are searched in parallel to determine the internal node codes to which the flow belongs, and these codes are merged into a single data word. In the second clock cycle, this result is used to search the codes of all leaf nodes in parallel, which determines the class of the flow. The classification result is then sent to the flow information table to update its records.
The classification result processing module aggregates, processes and displays the traffic classification results.
Therefore, the traffic classification method and device provided by the invention have the following advantages: the structure of the decision tree is converted into a data structure that is easy to implement in hardware, which reduces the complexity of the algorithm; parallel search and pipelining are used in the decision tree search, which increases processing speed; the selected packet features are simple to extract and easy to obtain online; and the method inherits the high accuracy and good stability of decision trees. In short, low algorithmic complexity, an efficient online hardware implementation, and accurate and stable classification results are the main features of the invention.
Brief description of the drawings
To illustrate the invention more clearly, the drawings used in describing the embodiments are briefly introduced below. The drawings described below are only a flow chart of the traffic classification method and a structural diagram of the traffic classification device of the invention; a person of ordinary skill in the art can derive further drawings from them without creative effort.
Fig. 1 is a flow chart of the traffic classification method provided by an embodiment of the invention;
Fig. 2 is a structural diagram of the traffic classification device provided by an embodiment of the invention.
Detailed description of the embodiments
The technical solutions and the device in the embodiments of the invention are described clearly and completely below with reference to the drawings. The embodiment uses the C4.5 decision tree algorithm as an example, but the method applies equally to other decision tree algorithms. The embodiment is implemented on an FPGA with RAM as the memory device, and the method applies equally to other hardware devices (such as FPGA, CPLD or ASIC) and memory devices (such as RAM, ROM or FLASH). Obviously, the described embodiments are only some, not all, of the embodiments of the invention. All other embodiments obtained by a person of ordinary skill in the art from these embodiments without creative effort fall within the scope of protection of the invention.
Fig. 1 is a flow chart of the traffic classification method provided by an embodiment of the invention. The method comprises:
S101: Collect several network traffic data sets at different times and from different locations, then separate each data set into flows and classify the flows manually.
A network traffic classifier is generally deployed in a particular network. The decision tree algorithm must be trained on real traffic from that network to build the decision tree classification model, so a network probe is placed in the network where the device will be deployed to capture real traffic. The real traffic data set contains the information needed to determine the protocol type of each flow by manual classification, as well as the feature parameters needed in later steps, such as packet length, inter-packet time and packet direction. In terms of transport-layer protocol, the data contain at least TCP flows and UDP flows. The data set is separated into individual TCP flows by the {source address, destination address, source port, destination port, transport-layer protocol type} five-tuple, which turns it into a set of TCP flows. The start of a TCP flow may be identified by, but is not limited to, the SYN, SYN/ACK and ACK packets of the flow, and the packets within a flow must be ordered by their arrival at the observation point. The protocol type of each TCP flow, such as WWW, MAIL, FTP or P2P, is obtained offline by methods such as payload analysis.
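The flow separation in S101 can be sketched in a few lines of Python. This is a minimal illustration, not the patent's implementation: the `Packet` record, the flag strings and the choice to drop flows whose SYN was never observed are all assumptions made for the example.

```python
from collections import defaultdict, namedtuple

# Hypothetical packet record; flags is a string such as "S", "SA" or "A".
Packet = namedtuple("Packet", "ts src dst sport dport proto flags length")
TCP = 6  # IANA protocol number for TCP

def flow_key(p):
    """Direction-independent five-tuple, so both directions map to one flow."""
    a, b = (p.src, p.sport), (p.dst, p.dport)
    return (min(a, b), max(a, b), p.proto)

def split_tcp_flows(packets):
    """Group TCP packets into flows that begin at a SYN (without ACK) packet."""
    flows = defaultdict(list)
    for p in sorted(packets, key=lambda p: p.ts):  # arrival order
        if p.proto != TCP:
            continue  # UDP and other traffic is set aside
        key = flow_key(p)
        is_syn = "S" in p.flags and "A" not in p.flags
        if key in flows or is_syn:  # assumption: ignore flows with no SYN seen
            flows[key].append(p)
    return dict(flows)

packets = [
    Packet(0.0, "10.0.0.1", "10.0.0.2", 1234, 80, 6, "S", 60),
    Packet(0.1, "10.0.0.2", "10.0.0.1", 80, 1234, 6, "SA", 60),
    Packet(0.2, "10.0.0.1", "10.0.0.2", 1234, 80, 6, "A", 52),
    Packet(0.3, "10.0.0.3", "10.0.0.2", 5555, 53, 17, "", 80),   # UDP
    Packet(0.4, "10.0.0.4", "10.0.0.2", 4444, 80, 6, "A", 52),   # no SYN seen
]
flows = split_tcp_flows(packets)
```

Only the handshake of the first connection survives as a flow; manual labelling (for example by payload analysis) would then attach one protocol type to each key.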
S102: Extract packet features from the TCP flow set collected in advance.
Extract features of the packets in each TCP flow: the packet length, inter-packet time and transmission direction of the first 10 packets of each flow. The packet features extracted for each flow fall into three groups: packet length attributes, inter-packet time attributes and packet direction attributes. That is: the length of the first packet, the length of the second packet, the length of the third packet, ..., the length of the tenth packet; the interval between the first packet and a notional zeroth packet (defined as 0), the interval between the second and first packets, the interval between the third and second packets, ..., the interval between the tenth and ninth packets; and the direction of the first packet, the direction of the second packet, the direction of the third packet, ..., the direction of the tenth packet. A preliminary feature sequence is built following the order of the packets within the TCP flow, and a feature selection algorithm is then used to screen the packet features and obtain the final feature sequence.
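The preliminary feature sequence of S102 can be sketched as follows. The function name `extract_features`, the tuple layout and the use of +1/-1 for direction are assumptions for illustration; the feature selection step that follows is not shown.

```python
def extract_features(flow, n=10):
    """Build the preliminary feature sequence from the first n packets.

    flow is a list of (timestamp, length, direction) tuples in arrival order,
    where direction is +1 for the direction of the first packet and -1 for
    the reverse direction (an assumed encoding).
    """
    head = flow[:n]
    lengths = [length for _, length, _ in head]
    # The first packet has no predecessor, so its interval is defined as 0.
    gaps = [0.0] + [head[i][0] - head[i - 1][0] for i in range(1, len(head))]
    directions = [direction for _, _, direction in head]
    return lengths + gaps + directions

features = extract_features([(0.0, 60, 1), (0.5, 60, -1), (2.0, 52, 1)])
```

With the full ten packets this yields 30 candidate features per flow, from which the feature selection algorithm keeps a subset.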
S103: Build the decision tree classification model.
Using a high-level programming language such as Java or Matlab, or directly using the Weka machine learning software, build the decision tree model from the real network traffic collected in advance, based on the classical C4.5 algorithm.
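For readers unfamiliar with C4.5, its split criterion is the gain ratio: the information gain of a candidate split divided by the split information. The sketch below evaluates that criterion for one numeric threshold. It is not a full C4.5 implementation and not the tooling named above; the function names and the toy data are illustrative assumptions.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy of a list of class labels, in bits."""
    total = len(labels)
    return -sum(c / total * log2(c / total) for c in Counter(labels).values())

def gain_ratio(values, labels, threshold):
    """Gain ratio of the binary split 'value <= threshold' on one attribute."""
    left = [l for v, l in zip(values, labels) if v <= threshold]
    right = [l for v, l in zip(values, labels) if v > threshold]
    if not left or not right:
        return 0.0  # a split that separates nothing is useless
    n = len(labels)
    wl, wr = len(left) / n, len(right) / n
    gain = entropy(labels) - wl * entropy(left) - wr * entropy(right)
    split_info = -(wl * log2(wl) + wr * log2(wr))
    return gain / split_info

# Toy data: length of the first packet and the manually assigned class.
lengths = [60, 70, 1400, 1500]
classes = ["WWW", "WWW", "P2P", "P2P"]
best = gain_ratio(lengths, classes, 100)
worse = gain_ratio(lengths, classes, 65)
```

The threshold chosen at each internal node is what later becomes an "internal node value" in the converted data structure.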
S104: Convert the structure of the decision tree and store it in the RAM of the FPGA.
Traverse the decision tree. On the one hand, extract the internal node values of the tree: every path is traversed from top to bottom and its internal node values are extracted. A single path may contain internal nodes that test different attributes, and internal nodes that test the same attribute may lie on different paths. After all internal node values of the tree have been extracted and collected, the values of each attribute are sorted in ascending order and then encoded in that order. For example, if the internal node values of a packet length attribute are, in ascending order, 70, 80, 90, 100 and 110, and the code width is 3 bits, then 70 is encoded as 000, 80 as 001, 90 as 010, 100 as 011 and 110 as 100. The internal node values of the inter-packet time and packet direction attributes are encoded in the same way. On the other hand, extract the leaf node values of the tree and encode them as well; the code of a leaf node value is a range that depends on the codes of the internal nodes traversed on the path to that leaf. The internal node values with their codes, and the leaf node values with their codes, are stored in two separate RAMs. Building the decision tree and converting its data structure are done offline, before the traffic classification device is used for online classification of network traffic.
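The first half of the conversion, collecting and encoding the internal node values, can be sketched as below. The nested-dict tree format (`attr`, `threshold`, `le`, `gt`) is a hypothetical representation chosen for the example, and encoding each attribute separately is one reading of the text that matches its 70 to 110 example.

```python
def collect_thresholds(node, acc=None):
    """Walk every path of the tree and gather internal node values per attribute."""
    acc = {} if acc is None else acc
    if isinstance(node, dict):  # internal node; leaves are plain class labels
        acc.setdefault(node["attr"], set()).add(node["threshold"])
        collect_thresholds(node["le"], acc)
        collect_thresholds(node["gt"], acc)
    return acc

def encode_thresholds(thresholds):
    """Sort each attribute's values ascending and assign fixed-width binary codes."""
    table = {}
    for attr, values in thresholds.items():
        ordered = sorted(values)
        bits = max(1, (len(ordered) - 1).bit_length())  # code width in bits
        table[attr] = [(v, format(i, f"0{bits}b")) for i, v in enumerate(ordered)]
    return table

# Hypothetical tree: internal nodes test "feature <= threshold".
tree = {
    "attr": "len1", "threshold": 90,
    "le": {"attr": "len1", "threshold": 70, "le": "WWW", "gt": "MAIL"},
    "gt": {"attr": "gap2", "threshold": 0.5, "le": "FTP", "gt": "P2P"},
}
table1 = encode_thresholds(collect_thresholds(tree))
example = encode_thresholds({"len": {70, 80, 90, 100, 110}})["len"]
```

The `example` call reproduces the 3-bit codes of the worked example in the text; `table1` is the content that would be written to the first RAM.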
The storage layout of the two RAMs is shown in Table 1 and Table 2. Each row of a table is one record; each record of Table 1 stores one internal node value and the code of that value. Here n1 is the number of internal node values of attribute 1, n2 is the number of internal node values of attribute 2, nk is the number of internal node values of attribute k, and m is the number of protocol types.
Table 1. Storage layout of internal node values and their codes in RAM
Internal node value | Internal node code
1st internal node value of attribute 1 | Code of the 1st internal node value of attribute 1
2nd internal node value of attribute 1 | Code of the 2nd internal node value of attribute 1
n1-th internal node value of attribute 1 | Code of the n1-th internal node value of attribute 1
1st internal node value of attribute 2 | Code of the 1st internal node value of attribute 2
2nd internal node value of attribute 2 | Code of the 2nd internal node value of attribute 2
n2-th internal node value of attribute 2 | Code of the n2-th internal node value of attribute 2
... | ...
1st internal node value of attribute k | Code of the 1st internal node value of attribute k
2nd internal node value of attribute k | Code of the 2nd internal node value of attribute k
nk-th internal node value of attribute k | Code of the nk-th internal node value of attribute k
Table 2. Storage layout of leaf node values and their codes in RAM
Leaf node value (class label) | Leaf node code
Value of protocol type 1 | Code range of protocol type 1
Value of protocol type 2 | Code range of protocol type 2
... | ...
Value of protocol type m | Code range of protocol type m
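The text does not spell out how a leaf's code range is computed, so the following is only one plausible reading, offered as a sketch. It assumes that each attribute value is reduced to an interval index (how many of that attribute's sorted internal node values it exceeds) and that a leaf is described by the range of interval indices its path allows for each attribute. The tree format and all names are hypothetical.

```python
def leaf_ranges(node, order, bounds=None, out=None):
    """Return [(class_label, {attr: (lo, hi)})] with one entry per leaf.

    order maps each attribute to its sorted internal node values. For an
    attribute with n values, interval indices run from 0 to n.
    """
    if bounds is None:
        bounds = {a: (0, len(t)) for a, t in order.items()}
    if out is None:
        out = []
    if not isinstance(node, dict):  # leaf: record the ranges of its path
        out.append((node, dict(bounds)))
        return out
    attr = node["attr"]
    i = order[attr].index(node["threshold"])
    lo, hi = bounds[attr]
    # "value <= threshold i" keeps interval indices up to i ...
    leaf_ranges(node["le"], order, {**bounds, attr: (lo, min(hi, i))}, out)
    # ... and "value > threshold i" keeps indices from i + 1 upward.
    leaf_ranges(node["gt"], order, {**bounds, attr: (max(lo, i + 1), hi)}, out)
    return out

tree = {
    "attr": "len1", "threshold": 90,
    "le": {"attr": "len1", "threshold": 70, "le": "WWW", "gt": "MAIL"},
    "gt": {"attr": "gap2", "threshold": 0.5, "le": "FTP", "gt": "P2P"},
}
order = {"len1": [70, 90], "gap2": [0.5]}
table2 = leaf_ranges(tree, order)
```

Under this reading, a protocol type that appears at several leaves would occupy several rows, and an attribute that a path never tests keeps its full range.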
S105: Separate the packets to be classified into flows.
Using the {source address, destination address, source port, destination port, transport-layer protocol type} five-tuple, divide the packets into flows and maintain the flow information table, which records the five-tuple of each flow and the class of that flow. Only TCP flows with complete semantics are analyzed: the TCP three-way handshake marks the start of a flow, and a TCP segment with FIN=1 or RST=1 marks its end. Whether messages belong to one flow is decided from their {source address, destination address, source port, destination port, transport-layer protocol type} five-tuple: messages with the same five-tuple belong to the same flow; otherwise they belong to different flows. If two packets have the same source address, they travel in the same direction of the flow; if the source address of one is the destination address of the other, they travel in opposite directions of the flow. By convention, the direction of the first message is the upstream direction of the flow. In addition, if the interval between two messages exceeds a certain time, they belong to different flows. Each record of the flow information table contains the following: an ID identifying the flow, the {source address, destination address, source port, destination port, transport-layer protocol type} five-tuple, and the identified protocol type. The flow information table only needs to hold records of classified flows, not of unclassified ones, so a flow with no record can immediately be judged unclassified, which saves lookup time.
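A software analogue of the flow information table might look like the sketch below. It assumes a dictionary keyed by a direction-independent five-tuple; the class name `FlowInfoTable` and its methods are hypothetical, and the idle timeout and FIN/RST handling described above are left out for brevity.

```python
class FlowInfoTable:
    """Holds records for classified flows only; a miss means 'unclassified'."""

    def __init__(self):
        self.records = {}

    @staticmethod
    def key(src, dst, sport, dport, proto):
        # Both directions of a connection map to the same record.
        a, b = (src, sport), (dst, dport)
        return (min(a, b), max(a, b), proto)

    def lookup(self, five_tuple):
        """Return the class label, or None if the flow is not yet classified."""
        return self.records.get(self.key(*five_tuple))

    def update(self, five_tuple, label):
        """Write access; in the device only the decision tree search module does this."""
        self.records[self.key(*five_tuple)] = label

table = FlowInfoTable()
forward = ("10.0.0.1", "10.0.0.2", 1234, 80, 6)
reverse = ("10.0.0.2", "10.0.0.1", 80, 1234, 6)
before = table.lookup(forward)
table.update(forward, "WWW")
after = (table.lookup(forward), table.lookup(reverse))
```

Because unclassified flows have no record, the lookup for a new flow ends at the first miss instead of reading and testing a "classified" field.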
S106: Determine whether the TCP flow to which the packet belongs has been classified.
Use the five-tuple extracted from the packet in S105 to look up the flow information table and check whether it already holds a record for the flow that this five-tuple represents. If a record exists, read the class label of the flow; if not, the flow has not been classified.
S107: Attach the correct label to classified packets.
Use the classification information obtained in S106 to tag every packet that passes through. If the flow to which a packet belongs has been classified, attach the corresponding class label; classification of that packet is then complete.
S108: Attach the default label to unclassified packets and extract the packet features of the TCP flow to be classified.
An unclassified packet is given a default label according to a predefined rule, and it is then decided whether packet features need to be extracted from this packet, which is processed accordingly. The packet features extracted here correspond to the final feature sequence adopted in S102: the attributes extracted, and the packets they are extracted from, match that final feature sequence. The features are extracted in packet arrival order to build the feature sequence of the flow to be classified. Like the flow information table, the traffic information extraction module maintains a parameter table. Each record of the parameter table contains the following: an ID identifying the flow, the {source address, destination address, source port, destination port, transport-layer protocol type} five-tuple, the interval between a given packet and the previous packet, the length of a given packet, the direction of a given packet, and a flag indicating whether the parameters of the flow are complete.
The transmission of network data (that is, the frames carrying the packets) is unaffected, because the traffic information extraction module acts as a data probe: it only copies the parameter information that passes through it and changes neither the data nor their transmission timing.
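Steps S106 to S108 can be sketched together as a tagging function backed by a parameter table. Everything named here is an assumption for illustration: the `DEFAULT_LABEL` value, the `ParameterTable` class, the use of a plain dictionary for the flow labels, and the choice of two packets per flow in the demo.

```python
DEFAULT_LABEL = "UNKNOWN"  # hypothetical default label for unclassified packets

class ParameterTable:
    """Per-flow feature rows with a 'full' flag, as a software stand-in."""

    def __init__(self, n_packets):
        self.n = n_packets
        self.rows = {}

    def add_packet(self, key, ts, length, direction):
        """Record one packet; return the feature list once the row is full."""
        row = self.rows.setdefault(
            key, {"last_ts": ts, "features": [], "full": False})
        if row["full"]:
            return None  # nothing more to extract for this flow
        gap = ts - row["last_ts"]  # 0 for the first packet of the flow
        row["features"].append((length, gap, direction))
        row["last_ts"] = ts
        if len(row["features"]) == self.n:
            row["full"] = True
            return row["features"]  # would be sent to the decision tree search
        return None

def tag_packet(flow_labels, params, key, ts, length, direction):
    """Return (label, features); features is set only when a row just filled."""
    label = flow_labels.get(key)
    if label is not None:
        return label, None  # classified flow: tag and forward
    return DEFAULT_LABEL, params.add_packet(key, ts, length, direction)

flow_labels = {"f1": "WWW"}
params = ParameterTable(n_packets=2)
r1 = tag_packet(flow_labels, params, "f1", 0.0, 60, 1)
r2 = tag_packet(flow_labels, params, "f2", 0.0, 60, 1)
r3 = tag_packet(flow_labels, params, "f2", 0.5, 1500, -1)
r4 = tag_packet(flow_labels, params, "f2", 0.75, 1500, -1)
```

The packet itself is never modified or delayed by the extraction; only a copy of its parameters lands in the table, which mirrors the probe behaviour described above.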
S109: Search the decision tree.
Use the feature sequence of the flow to be classified obtained in S108 to search the two RAMs obtained in S104, determine the class label of the TCP flow, and update the flow information table. The search uses a parallel processing strategy and completes in only two clock cycles. In the first clock cycle, all internal node values of all attributes are compared in parallel to determine the internal node codes to which the flow belongs. That is, the first clock cycle compares the n1 internal node values of attribute 1 to determine the interval of internal node values into which attribute 1 falls, compares the n2 internal node values of attribute 2 to determine the interval into which attribute 2 falls, ..., and compares the nk internal node values of attribute k to determine the interval of internal node values into which attribute k falls. These n1 + n2 + ... + nk comparators all start at the same time and run in parallel. When the first clock cycle ends, the interval of every attribute of the flow is known, and the records in the RAM give the corresponding internal node codes. Each attribute yields one internal node code, so a flow yields k internal node codes, which are merged into a single data word. In the second clock cycle, the merged result of the previous cycle is compared in parallel with all leaf node codes, which determines the leaf node value of the flow, that is, its protocol type.
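The two-cycle search can be modelled in software as two functions, one per cycle. This is a behavioural sketch only: Python evaluates the comparisons one after another, whereas the hardware runs them simultaneously, and the table contents, the interval-index encoding and the names `stage1` and `stage2` are assumptions made for the example.

```python
# Hypothetical contents of the two RAMs for a small tree.
THRESHOLDS = {"len1": [70, 90], "gap2": [0.5]}  # sorted internal node values
LEAVES = [                                       # (class label, code ranges)
    ("WWW", {"len1": (0, 0), "gap2": (0, 1)}),
    ("MAIL", {"len1": (1, 1), "gap2": (0, 1)}),
    ("FTP", {"len1": (2, 2), "gap2": (0, 0)}),
    ("P2P", {"len1": (2, 2), "gap2": (1, 1)}),
]

def stage1(features, thresholds):
    """Cycle 1: one comparator per internal node value, conceptually all at once.

    The code of an attribute is the number of its values that the feature exceeds.
    """
    return {a: sum(features[a] > t for t in ts) for a, ts in thresholds.items()}

def stage2(codes, leaves):
    """Cycle 2: match the merged codes against every leaf range, conceptually at once."""
    matches = [label for label, ranges in leaves
               if all(lo <= codes[a] <= hi for a, (lo, hi) in ranges.items())]
    return matches[0] if matches else None

def classify(features):
    return stage2(stage1(features, THRESHOLDS), LEAVES)

results = [
    classify({"len1": 60, "gap2": 0.1}),
    classify({"len1": 90, "gap2": 0.9}),
    classify({"len1": 100, "gap2": 0.5}),
    classify({"len1": 1400, "gap2": 0.9}),
]
```

Neither stage depends on the depth of the tree, which is why the lookup time stays fixed at two cycles however many levels the tree has.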
Fig. 2 is a structural diagram of the traffic classification device provided by the present invention.
Functionally, the traffic classification device can be divided into an online part and an offline part. The offline part mainly builds the decision tree and converts its data structure; the online part is mainly responsible for classifying unknown traffic. The offline part comprises, connected in sequence, the early-stage data traffic collection module 201, the early-stage data traffic splitting module 202, the early-stage data traffic manual classification module 203, the early-stage data flow feature extraction module 204, the decision tree construction module 205, the decision tree structure conversion module 206, and the later-stage classification result processing module 207. The online part comprises, connected in sequence, MAC layer processing module 1, the packet polling management module 212, the distribution judgment module 213, the traffic information extraction and tagging module 214, the decision tree search module 215, and MAC layer processing module 2 216.
In this traffic classification device, the early-stage data traffic collection module 201, the early-stage data traffic splitting module 202, the early-stage data traffic manual classification module 203, the early-stage data flow feature extraction module 204, the decision tree construction module 205, and the decision tree structure conversion module 206 can complete their work before the device is deployed, and are therefore not necessary components of a device or system that uses traffic classification. By contrast, MAC layer processing module 1, the packet polling management module 212, the distribution judgment module 213, the traffic information extraction and tagging module 214, the decision tree search module 215, MAC layer processing module 2 216, and the classification result processing module 207 should generally be present in such a device or system.
The specific function of each module and the processing flow are as follows. Before a device or system with traffic classification is put into use, the early-stage data traffic collection module 201, the early-stage data traffic splitting module 202, the early-stage data traffic manual classification module 203, the early-stage data flow feature extraction module 204, the decision tree construction module 205, and the decision tree structure conversion module 206 are used to complete S101~S104 in Fig. 1, and the resulting converted decision tree data structure is placed in the RAM of the device. When a packet of unknown class enters the traffic classification device, MAC layer processing module 1 and the packet polling management module 212 preprocess the packet. The distribution judgment module 213 assigns packets to different flows according to the five-tuple {source address, destination address, source port, destination port, transport layer protocol type} and maintains the flow information table, then searches the flow information table to determine the class to which the packet belongs, completing S105~S106 in Fig. 1. The traffic information extraction and tagging module 214 tags each packet according to the class information obtained by the distribution judgment module 213; at the same time, following packet order, it extracts parameters such as packet length, corrected inter-packet gap time, and transmission direction in turn, forms the feature sequence, and sends it to the decision tree search module 215, completing S107~S109 in Fig. 1. The flow information table is updated only by the decision tree search module 215; no other module may write to it. MAC layer processing module 2 216 and the classification result processing module 207 perform subsequent processing on the packets and display the classification results.
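The five-tuple flow table used by the distribution judgment and tagging stages can be sketched as follows. This is a minimal illustration, not the patented hardware design: the class and function names, the `DEFAULT_LABEL` value, and the use of an in-memory dictionary are all assumptions, and the five-tuple is used as a key exactly as given, without merging the two directions of a connection.

```python
from typing import Dict, NamedTuple, Optional


class FiveTuple(NamedTuple):
    """Flow key: the five fields named in the description."""
    src_addr: str
    dst_addr: str
    src_port: int
    dst_port: int
    protocol: str


# Hypothetical default label for packets of flows not yet classified.
DEFAULT_LABEL = "unclassified"


class FlowTable:
    """Maps a five-tuple to the class label of its flow."""

    def __init__(self) -> None:
        self._classes: Dict[FiveTuple, str] = {}

    def lookup(self, key: FiveTuple) -> Optional[str]:
        """Return the class label, or None if the flow has no record."""
        return self._classes.get(key)

    def update(self, key: FiveTuple, label: str) -> None:
        """Record a classification result.

        In the described device only the decision tree search stage writes
        to the table; this sketch does not enforce that restriction.
        """
        self._classes[key] = label


def label_packet(table: FlowTable, key: FiveTuple) -> str:
    """Tag a packet with its flow's class, or with the default label."""
    label = table.lookup(key)
    return label if label is not None else DEFAULT_LABEL
```

A packet whose flow has no record is tagged with the default label; once the search stage writes a result, later packets of the same flow receive that class label directly.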
The method and device provided by this embodiment convert the data structure of the decision tree built with the C4.5 algorithm into one that is easy to implement in hardware, which reduces the complexity of the algorithm itself; use parallel search and pipelining in the decision tree lookup, which increases processing speed; select packet features whose extraction is simple and easy to complete online; and exploit the high accuracy and good stability of the C4.5 algorithm itself. This embodiment can therefore readily achieve high-speed online classification of network traffic.
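The data structure conversion mentioned above, from a tree into two flat tables, can be sketched in software as follows. The tree representation (nested tuples with string leaves), the function name `convert_tree`, and the example thresholds are assumptions made for illustration; the patent does not specify a concrete encoding at this level of detail.

```python
from bisect import bisect_left


def convert_tree(tree, num_attrs):
    """Flatten a threshold decision tree into two lookup tables.

    Assumed tree format: a leaf is a class label string; an internal node is
    (attr_index, threshold, left, right), where the left subtree handles
    value <= threshold and the right subtree handles value > threshold.
    Returns (branch_table, leaf_table): sorted thresholds per attribute, and
    for each leaf the inclusive range of interval codes per attribute.
    """
    thresholds = [set() for _ in range(num_attrs)]

    def collect(node):
        # First traversal: gather every threshold used for each attribute.
        if isinstance(node, str):
            return
        attr, value, left, right = node
        thresholds[attr].add(value)
        collect(left)
        collect(right)

    collect(tree)
    branch_table = [sorted(s) for s in thresholds]
    leaf_table = []

    def emit(node, ranges):
        # Second traversal: narrow the code range along the path to a leaf.
        if isinstance(node, str):
            leaf_table.append((tuple(ranges), node))
            return
        attr, value, left, right = node
        idx = bisect_left(branch_table[attr], value)  # rank of the threshold
        lo, hi = ranges[attr]
        left_ranges = list(ranges)
        left_ranges[attr] = (lo, min(hi, idx))        # value <= threshold
        right_ranges = list(ranges)
        right_ranges[attr] = (max(lo, idx + 1), hi)   # value > threshold
        emit(left, left_ranges)
        emit(right, right_ranges)

    emit(tree, [(0, len(th)) for th in branch_table])
    return branch_table, leaf_table


# Invented example tree over two attributes.
EXAMPLE_TREE = (0, 100, "A", (1, 20, "B", (0, 500, "C", "D")))
```

Here the interval code of a value is the number of thresholds strictly below it, so an attribute with n thresholds has codes 0 to n, and each leaf ends up with one code range per attribute.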
Finally, it should be noted that the above embodiments are intended only to illustrate the technical solution and device of the present invention, not to limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that the technical solutions described in the foregoing embodiments may still be modified, or some of their technical features may be replaced by equivalents, and that such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims (9)

1. A method for online TCP flow classification based on a decision tree high-speed parallel processing strategy, characterized by comprising the following steps:
Step 1, collection, splitting, and manual classification of early-stage real traffic data: collect a data set of real network traffic, use the five-tuple to separate the data set into different TCP flows, and manually classify the set of TCP flows so that each TCP flow corresponds to one protocol type;
Step 2, extraction of several packet features from the early-stage TCP flow set: extract features of the packets in each TCP flow, build a preliminary feature sequence according to the order of the packets within that TCP flow, and then screen the packet features to obtain the final feature sequence;
Step 3, establishment of the C4.5 decision tree classification model: build a tree from the final feature sequence formed in step 2 using the C4.5 decision tree algorithm;
Step 4, data structure conversion of the decision tree built in step 3 and storage in a memory device of the hardware device: by traversing the decision tree, on the one hand extract the branch node values of the decision tree, sort the branch node values of each attribute from small to large, and then encode the branch node values of all attributes in ascending order; on the other hand extract the leaf node values of the decision tree and likewise encode them, where the code of a leaf node value is a range that depends on the code values of the branch nodes passed on the way to that leaf node; the branch node values with their codes and the leaf node values with their codes are stored in two separate memory devices;
Step 5, splitting and class judgment of packets to be classified: assign packets to different flows according to the five-tuple and search the flow information table to obtain class information, the flow information table recording the five-tuple information of each flow and the class of that flow;
Step 6, tagging of the current packet and extraction of the packet features of the TCP flow to be classified: use the class information extracted in step 5 to tag every packet that passes through; if the flow to which a packet belongs has been classified, apply the corresponding class label; if it has not, apply a default label according to a certain rule; then determine whether packet features need to be extracted from this packet and process it accordingly; here, the extracted packet features correspond to the final feature sequence adopted in step 2 and must be extracted in packet arrival order to build the feature sequence of the flow to be classified; this feature sequence is stored in a parameter table, each record of which contains the five-tuple, the packet feature values, and a flag indicating whether the parameters are full;
Step 7, decision tree lookup: use the feature sequence of the flow to be classified obtained in step 6 to search the two memory devices obtained in step 3, determine the class label of this TCP flow, and update the flow information table; this step uses parallel search and a pipeline structure to increase lookup speed; leaving aside other read/write operations and clock synchronization processing, the decision tree lookup completes in only two clock cycles: the first clock cycle compares all branch node values of all attributes in parallel, determines all branch node code values to which the flow belongs, and merges them into one data word; the second clock cycle compares the result of the previous clock cycle in parallel against the code values of all leaf nodes, thereby determining the class of the flow.
2. The method for online TCP flow classification based on a decision tree high-speed parallel processing strategy according to claim 1, characterized in that:
the C4.5 decision tree classification model is established on the basis of the real network traffic data set collected in step 1 and the packet features obtained in step 2, and the building of the C4.5 decision tree and the conversion of its data structure are performed by offline processing.
3. The method for online TCP flow classification based on a decision tree high-speed parallel processing strategy according to claim 1, characterized in that:
in step 2, the preliminarily extracted packet features are processed with a feature selection algorithm to screen out the packet features that best reflect the characteristics of the traffic classes.
4. The method for online TCP flow classification based on a decision tree high-speed parallel processing strategy according to claim 1, characterized in that:
in step 5, the flow information table only needs to hold records of classified flows and does not need to hold records of unclassified flows; therefore, when the flow information table is searched, a flow with no record can immediately be judged unclassified, which saves lookup time.
5. The method for online TCP flow classification based on a decision tree high-speed parallel processing strategy according to claim 1, characterized in that:
in step 6, packets are ordered by arrival, and the first request packet of the three-way handshake, that is, the Setup packet, is taken as the first packet of the flow.
6. The method for online TCP flow classification based on a decision tree high-speed parallel processing strategy according to claim 1, characterized in that:
in step 6, classified packets that have been given the correct label and unclassified packets that have been given the default label and require parameter extraction need clock synchronization processing, with an appropriate number of pipeline stages inserted to ensure that the FIFOs on the packet transmission path do not overflow.
7. A device for online TCP flow classification, for implementing the method for online TCP flow classification based on a decision tree high-speed parallel processing strategy according to claim 1, the device comprising an online part and an offline part, wherein the offline part has a C4.5 decision tree construction module, a C4.5 decision tree structure conversion module, and a classification result processing module; the online part comprises, connected in sequence, MAC layer processing module one, a packet polling management module, a distribution judgment module, a traffic information extraction and tagging module, a C4.5 decision tree search module, and MAC layer processing module two;
the C4.5 decision tree construction module, for building the decision tree model from early-stage real network data traffic;
the C4.5 decision tree structure conversion module, for converting the structure of the decision tree model into another data structure that is easy to implement in hardware and storing it in a memory device of the hardware device: by traversing the decision tree, on the one hand extract the branch node values of the decision tree, sort the branch node values of each attribute from small to large, and then encode the branch node values of all attributes in ascending order; on the other hand extract the leaf node values of the decision tree and likewise encode them, where the code of a leaf node value is a range that depends on the code values of the branch nodes passed on the way to that leaf node; the branch node values with their codes and the leaf node values with their codes are stored in two separate memory devices;
the MAC layer processing modules, for which many ready-made IP cores are now available;
the packet polling management module, for reading packets from N packet buffer queues; the input queues are accessed by polling, moving on to the next queue only after a complete packet has been read from the current queue;
the distribution judgment module, for assigning packets to different flows according to the five-tuple, determining whether the flow has been classified and, if so, what its class label is, and maintaining the flow information table, which records the five-tuple information of each flow and the class of that flow;
the traffic information extraction and tagging module, for extracting flow features from unclassified packets and tagging all packets; this module maintains a parameter table that records, for each flow, the five-tuple information, the feature values, and a flag indicating whether the feature values are full; the parameter information of flows whose feature values are full is sent to the C4.5 decision tree search module;
the C4.5 decision tree search module, for using the feature sequence of the flow to be classified sent by the traffic information extraction and tagging module to search the two memory devices produced by the decision tree structure conversion module, determining the class label of this TCP flow, and updating the flow information table; the lookup uses a parallel processing strategy and completes in only two clock cycles: the first clock cycle compares all branch node values of all attributes in parallel, determines all branch node code values to which the flow belongs, and merges them into one data word; the second clock cycle compares the result of the previous clock cycle in parallel against the code values of all leaf nodes, thereby determining the class of the flow, and the classification result is sent to the flow information table to update its records;
the classification result processing module, for aggregating, processing, and displaying the traffic classification results on an interface.
8. The device for online TCP flow classification according to claim 7, characterized in that:
the C4.5 decision tree structure conversion module of the offline part is connected to the C4.5 decision tree search module of the online part, and the classification result processing module of the offline part is indirectly connected to the C4.5 decision tree search module of the online part through a flow information table.
9. The device for online TCP flow classification according to claim 7, characterized in that:
the offline part further has an early-stage data traffic collection module, an early-stage data traffic splitting module, an early-stage data traffic manual classification module, and an early-stage data flow feature extraction module.
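As an illustration of the parameter table described in step 6 of claim 1 and in the traffic information extraction and tagging module of claim 7, the following sketch collects per-packet features in arrival order and reports a feature sequence once a record is full. The names, the choice of packet length and direction as the recorded features, and the constant `PACKETS_PER_FLOW` are assumptions for illustration only; the claims do not fix the number of packets or the exact features.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

# Hypothetical number of leading packets whose features are recorded.
PACKETS_PER_FLOW = 3


@dataclass
class ParamRecord:
    """One parameter table record: feature values plus a 'full' flag."""
    features: List[int] = field(default_factory=list)
    full: bool = False


class ParamTable:
    """Collects packet features per flow, keyed by the five-tuple."""

    def __init__(self) -> None:
        self._records: Dict[tuple, ParamRecord] = {}

    def add_packet(
        self, five_tuple: tuple, packet_len: int, direction: int
    ) -> Optional[Tuple[int, ...]]:
        """Record one packet in arrival order.

        Returns the complete feature sequence at the moment the record
        becomes full, and None otherwise (including for later packets).
        """
        rec = self._records.setdefault(five_tuple, ParamRecord())
        if rec.full:
            return None
        rec.features.extend([packet_len, direction])
        if len(rec.features) == 2 * PACKETS_PER_FLOW:
            rec.full = True
            return tuple(rec.features)
        return None
```

The returned tuple corresponds to the feature sequence that would be handed to the decision tree search stage; packets that arrive after the record is full are only tagged and contribute no further features.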
CN201210006268.7A 2012-01-09 2012-01-09 Method and device for classifying network traffic on line based on decision tree high-speed parallel processing Active CN102523241B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201210006268.7A CN102523241B (en) 2012-01-09 2012-01-09 Method and device for classifying network traffic on line based on decision tree high-speed parallel processing

Publications (2)

Publication Number Publication Date
CN102523241A CN102523241A (en) 2012-06-27
CN102523241B true CN102523241B (en) 2014-11-19

Family

ID=46294033

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201210006268.7A Active CN102523241B (en) 2012-01-09 2012-01-09 Method and device for classifying network traffic on line based on decision tree high-speed parallel processing

Country Status (1)

Country Link
CN (1) CN102523241B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9444730B1 (en) 2015-11-11 2016-09-13 International Business Machines Corporation Network traffic classification

Families Citing this family (24)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102904890A (en) * 2012-10-12 2013-01-30 哈尔滨工业大学深圳研究生院 State detection method for cloud data packet header
CN103209169B (en) * 2013-02-23 2016-03-09 北京工业大学 A kind of network traffics filtration system based on FPGA and method
CN104125106A (en) * 2013-04-23 2014-10-29 中国银联股份有限公司 Network purity detection device and method based on classified decision tree
EP3025522B1 (en) * 2013-12-13 2019-10-23 Telefonaktiebolaget LM Ericsson (publ) Traffic coordination for communication sessions involving wireless terminals and server devices
WO2016128989A1 (en) * 2015-02-10 2016-08-18 Telefonaktiebolaget Lm Ericsson (Publ) A method and apparatus for data mediation
CN105162663B (en) * 2015-09-25 2019-02-19 中国人民解放军信息工程大学 A kind of online method for recognizing flux based on adfluxion
CN106408007A (en) * 2016-09-07 2017-02-15 国家电网公司 Power communication network flow classification method and system
CN106572486B (en) * 2016-10-17 2020-11-27 湖北大学 Handheld terminal flow identification method and system based on machine learning
CN106975617B (en) * 2017-04-12 2018-10-23 北京理工大学 A kind of Classification of materials method based on color selector
CN108109702A (en) * 2017-07-04 2018-06-01 大连大学 The data selecting method of application size flow point class
CN108304164B (en) * 2017-09-12 2021-12-03 马上消费金融股份有限公司 Business logic development method and development system
CN108229573B (en) * 2018-01-17 2021-05-25 北京中星微人工智能芯片技术有限公司 Classification calculation method and device based on decision tree
CN109086815B (en) * 2018-07-24 2021-08-31 中国人民解放军国防科技大学 Floating point number discretization method in decision tree model based on FPGA
CN109063777B (en) * 2018-08-07 2019-12-03 北京邮电大学 Net flow assorted method, apparatus and realization device
CN109246095B (en) * 2018-08-29 2019-06-21 四川大学 A kind of communication data coding method suitable for deep learning
CN109784370A (en) * 2018-12-14 2019-05-21 中国平安财产保险股份有限公司 Data map generation method, device and computer equipment based on decision tree
CN110445689B (en) * 2019-08-15 2022-03-18 平安科技(深圳)有限公司 Method and device for identifying type of equipment of Internet of things and computer equipment
CN112887300B (en) * 2021-01-22 2022-02-01 北京交通大学 Data packet classification method
CN113240036B (en) * 2021-05-28 2023-11-07 北京达佳互联信息技术有限公司 Object classification method and device, electronic equipment and storage medium
CN113360740B (en) * 2021-06-04 2022-10-11 上海天旦网络科技发展有限公司 Data packet labeling method and system
CN113810311A (en) * 2021-09-14 2021-12-17 北京左江科技股份有限公司 Data packet classification method based on multiple decision trees
CN114900474B (en) * 2022-05-05 2023-08-22 鹏城实验室 Data packet classification method, system and related equipment for programmable switch
CN116226893B (en) * 2023-05-09 2023-08-01 北京明苑风华文化传媒有限公司 Client marketing information management system based on Internet of things
CN116521963A (en) * 2023-07-04 2023-08-01 北京智麟科技有限公司 Method and system for processing calculation engine data based on componentization

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5870735A (en) * 1996-05-01 1999-02-09 International Business Machines Corporation Method and system for generating a decision-tree classifier in parallel in a multi-processor system
CN101184097A (en) * 2007-12-14 2008-05-21 北京大学 Method of detecting worm activity based on flux information
CN102214213A (en) * 2011-05-31 2011-10-12 中国科学院计算技术研究所 Method and system for classifying data by adopting decision tree
CN102271090A (en) * 2011-09-06 2011-12-07 电子科技大学 Transport-layer-characteristic-based traffic classification method and device

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7039762B2 (en) * 2003-05-12 2006-05-02 International Business Machines Corporation Parallel cache interleave accesses with address-sliced directories
US8290882B2 (en) * 2008-10-09 2012-10-16 Microsoft Corporation Evaluating decision trees on a GPU

Non-Patent Citations (8)

* Cited by examiner, † Cited by third party
Title
C4.5: Programs for Machine Learning by J. Ross Quinlan. Morgan Kaufmann Publishers,Inc.,1993;Steven L.Salzberg;《machine learning》;19940930;第16卷(第3期);第235-240页 *
fast traffic classification in high speed networks;Rentao Gu 等;《challenges for next generation network operation and service management》;20081231;第429-432页 *
Rentao Gu 等.fast traffic classification in high speed networks.《challenges for next generation network operation and service management》.2008,第429-432页. *
Steven L.Salzberg.C4.5: Programs for Machine Learning by J. Ross Quinlan. Morgan Kaufmann Publishers,Inc.,1993.《machine learning》.1994,第16卷(第3期),第235-240页. *
数据挖掘中决策树算法的最新进展;韩慧 等;《计算机应用研究》;20041231;第21卷(第12期);第5-8页 *
数据挖掘的并行策略研究;颜雪松 等;《计算机工程与应用》;20030131;第39卷(第3期);第187-189页 *
韩慧 等.数据挖掘中决策树算法的最新进展.《计算机应用研究》.2004,第21卷(第12期), *
颜雪松 等.数据挖掘的并行策略研究.《计算机工程与应用》.2003,第39卷(第3期), *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9444730B1 (en) 2015-11-11 2016-09-13 International Business Machines Corporation Network traffic classification
US9596171B1 (en) 2015-11-11 2017-03-14 International Business Machines Corporation Network traffic classification
US9882807B2 (en) 2015-11-11 2018-01-30 International Business Machines Corporation Network traffic classification
US9942135B2 (en) 2015-11-11 2018-04-10 International Business Machines Corporation Network traffic classification

Also Published As

Publication number Publication date
CN102523241A (en) 2012-06-27

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant