US20090276427A1 - Method of Extracting Sections of a Data Stream - Google Patents

Publication number
US20090276427A1
Authority
US
United States
Prior art keywords
section
sequences
extracted
data stream
sections
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Abandoned
Application number
US12/505,147
Inventor
Neil Duxbury
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Roke Manor Research Ltd
Original Assignee
Roke Manor Research Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority claimed from GB0700926A (GB2445763A)
Priority claimed from GB0700928A (GB0700928D0)
Application filed by Roke Manor Research Ltd
Assigned to ROKE MANOR RESEARCH LIMITED. Assignment of assignors interest (see document for details). Assignors: DUXBURY, NEIL
Publication of US20090276427A1
Legal status: Abandoned

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/903 Querying
    • G06F16/90335 Query processing
    • G06F16/90344 Query processing by using string matching techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295 Named entity recognition

Definitions

  • This invention relates to a method of extracting sections of a data stream.
  • a particular example of searching data streams is in SPAM filtering where it is desirable to extract data having a particular label, or end point identifier, such as an email address, a domain name, a uniform resource locator, or telephone number.
  • a method of extracting sections of a data stream comprises determining a combination of at least two sequences of the set; comparing the combination of sequences with sequences in the data stream; and rejecting or accepting extraction of the section of the data stream based upon the result of the comparison; wherein if the combination of sequences does not include a start and end marker for the section, a search for the start and end markers is carried out before the section is extracted.
  • the present invention provides a high performance generic extraction framework which allows data stream content to be processed at high speed and used in a real time context.
  • extraction of the section is accepted if the combination of sequences in any order matches stored sequences in the section of the data stream.
  • extraction of the section is rejected if the combination of sequences does not match any of the sequences in the section of the data stream; and thereafter the search continues for further instances of the combination of sequences in another section.
  • a sequence comprises a series of bits having a predetermined format, such as an anchor, or a bridge.
  • the anchor is a statistically rare, or low probability sequence in the data stream.
  • the probability of occurrence is less than about 1%.
  • the combination of sequences comprises an anchor and a sequence adjacent to the anchor.
  • the combination of sequences comprises at least the first and last sequence of the section.
  • the combination of sequences comprises more than one sequence associated with an anchor; wherein the combination of anchor and sequences to form the section is determined; and wherein the section is only extracted if all sequences forming the section are present.
  • searches for combinations of sequences are carried out in parallel on different sections of the data stream.
  • each sequence comprises a series of bits of data, or multiple bytes of data.
  • the section comprises an end point identifier, such as a domain name; an email address; a uniform resource locator; or a telephone number.
  • Choosing a particular type of end point identifier allows a large amount of irrelevant data to be immediately discarded without having to search for a specific instance. For example, a SPAM filter could search for the domain name structure, so data lacking that format would not need to be considered.
  • each sequence is encoded in a separate state machine and multiple state machines are combined to represent the section.
  • a bridge provides a transition between separate state machines representing the sequences of the section.
  • the method further comprises filtering the extracted sections of the data stream; the filtering comprising determining a set of characters of interest; testing each section of the data stream for the presence of one or more of the set of characters of interest; and extracting sections in which at least one of the characters is present.
  • filtering is carried out to reduce the number of results more specifically, such as only emails having “.roke.” in their address.
  • the method further comprises determining a further set of characters of interest; testing for at least one character from the further set of characters in the portion of the data stream; and extracting sections in which at least one of the characters from the further sets of characters is also present in the section.
  • This step can be repeated until the amount of data which needs to be tested for a complete match is reduced to a reasonable amount.
  • the extracted sections are stored in a store and extracted as and when needed.
  • the extracted sections are input to a comparison stage; compared with specific examples of end point identifiers; and discarded if the section does not match a specific example in the comparison stage.
  • FIG. 1 is a block diagram of a typical system to which the method of the present invention is applied;
  • FIG. 2 illustrates domain name and DNIV state machines
  • FIG. 3 illustrates state machines when used with the ‘.’ anchor point
  • FIG. 4 illustrates state machine modifications for digram operation
  • FIG. 5 illustrates an example of extracting a page title
  • FIG. 6 shows an example of searching for a hyperlink
  • FIG. 7 shows an example of a search for a date and time format
  • FIGS. 8A-8D illustrate exemplary extraction and filtering arrangements for electronic mail in accordance with the present invention.
  • FIGS. 9A-9D illustrate exemplary extraction and filtering arrangements for URLs in accordance with the present invention.
  • the present invention describes a technique which allows structural forms of data to be identified and extracted, such as identifying and extracting data based on it being a domain name, an email address, or a date and time format.
  • A further example is search engine indexing, which automates the process of document retrieval and classification, e.g. a web spider extracting hyperlinks from html documents in order to construct a list of URLs to retrieve subsequently. Given the vast quantities of html content available on the Internet, efficient extraction of hyperlinks from web pages is required.
  • Another example is use in real time SPAM classification. Part of SPAM classification involves the identification of URIs/URLs, domain names or email addresses associated with SPAM objects. Such identification is used with whitelists/blacklists of SPAM items to filter out SPAM content. Due to the large quantities of SPAM present in modern communications networks, efficient identification and filtering of SPAM content is desired.
  • a section of data, typically representing an end point identifier, label, or meta-data, which section is to be identified and extracted, is broken down by encoding each subsection of the format within an individual state machine. Particular characters can then be used as bridges to move between one state machine and another, where a bridge character is used to move between the different machines describing a meta-data format.
  • a complete format is defined by creating a number of smaller machines that describe each subsection of the format. The machines are then used with the bridges to create a super machine that describes the entire format. Complete traversal of the super machine from its start state to its terminal state is used to identify the end point identifier format.
  • Anchors are signatures that are associated with the label of interest, in particular, single characters or sequences of characters that are statistically rare in free text, or binary data. This property can be used to quickly lock on to a location in free text that has a higher than average probability of being a subpart of the label of interest.
  • a hyperlink can be identified by recognising the domain name part of the format.
  • the domain name part of the hyperlink can be described using the following syntax:
  • [ ] square brackets are used to signify one or more optional components.
  • DNIV this is the set of characters that are illegal within the domain name part.
  • domain this is the set of characters that are legal within the domain name part.
  • the dot symbol is a bridge between two domain name parts.
  • DNIV domain name parts of the syntax
  • DNIV is also defined by the expression !domain.
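The idea of small machines joined by a bridge can be sketched as follows. This is a minimal illustration, not the patented implementation: the contents of the legal character set, the function names, and the requirement of at least one bridge are all assumptions made for the example.

```python
import string

# Assumed legal character set for a domain name part (the text does not list it).
DOMAIN = set(string.ascii_letters + string.digits + "-")
BRIDGE = "."


def is_dniv(ch: str) -> bool:
    """DNIV is the complement of the domain set, i.e. !domain."""
    return ch not in DOMAIN


def label_machine(text: str, pos: int) -> int:
    """Sub-machine for one domain name part: consume one or more legal characters.

    Returns the position just past the part, or -1 if no legal character is found.
    """
    end = pos
    while end < len(text) and text[end] in DOMAIN:
        end += 1
    return end if end > pos else -1


def match_domain(text: str) -> bool:
    """Super machine: label machines chained together by the '.' bridge.

    Accepts only if the whole string is traversed and at least one bridge
    was crossed (the second condition is an assumption of this sketch).
    """
    pos = label_machine(text, 0)
    bridges = 0
    while pos != -1 and pos < len(text):
        if text[pos] != BRIDGE:
            return False              # a DNIV character that is not a bridge
        bridges += 1
        pos = label_machine(text, pos + 1)
    return pos == len(text) and bridges >= 1
```

Here `match_domain("roke.co.uk")` traverses three label machines and two bridges, whereas a string such as `"roke..uk"` fails because the machine after the first bridge finds no legal character.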
  • FIG. 1 illustrates a typical system for operating the method of the present invention.
  • An input data stream 30 which could be from a store (not shown), or a real time data source, is input to a processor 31 which applies the method of the present invention.
  • the section is output 39 to a store 32 , or output 40 to a comparison stage 33 , such as a look up table.
  • Data which is not extracted is discarded 34 , although the discarded data stream could be subjected to additional tests, for example for an alternative label, or end user identifier.
  • the extracted sections of data may be stored before an optional filtering step 35 is applied and the sections which are filtered out can be returned to the store, or sent on for further processing in the comparison stage 33 . Sections which are not extracted in the filter stage 35 are discarded 36 .
  • the output 38 of the extracted and optionally, filtered data stream may be obtained from the store 32 , or as an output 39 from the comparison stage 33 .
  • the label or end point identifier which is used to determine which sections of the data stream are extracted is made up of parts, some of which may be statistically rarer than others in free text. Consequently, an effective method to increase the practical performance of the identification algorithm is to look for these parts before the others.
  • These parts known as anchor points, can be used to ‘lock on’ to a position in the data stream that may be an instance of the end point identifier type sought.
  • validation of the data is carried out by parsing outwards (forward and backwards) around the anchor point.
  • the ‘.’ symbols are statistically rarer in free text than the other characters contained in the domain name format.
  • This modification splits the domain name algorithm into two distinct machines as shown in FIG. 3 a and FIG. 3 b .
  • the identification algorithm first finds the signature ‘.domain’ using the machine defined in FIG. 3 a and then starting at the ‘.’ position in the data stream moves backwards and applies the smaller state machine defined in FIG. 3 b .
  • the domain name part is validated first as failure at any point allows the algorithm to continue moving forward through the data stream without expending unnecessary effort on validating the smaller part. From start point, start.
  • FIG. 3A moves from left to right starting at point 41
  • FIG. 3B moves from right to left starting at 41 . So for the pattern roke.co.uk, FIG. 3A would find the part ‘.co.uk’ at character position 5 . FIG. 3B would then start at position 5 and move from right to left to find the part ‘roke’. The pattern roke.co.uk is then subsequently extracted.
  • a valid character 148 takes us from start domain name 41 to the next state 149 .
  • an invalid domain name character 150 identifies the start of the complete pattern 151 (i.e. start domain name or the ‘r’ in roke.co.uk).
  • a valid domain name character 152 loops back on itself.
  • a dot 153 indicates another sub-domain and moves us to the next state 154 . From here a valid domain name character 155 moves us back and an invalid domain name character 156 results in failure 157 .
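The anchor-first search can be sketched roughly as below: lock on to a '.', validate the '.domain' part moving forwards, and only then parse backwards to find the start of the pattern. The character set and the function name are assumptions for illustration, and the sketch works on a string rather than a live stream.

```python
import string

DOMAIN = set(string.ascii_letters + string.digits + "-")  # assumed legal set


def extract_domains(stream: str) -> list[str]:
    """Lock on to '.' anchors, validate forwards, then parse backwards."""
    found = []
    pos = 0
    while True:
        dot = stream.find(".", pos)          # anchor search
        if dot == -1:
            return found
        # Forward machine: one or more '.domain' parts, e.g. '.co.uk'.
        end = dot
        while end < len(stream) and stream[end] == ".":
            nxt = end + 1
            while nxt < len(stream) and stream[nxt] in DOMAIN:
                nxt += 1
            if nxt == end + 1:               # dot not followed by a legal character
                break
            end = nxt
        if end == dot:                       # forward validation failed: move on
            pos = dot + 1
            continue
        # Backward machine: walk left from the anchor to the first illegal character.
        start = dot
        while start > 0 and stream[start - 1] in DOMAIN:
            start -= 1
        if start < dot:                      # at least one legal character before the dot
            found.append(stream[start:end])
        pos = end
```

For the input `"visit roke.co.uk now"` the forward pass accepts `.co.uk` and the backward pass recovers `roke`, so `roke.co.uk` is extracted. A full stop at the end of a sentence fails the forward pass immediately, so no backward work is spent on it.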
  • the meta-data format is defined as a collection of bytes.
  • modern processors have register sizes that are multiple bytes wide.
  • the machine register size can be exploited by adapting the state machines so that the state machine transitions are labelled with multi byte values rather than single byte values. In this instance the input byte stream is processed multiple bytes at a time instead of a single byte at a time.
  • the multi-byte state machine runs multiple instances of the single byte state machine each starting at different byte offset, i.e. the throughput is increased by processing the data in multiple machines operating in parallel.
  • An example of a simplified ‘.domain’ state machine that processes two bytes at a time is shown in FIG. 4 .
  • the machine is entered when any of the 16 bit patterns defined by Ch_d. or .Ch_d is found.
  • Ch_d Ch_d means a valid domain name character followed by a valid domain name character.
  • Ch_d !Ch_d means a valid domain name character followed by an invalid domain name character.
  • the term !Ch_d Ch_d means an invalid domain name character followed by a valid domain name character.
  • Ch_d. means a valid domain name character followed by a dot character.
  • the term .Ch_d means a dot character followed by a valid domain name character.
  • the machine is started by finding a pair of bytes defined by either of the following sequences Ch_d. or .Ch_d 50 followed by a valid domain name that satisfies this version of the domain name state machine.
  • the algorithm no longer looks for the ‘.’ symbol specifically but searches for a 16 bit sequence containing the ‘.’ symbol.
  • This modification also has the advantage that a 16 bit sequence containing a ‘.’ is statistically rarer than a bare ‘.’ symbol. Consequently, the algorithm rejects a larger fraction of potential alignments by enforcing the formatting of the characters around the ‘.’.
  • the machine is started by finding a pair of bytes defined by either of the following sequences, Ch_d. or .Ch_d 50 and in this case the test moves to the next point 51 .
  • the search moves to the next point 54 .
  • the next two bytes are Ch_d Ch_d 53
  • the search moves to the next point 54 .
  • the next two bytes are Ch_d. or .Ch_d 55 the search moves back to point 51 .
  • the next two bytes are Ch_d Ch_d
  • the next two bytes are any of the following Ch_d !
  • next two bytes are Ch_d. or .Ch_d 63 then a domain name has been found 69 .
  • the search moves to point 54 .
  • the search moves back to point 51 .
  • the invention uses a set of state machines to describe the format of an end point identifier, label or meta-data.
  • a super machine is created by linking the smaller machines using bridge characters. Anchor points may be defined in the format, so these are identified first to increase throughput.
  • a further feature is that multi-byte versions of the state machines may be defined to enable the input to be processed in parallel. Rather than process the byte stream 8 bits at a time a pointer is used to access the data several bytes at a time. Each vertex of the machine is labelled using a multi byte value. The value of the sequence of bytes pointed at by the pointer is then used to traverse the vertices of the machine. This means that several bytes of the input are processed for each transition of the machine which improves the throughput. In effect this can be thought of as running several single character machines in parallel i.e. the state machine design exploits the machine word size to enable parallel processing in software.
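A minimal sketch of the two-bytes-at-a-time anchor search follows. It only covers the entry test (a 16 bit value consisting of a dot next to a valid domain name character), not the full machine; the character set, the big-endian packing and the function name are assumptions for illustration.

```python
import string

# Assumed set of valid domain name characters, as byte values.
DOMAIN = set((string.ascii_letters + string.digits + "-").encode("ascii"))
DOT = ord(".")

# Every 16 bit value made of a valid character followed by a dot, or a dot
# followed by a valid character, packed big-endian into one integer.
DIGRAM_ANCHORS = {(c << 8) | DOT for c in DOMAIN} | {(DOT << 8) | c for c in DOMAIN}


def find_digram_anchors(data: bytes) -> list[int]:
    """Scan two bytes per step; return offsets of '.' bytes inside matching digrams.

    A trailing odd byte is ignored in this sketch.
    """
    hits = []
    for i in range(0, len(data) - 1, 2):
        word = int.from_bytes(data[i:i + 2], "big")   # one 16 bit read per step
        if word in DIGRAM_ANCHORS:
            hits.append(i if data[i] == DOT else i + 1)
    return hits
```

Both digram forms are needed because a dot may fall in either half of an aligned pair: in `b"roke.co.uk"` the first dot is read as `.c` and the second as `o.`. A dot surrounded by spaces, as in `b"a . b"`, matches neither form and is rejected without any further validation.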
  • the labels are separated by a sequence of characters from the valid set of characters that can be used within a URL.
  • the example is shown in FIG. 6
  • a symbol from the set ChURL (the set of valid URL characters) 82 takes the search to point 85 .
  • a symbol that is not in the set ChURL (!ChURL) 81 takes the search to point 83 and the search fails.
  • a valid URL character 86 loops the search back to point 85 .
  • an invalid URL character 84 results in failure 83 .
  • the quote character 87 takes the search to point 88 . At this point a valid hyperlink has been found and can be extracted.
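The hyperlink search can be sketched as a small state machine. The opening label `href="` and the contents of the valid URL character set are hypothetical choices for illustration; the text above only states that valid URL characters loop, an invalid character fails, and the quote character completes the match.

```python
import string

# Assumed set of valid URL characters (ChURL); the text does not enumerate it.
CH_URL = set(string.ascii_letters + string.digits + "-._~:/?#[]@!$&'()*+,;=%")
START_LABEL = 'href="'   # hypothetical opening label
END_LABEL = '"'          # the quote character that completes the match


def extract_hyperlinks(html: str) -> list[str]:
    links = []
    pos = 0
    while True:
        start = html.find(START_LABEL, pos)
        if start == -1:
            return links
        begin = i = start + len(START_LABEL)
        while i < len(html) and html[i] in CH_URL:   # valid URL character: loop
            i += 1
        # Success only if at least one URL character was read and a quote follows.
        if i > begin and i < len(html) and html[i] == END_LABEL:
            links.append(html[begin:i])
        pos = i                                      # continue after success or failure
```

A link containing a space fails at the first invalid character, and the search simply resumes from that point.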
  • the labels are separated by a sequence of characters from the set A-Z, a-z, 0-9 as illustrated in FIG. 5
  • the sequence <title> 71 takes the search to point 72 .
  • the characters A-Z, a-z, 0-9 ( 73 ) loop the search back to point 72 .
  • the symbols in the set !(A-Z, a-z, 0-9) !(</title>) 76 take the search to point 77 and the search fails.
  • the sequence </title> 74 takes the search to point 75 and the end.
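The page title search can be sketched in the same way, taking the start label `<title>`, the looping character set A-Z, a-z, 0-9 and the end label `</title>` from the text. The function name and the choice to return `None` on failure are assumptions for illustration.

```python
import string
from typing import Optional

TITLE_CHARS = set(string.ascii_letters + string.digits)  # A-Z, a-z, 0-9
START, END = "<title>", "</title>"


def extract_title(html: str) -> Optional[str]:
    start = html.find(START)
    if start == -1:
        return None
    begin = i = start + len(START)
    while i < len(html):
        if html.startswith(END, i):       # end label reached: extract
            return html[begin:i]
        if html[i] not in TITLE_CHARS:    # any other symbol: the search fails
            return None
        i += 1
    return None                           # stream ended before the end label
```

Because the looping set is limited to letters and digits, a title containing a space fails in this sketch; a practical character set would presumably be wider.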
  • search may be for a Date-Time format.
  • the pattern is:
  • the month can be one from the set of patterns Jan, Feb, Mar, Apr, May, Jun, Jul, Aug, Sep, Oct, Nov, Dec. NUM indicates one of the characters 0-9 and !(NUM) means not one of the characters 0-9.
  • a bridge character is needed to link the date and time parts.
  • a suitable bridge is the SPACE character after the year. The example is shown in FIG. 7 .
  • a valid month 90 moves the search to point 91 .
  • any character 92 takes the search to point 93 .
  • any character loops the search back to point 93 .
  • the SPACE character 95 takes the search to point 96 .
  • any character 97 takes the search to point 98 .
  • any character 99 loops the search back to point 98 .
  • the sequence: NUMNUM!(NUM) 100 completes the search 101 .
  • the present invention allows sections of data to be identified and extracted. Although the examples have been described using hyperlinks and domain names, the invention can be applied to many other end user identifier types including email address identification; URI/URL identification; Session Initiation Protocol (SIP) URI identification; E.164 telephone number detection; tag detection in other data formats; IP addresses, port range, protocol and session identifier detection; xml data structures, xml objects; HTML structures and objects; and detection of content types and identification of content from packet payloads.
  • the basic method can be improved to increase throughput and processing speed by use of an anchor structure, or looking for an ngram containing an anchor symbol.
  • FIGS. 8A-8D illustrate exemplary extraction and filtering arrangements for electronic mail in accordance with the present invention.
  • a separate filtering and extraction server e.g., SPAM filter
  • SPAM filter 805 A can be executed by an application specific integrated circuit (ASIC), microprocessor executing computer code, field programmable gate array and/or the like to perform the extraction and filtering functions.
  • the SPAM filter 805 A is coupled to an e-mail server 810 A, which in turn is coupled to a terminal 815 A.
  • Terminal 815 A can be any type of terminal, including a desktop computer, laptop computer and/or a wireless computing device (e.g., a wireless telephone and/or e-mail device).
  • Terminal 815 A includes an e-mail client 820 A for receiving the e-mails that pass from SPAM filter 805 A through e-mail server 810 A to terminal 815 A.
  • Terminal 815 A can include an application specific integrated circuit (ASIC), microprocessor executing computer code, field programmable gate array and/or the like to execute the e-mail client.
  • the e-mails can be output on printer 825 , display 830 or any other type of output device.
  • if an e-mail is filtered then it would not be provided to terminal 815 A, whereas those that are not filtered would be provided to the terminal.
  • the e-mails that are discarded by SPAM filter 805 A are those that are passed to the terminal, whereas those that are output from lookup table 6 are filtered and not passed to the terminal.
  • SPAM filter 805 B can be included in e-mail server 810 B.
  • SPAM filter 805 B can be a separate program on the same hardware as the e-mail server 810 B and/or can be a program executing within the e-mail server program.
  • SPAM filter 805 C can be included in terminal 815 C.
  • SPAM filter 805 C can be a program executing on terminal 815 C.
  • SPAM filter 805 D can be included in e-mail client 820 D.
  • SPAM filter 805 D can be, for example, a plug-in for e-mail client 820 D.
  • FIGS. 9A-9D illustrate exemplary extraction and filtering arrangements for URLs in accordance with the present invention.
  • a separate server 905 A is provided for performing the extraction and filtering described above.
  • Server 905 A can include an application specific integrated circuit (ASIC), microprocessor executing computer code, field programmable gate array and/or the like to perform the extraction and filtering functions.
  • the extraction and filtering server 905 A is coupled to a web server 910 A, which in turn is coupled to a terminal 915 A.
  • Terminal 915 A can be any type of terminal, including a desktop computer, laptop computer and/or a wireless computing device (e.g., a wireless telephone and/or e-mail device).
  • Terminal 915 A includes a browser client 920 A for browsing web pages that pass from the extraction and filtering server 905 A through web server 910 A to terminal 915 A.
  • Terminal 915 A can include an application specific integrated circuit (ASIC), microprocessor executing computer code, field programmable gate array and/or the like to execute the browser client.
  • the web pages can be output on printer 925 , display 930 or any other type of output device. In particular, if a web page is filtered then it would not be provided to terminal 915 A, whereas those that are not filtered would be provided to the terminal.
  • the web pages that are discarded by the extraction and filtering server 905 A are those that are passed to the terminal, whereas those that are output from lookup table 6 are filtered and not passed to the terminal.
  • extraction and filtering server 905 B can be included in web server 910 B.
  • server 905 B can be a separate server executing on the same hardware as the web server 910 B and/or can be a program executing within the web server program.
  • extraction and filtering server 905 C can be included in terminal 915 C.
  • server 905 C can be a separate server or can be a program executing on terminal 915 C.
  • extraction and filtering server 905 D can be included in browser client 920 D.
  • server 905 D can be, for example, a plug-in for browser client 920 D.
  • Although FIGS. 8A-8D and 9A-9D are described in connection with so-called blacklists, in which a match with the lookup table causes the email or web page to be excluded and not delivered to the terminal, the present invention can also be implemented with so-called whitelists. In this case a match with the lookup table allows the email or web page to be delivered to the terminal and a failure to match with the lookup table excludes the email or web page from being delivered.
  • the designation of the filtering and extraction element as being a server is used to cover a variety of different arrangements, including a physical server, a server program, a regular executable program and a plug-in program. The term server should be interpreted accordingly in connection with the claims.
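The blacklist and whitelist behaviour described for FIGS. 8A-8D and 9A-9D comes down to how a lookup table match is interpreted. A minimal sketch, in which the function name, the `mode` parameter and the example table entries are hypothetical:

```python
def deliver(identifier: str, lookup_table: set, mode: str = "blacklist") -> bool:
    """Return True if the e-mail or web page carrying this identifier is delivered."""
    matched = identifier in lookup_table
    if mode == "blacklist":
        return not matched    # a match excludes the item
    if mode == "whitelist":
        return matched        # only a match allows delivery
    raise ValueError(f"unknown mode: {mode}")


table = {"spam.example"}      # hypothetical lookup table entry
```

With this table, `deliver("spam.example", table)` is `False` in blacklist mode, while the same call with `mode="whitelist"` is `True`.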

Abstract

A method of extracting sections of a data stream, the sections including a set of sequences. Each sequence is encoded separately and coupled together to define a section. The method involves determining a combination of at least two sequences of the set, comparing the combination of sequences with sequences in the data stream, and rejecting or accepting extraction of the section of the data stream based upon the result of the comparison. If the combination of sequences does not include a start and end marker for the section, a search for the start and end markers is carried out before the section is extracted.

Description

    CROSS-REFERENCE TO RELATED APPLICATIONS
  • This application is a continuation of PCT International Application No. PCT/GB2008/000184, filed Jan. 18, 2008, which claims priority under 35 U.S.C. § 119 to Great Britain Patent Application No. 0700926.9, filed Jan. 18, 2007, and Great Britain Patent Application No. 0700928.5, filed Jan. 18, 2007; the entire disclosures of the aforementioned applications are herein expressly incorporated by reference.
  • The present application is also related to U.S. patent application Ser. No. ______, entitled “A Method of Filtering Sections of a Data Stream” and filed on even date herewith, which is a continuation of PCT International Application No. PCT/GB2008/000172, filed Jan. 18, 2008, which claims priority under 35 U.S.C. § 119 to Great Britain Patent Application No. 0700926.9, filed Jan. 18, 2007, and Great Britain Patent Application No. 0700928.5, filed Jan. 18, 2007; the entire disclosures of the aforementioned applications are herein expressly incorporated by reference.
  • BACKGROUND AND SUMMARY OF THE INVENTION
  • This invention relates to a method of extracting sections of a data stream.
  • There are many instances where a user wishes to find and extract only certain data types from a larger body of data. The data is typically presented as a data stream, whether from a store, or in real time, and if all of the data were processed fully, this would be very slow.
  • A particular example of searching data streams is in SPAM filtering where it is desirable to extract data having a particular label, or end point identifier, such as an email address, a domain name, a uniform resource locator, or telephone number.
  • In accordance with the present invention, a method of extracting sections of a data stream, the sections comprising a set of sequences, wherein each sequence is encoded separately and coupled together to define the section, comprises determining a combination of at least two sequences of the set; comparing the combination of sequences with sequences in the data stream; and rejecting or accepting extraction of the section of the data stream based upon the result of the comparison; wherein if the combination of sequences does not include a start and end marker for the section, a search for the start and end markers is carried out before the section is extracted.
  • The present invention provides a high performance generic extraction framework which allows data stream content to be processed at high speed and used in a real time context.
  • Preferably, extraction of the section is accepted if the combination of sequences in any order matches stored sequences in the section of the data stream.
  • Preferably, extraction of the section is rejected if the combination of sequences does not match any of the sequences in the section of the data stream; and thereafter the search continues for further instances of the combination of sequences in another section.
  • Preferably, a sequence comprises a series of bits having a predetermined format, such as an anchor, or a bridge.
  • Preferably, the anchor is a statistically rare, or low probability sequence in the data stream.
  • Typically, the probability of occurrence is less than about 1%.
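One way to pick candidate anchors is to measure symbol frequencies in representative data and keep those below the threshold. This is a sketch under that assumption; the function name and the sample are illustrative, and the 1% default mirrors the figure quoted above.

```python
from collections import Counter


def rare_symbols(sample: str, threshold: float = 0.01) -> set:
    """Return the symbols whose relative frequency in the sample is below the threshold."""
    counts = Counter(sample)
    total = len(sample)
    return {ch for ch, n in counts.items() if n / total < threshold}


# Illustrative sample: 200 symbols, of which exactly one is a dot (0.5%).
sample = "a" * 150 + "b" * 49 + "."
```

For this sample `rare_symbols(sample)` contains only `"."`, which would make the dot the anchor candidate.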
  • Preferably, the combination of sequences comprises an anchor and a sequence adjacent to the anchor.
  • This improves throughput because the longer combination is less likely to match by chance, so fewer candidate positions need to be validated in full.
  • Preferably, the combination of sequences comprises at least the first and last sequence of the section.
  • This allows the section to be extracted immediately if a match is found, whereas a successful match with a combination of sequences which does not include both start and end points requires the additional step of identifying these before extracting the section.
  • In one embodiment, the combination of sequences comprises more than one sequence associated with an anchor; wherein the combination of anchor and sequences to form the section is determined; and wherein the section is only extracted if all sequences forming the section are present.
  • This has the effect of only extracting sections where there is a complete match.
  • Preferably, searches for combinations of sequences are carried out in parallel on different sections of the data stream.
  • This could be by splitting the data stream, or looking for different combinations of sequences in the same part of the data stream.
  • Preferably, each sequence comprises a series of bits of data, or multiple bytes of data.
  • Preferably, the section comprises an end point identifier, such as a domain name; an email address; a uniform resource locator; or a telephone number.
  • Choosing a particular type of end point identifier allows a large amount of irrelevant data to be immediately discarded without having to search for a specific instance. For example, a SPAM filter could search for the domain name structure, so data lacking that format would not need to be considered.
  • Preferably, each sequence is encoded in a separate state machine and multiple state machines are combined to represent the section.
  • This makes the method more flexible.
  • Preferably, a bridge provides a transition between separate state machines representing the sequences of the section.
  • This allows the super state machine to be built up.
  • Preferably, the method further comprises filtering the extracted sections of the data stream; the filtering comprising determining a set of characters of interest; testing each section of the data stream for the presence of one or more of the set of characters of interest; and extracting sections in which at least one of the characters is present.
  • Having extracted sections which satisfy a minimum requirement, for example having a domain name format, then filtering is carried out to reduce the number of results more specifically, such as only emails having “.roke.” in their address.
  • Preferably, the method further comprises determining a further set of characters of interest; testing for at least one character from the further set of characters in the portion of the data stream; and extracting sections in which at least one of the characters from the further sets of characters is also present in the section.
  • This step can be repeated until the amount of data which needs to be tested for a complete match is reduced to a reasonable amount.
  • Although all the processing steps could be carried out in real time, preferably the extracted sections are stored in a store and extracted as and when needed.
  • Preferably, the extracted sections are input to a comparison stage; compared with specific examples of end point identifiers; and discarded if the section does not match a specific example in the comparison stage.
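  • By way of illustration only, the successive filtering and comparison steps described above might be sketched as follows. The function names `filter_sections` and `compare` and the sample data are assumptions introduced for this sketch and do not appear in the original disclosure.

```python
def filter_sections(sections, *char_sets):
    """Keep sections containing at least one character from every set.

    Each set of characters of interest is applied in turn, so every pass
    reduces the amount of data left for the full comparison stage.
    """
    kept = list(sections)
    for chars in char_sets:
        wanted = set(chars)
        kept = [s for s in kept if wanted & set(s)]
    return kept


def compare(sections, examples):
    """Comparison stage: discard sections that match no specific example."""
    known = set(examples)
    return [s for s in sections if s in known]


# Hypothetical extracted sections, all already in domain name format.
sections = ["roke.co.uk", "example.com", "abc.org"]
survivors = filter_sections(sections, "r", "k")
matches = compare(survivors, ["roke.co.uk"])
```

  • In this sketch the cheap character tests run first, so the exact comparison is applied only to the few sections that survive them.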
  • BRIEF DESCRIPTION OF THE DRAWING FIGURES
  • An example of a method of extracting sections of a data stream will now be described with reference to the accompanying drawings in which:
  • FIG. 1 is a block diagram of a typical system to which the method of the present invention is applied;
  • FIG. 2 illustrates domain name and DNIV state machines;
  • FIG. 3 illustrates state machines when used with the ‘.’ anchor point;
  • FIG. 4 illustrates state machine modifications for digram operation;
  • FIG. 5 illustrates an example of extracting a page title;
  • FIG. 6 shows an example of searching for a hyperlink;
  • FIG. 7 shows an example of a search for a date and time format;
  • FIGS. 8A-8D illustrate exemplary extraction and filtering arrangements for electronic mail in accordance with the present invention; and
  • FIGS. 9A-9D illustrate exemplary extraction and filtering arrangements for URLs in accordance with the present invention.
  • DETAILED DESCRIPTION
  • The present invention describes a technique which allows structural forms of data to be identified and extracted, such as identifying and extracting data based on it being a domain name, an email address, or a date and time format. Another example is search engine indexing, where document retrieval and classification are automated, e.g. a web spider extracting hyperlinks from html documents in order to construct a list of URLs to retrieve subsequently. Given the vast quantities of html content available on the Internet, efficient extraction of hyperlinks from web pages is required. Another example is use in real time SPAM classification. Part of SPAM classification involves the identification of URIs/URLs, domain names or email addresses associated with SPAM objects. Such identification is used with whitelists/blacklists of SPAM items to filter out SPAM content. Due to the large quantities of SPAM present in modern communications networks, efficient identification and filtering of SPAM content is desired.
  • A section of data, typically representing an end point identifier, label, or meta-data, which section is to be identified and extracted, is broken down by encoding each subsection of the format within an individual state machine. Particular characters can then be used as bridges to move between one state machine and another, where a bridge character is used to move between the different machines describing a meta-data format. Thus, a complete format is defined by creating a number of smaller machines that describe each subsection of the format. The machines are then used with the bridges to create a super machine that describes the entire format. Complete traversal of the super machine from its start state to its terminal state is used to identify the end point identifier format. Anchors are signatures that are associated with the label of interest, in particular, single characters or sequences of characters that are statistically rare in free text, or binary data. This property can be used to quickly lock on to a location in free text that has a higher than average probability of being a subpart of the label of interest.
  • An example of the present invention may be described with respect to identification and extraction of a hyperlink consisting of a sequence of characters followed by a domain name, e.g. href=http://www.roke.co.uk. In general, a hyperlink can be identified by recognising the domain name part of the format. The domain name part of the hyperlink can be described using the following syntax:
  • DNIV domain.domain[.domain] DNIV.
  • Within this syntax the following subgroups are identified:
  • [ ]: square brackets are used to signify one or more optional components.
    DNIV: this is the set of characters that are illegal within the domain name part.
    domain: this is the set of characters that are legal within the domain name part.
    ‘.’: the dot symbol is a bridge between two domain name parts.
  • In general, the sets of characters that compose the DNIV and domain parts of the syntax are defined by the standards for internet based computer names. DNIV is also defined by the expression !domain.
  • FIG. 1 illustrates a typical system for operating the method of the present invention. An input data stream 30, which could be from a store (not shown) or a real time data source, is input to a processor 31 which applies the method of the present invention. Whenever a section of the data stream satisfies the test criteria, the section is output 39 to a store 32, or output 40 to a comparison stage 33, such as a look up table. Data which is not extracted is discarded 34, although the discarded data stream could be subjected to additional tests, for example for an alternative label, or end user identifier. For convenience, the extracted sections of data may be stored before an optional filtering step 35 is applied and the sections which are filtered out can be returned to the store, or sent on for further processing in the comparison stage 33. Sections which are not extracted in the filter stage 35 are discarded 36. Thus, the output 38 of the extracted and, optionally, filtered data stream may be obtained from the store 32, or as an output 39 from the comparison stage 33.
  • The mechanism for extracting sections of the data stream is described in more detail with respect to FIGS. 2 and 3. Let a single valid domain name character be Chd; the term !Chd then means a character not in the set Chd. An example of a possible state machine for the domain name is defined in FIG. 2. In the example the ‘.’ symbols are examples of bridge characters. The ‘.’ character is used as a bridge between the sub-domains of the complete domain name.
  • From start domain name 1, if a valid domain name character Chd 2 is identified, the test moves on to the next point 3. If an invalid character 4, or bridge character 5, is found, the test fails 6. From point 3, an invalid character 7 causes a fail 8 and a valid character 9 loops back on itself, but a bridge character 10 moves the test on to the next point 11. From point 11 a bridge character 12, or an invalid character 13, causes a fail 6, whereas a valid character 14 moves on to the next point 15. A bridge character 16 moves to point 11, a valid character 17 loops back on itself to point 15 and an invalid character 18 moves to the end point, end domain name 19. For startDNIV, an invalid character moves the test to endDNIV (not shown). Having determined a start and end point for the domain name, the series of sequences making up this section of the data stream can be extracted for storage, or further processing. In the state machine the domain name format is identified in a left to right fashion as the text is examined. However, in principle the sub parts of the format can be identified in any order.
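  • By way of illustration only, the left to right state machine described with reference to FIG. 2 might be sketched as a transition table. The state names, the function `match_domain` and the exact membership of the set Chd (taken here to be ASCII letters, digits and the hyphen) are assumptions of this sketch, as is the treatment of the end of the input as an invalid character.

```python
import string

# Assumption: valid domain name characters (Chd) are ASCII letters, digits and '-'.
CHD = set(string.ascii_letters + string.digits + "-")
BRIDGE = "."


def classify(ch):
    """Map a character to the class used to label the transitions."""
    if ch == BRIDGE:
        return "bridge"
    return "chd" if ch in CHD else "invalid"


# Keyed by (state, character class); a missing entry means the test fails.
# The state names follow the reference numerals of the points in FIG. 2.
TRANSITIONS = {
    ("start", "chd"): "p3",
    ("p3", "chd"): "p3",
    ("p3", "bridge"): "p11",
    ("p11", "chd"): "p15",
    ("p15", "chd"): "p15",
    ("p15", "bridge"): "p11",
    ("p15", "invalid"): "end",
}


def match_domain(text, pos=0):
    """Return the index just past a domain name starting at pos, or None."""
    state = "start"
    i = pos
    while True:
        # Assumption: running off the end of the input counts as an invalid character.
        cls = classify(text[i]) if i < len(text) else "invalid"
        state = TRANSITIONS.get((state, cls))
        if state is None:
            return None
        if state == "end":
            return i
        i += 1
```

  • Under these assumptions `match_domain("roke.co.uk ")` returns 10, the position of the terminating invalid character, while a string with no bridge character, such as "roke", fails.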
  • The label or end point identifier which is used to determine which sections of the data stream are extracted is made up of parts, some of which may be statistically rarer than others in free text. Consequently, an effective method to increase the practical performance of the identification algorithm is to look for these parts before the others. These parts, known as anchor points, can be used to ‘lock on’ to a position in the data stream that may be an instance of the end point identifier type sought.
  • Once an anchor point has been found in the data stream, validation of the data is carried out by parsing outwards (forwards and backwards) around the anchor point. For the domain name example the ‘.’ symbols are statistically rarer in free text than the other characters contained in the domain name format. This modification splits the domain name algorithm into two distinct machines as shown in FIG. 3A and FIG. 3B. The identification algorithm first finds the signature ‘.domain’ using the machine defined in FIG. 3A and then, starting at the ‘.’ position in the data stream, moves backwards and applies the smaller state machine defined in FIG. 3B. The ‘.domain’ part is validated first, as failure at any point allows the algorithm to continue moving forward through the data stream without expending unnecessary effort on validating the smaller part. From start point, start.domain name 41, a bridge character 42 moves the test to the next point 43, where an invalid character 44 causes the test to fail 45 and a valid character 46 moves on to the next point 47. From here the process steps and results are the same as for the equivalent reference numbers in FIG. 2. From point 43 a bridge character 48 moves back to start.domain name 41. The machine in FIG. 3A moves from left to right starting at point 41, whereas the machine in FIG. 3B moves from right to left starting at 41. So for the pattern roke.co.uk, FIG. 3A would find the part ‘.co.uk’ at character position 5. FIG. 3B would then start at position 5 and move from right to left to find the part ‘roke’. The pattern roke.co.uk is then subsequently extracted.
  • The series of steps in FIG. 3B starts at the same position in the text as point 41, a valid character 148 takes us from startdomain name 41 to the next state 149. From this state 149 an invalid domain name character 150 identifies the start of the complete pattern 151 (i.e. startdomain name or the ‘r’ in roke.co.uk). A valid domain name character 152 loops back on itself. A dot 153 indicates another sub-domain and moves us to the next state 154. From here a valid domain name character 155 moves us back and an invalid domain name character 156 results in failure 157.
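  • By way of illustration only, the anchor based search might be sketched as follows: locate the ‘.’ anchor, validate the ‘.domain’ part moving forwards, and only then move backwards from the anchor to find the start of the pattern. The function names and the membership of the set Chd are assumptions of this sketch, and the two loops are simplified stand-ins for the machines of FIGS. 3A and 3B rather than exact transcriptions of them.

```python
import string

# Assumption: valid domain name characters (Chd) are ASCII letters, digits and '-'.
CHD = set(string.ascii_letters + string.digits + "-")


def forward_from_anchor(text, dot):
    """Validate '.domain[.domain...]' from the '.' at index dot (left to right).

    Returns the index just past the match, or None on failure.
    """
    i = dot
    while i < len(text) and text[i] == ".":
        j = i + 1
        while j < len(text) and text[j] in CHD:
            j += 1
        if j == i + 1:  # bridge not followed by a valid character
            return None
        i = j
    return i


def backward_from_anchor(text, dot):
    """Move right to left from the '.' at index dot; return the start index or None."""
    i = dot - 1
    if i < 0 or text[i] not in CHD:
        return None
    while True:
        while i >= 0 and text[i] in CHD:
            i -= 1
        if i >= 0 and text[i] == ".":
            # Another sub-domain lies to the left; it needs a valid character.
            if i - 1 < 0 or text[i - 1] not in CHD:
                return None
            i -= 1
            continue
        return i + 1


def extract_domains(text):
    """Lock on to each '.' anchor and extract the surrounding domain name."""
    found, pos = [], 0
    while True:
        dot = text.find(".", pos)
        if dot == -1:
            return found
        end = forward_from_anchor(text, dot)
        if end is None:
            pos = dot + 1  # cheap rejection: no backward parse is attempted
            continue
        start = backward_from_anchor(text, dot)
        if start is not None:
            found.append(text[start:end])
        pos = end
```

  • Validating the forward part first means that most candidate anchors in free text, such as a full stop followed by a space, are rejected after inspecting a single further character.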
  • Finally, performance can be further improved by exploiting the machine word size. The meta-data format is defined as a collection of bytes. However, modern processors have register sizes that are multiple bytes wide. The machine register size can be exploited by adapting the state machines so that the state machine transitions are labelled with multi byte values rather than single byte values. In this instance the input byte stream is processed multiple bytes at a time instead of a single byte at a time. Thus, in effect the multi-byte state machine runs multiple instances of the single byte state machine, each starting at a different byte offset, i.e. the throughput is increased by processing the data in multiple machines operating in parallel.
  • An example of a simplified ‘.domain’ state machine that processes two bytes at a time is shown in FIG. 4. Starting the state machine at the uppermost arc in FIG. 4, the machine is entered when any of the 16 bit patterns defined by Chd. or .Chd is found. Let a single valid domain name character be Chd; the term !Chd means a character not in the set Chd. The term ChdChd means a valid domain name character followed by a valid domain name character. The term Chd!Chd means a valid domain name character followed by an invalid domain name character. The term !ChdChd means an invalid domain name character followed by a valid domain name character. The term Chd. means a valid domain name character followed by a dot character. The term .Chd means a dot character followed by a valid domain name character.
  • The machine is started by finding a pair of bytes defined by either of the sequences Chd. or .Chd 50, followed by a valid domain name that satisfies this version of the domain name state machine.
  • Thus, the algorithm no longer looks for the ‘.’ symbol specifically but searches for a 16 bit sequence containing the ‘.’ symbol. This modification also has the advantage that a 16 bit sequence containing a ‘.’ is statistically rarer than a bare ‘.’ symbol. Consequently, the algorithm rejects a larger fraction of potential alignments by enforcing the formatting of the characters around the ‘.’.
  • The machine is started by finding a pair of bytes defined by either of the sequences Chd. or .Chd 50, and in this case the test moves to the next point 51. At point 51 if the next two bytes are Chd. or .Chd the search loops back on itself 52. At point 51 if the next two bytes are ChdChd 53 the test moves to the next point 54. At point 54 if the next two bytes are Chd. or .Chd 55 the search moves back to point 51. At point 54 if the next two bytes are ChdChd the search loops back on itself 56. At point 54 if the next two bytes are any of Chd!Chd or !ChdChd or !Chd!Chd 57 the search has failed 58. At point 54 if the next two bytes are Chd. or .Chd 59 then the search moves to point 60. At point 60 if the next two bytes are Chd. or .Chd 61 then the search moves to point 51. At point 60 if the next two bytes are ChdChd 62 then the search loops back on itself. At point 60 if the next two bytes are Chd!Chd or !ChdChd or !Chd!Chd 64 then a domain name has been found 65. At point 60 if the next two bytes are Chd. or .Chd 63 then a domain name has been found 69. At point 69 if the next two bytes are ChdChd 66 then the search moves to point 54. At point 69 if the next two bytes are Chd. or .Chd 67 then the search moves back to point 51.
  • In summary, the invention uses a set of state machines to describe the format of an end point identifier, label or meta-data. A super machine is created by linking the smaller machines using bridge characters. Anchor points may be defined in the format, so these are identified first to increase throughput. A further feature is that multi-byte versions of the state machines may be defined to enable the input to be processed in parallel. Rather than process the byte stream 8 bits at a time, a pointer is used to access the data several bytes at a time. Each vertex of the machine is labelled using a multi byte value. The value of the sequence of bytes pointed at by the pointer is then used to traverse the vertices of the machine. This means that several bytes of the input are processed for each transition of the machine, which improves the throughput. In effect this can be thought of as running several single character machines in parallel, i.e. the state machine design exploits the machine word size to enable parallel processing in software.
  • More generally, consider the example of searching for a hyperlink. The pattern is:
  • href=“http://URL”
  • In this case the pair of labels are:
  • href=“http:// and ”
  • The labels are separated by a sequence of characters from the valid set of characters that can be used within a URL. The example is shown in FIG. 5.
  • Starting at point 78, the sequence href=“http:// 79 takes the search to point 80. From point 80 a symbol from the set ChURL (the set of valid URL characters) 82 takes the search to point 85. From point 80 a symbol that is not in the set ChURL (!ChURL) 81 takes the search to point 83 and the search fails. From point 85 a valid URL character 86 loops the search back to point 85. From point 85 an invalid URL character 84 results in failure 83. From point 85 the quote character 87 takes the search to point 88. At this point a valid hyperlink has been found and can be extracted.
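  • By way of illustration only, the hyperlink search between the pair of labels might be sketched as follows. The function name `extract_hyperlinks`, the use of a plain ASCII double quote for the quote character and the membership of the set ChURL are assumptions of this sketch.

```python
import string

PREFIX = 'href="http://'  # opening label (ASCII quote assumed)
QUOTE = '"'               # closing label

# Assumption: ChURL is letters, digits and common URL punctuation.
CH_URL = set(string.ascii_letters + string.digits + "-._~:/?#[]@!$&'()*+,;=%")


def extract_hyperlinks(text):
    """Return the URL part of every href="http://URL" found in text."""
    links, pos = [], 0
    while True:
        start = text.find(PREFIX, pos)
        if start == -1:
            return links
        first = start + len(PREFIX)
        i = first
        # Loop on valid URL characters.
        while i < len(text) and text[i] in CH_URL:
            i += 1
        # At least one URL character, then the closing quote, is required.
        if i > first and i < len(text) and text[i] == QUOTE:
            links.append(text[first:i])
            pos = i + 1
        else:
            pos = first  # failure: resume the search after this opening label
```

  • For example, under these assumptions `extract_hyperlinks('<a href="http://www.roke.co.uk">x</a>')` yields the single entry `www.roke.co.uk`.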
  • When searching a page for a title, the pattern is:
  • <title> page title </title>
  • In this case the pair of labels are:
  • <title> and </title>
  • The labels are separated by a sequence of characters from the set A-Z, a-z, 0-9 as illustrated in FIG. 6.
  • Starting at 70 the sequence <title> 71 takes the search to point 72. At point 72 the characters A-Z, a-z, 0-9 (73) loop the search back to point 72. At point 72 the symbols in the set !(A-Z, a-z, 0-9) !(</title>) 76 take the search to point 77 and the search fails. At point 72 the sequence </title> 74 takes the search to point 75 and the end. Thus, the identification of the pair of sequences <title> </title> identifies a page title between them.
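  • By way of illustration only, the page title search between the pair of labels might be sketched as follows. The function name `extract_title` is an assumption of this sketch, as is the addition of the space character to the set A-Z, a-z, 0-9 so that multi-word titles are accepted.

```python
import string

OPEN, CLOSE = "<title>", "</title>"

# A-Z, a-z, 0-9 as in the description; the space is an added assumption.
TITLE_CHARS = set(string.ascii_letters + string.digits + " ")


def extract_title(text):
    """Return the text between <title> and </title>, or None if the search fails."""
    start = text.find(OPEN)
    if start == -1:
        return None
    first = start + len(OPEN)
    i = first
    while i < len(text):
        if text.startswith(CLOSE, i):  # closing label found: end of search
            return text[first:i]
        if text[i] not in TITLE_CHARS:  # neither a title character nor </title>
            return None
        i += 1
    return None  # input ended before the closing label
```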
  • Alternatively, the search may be for a date-time format.
  • The pattern is:
  • Jan. 1 2008 SPACE10:20:22
  • In this case the pair of labels are:
  • Month and :NUM NUM !(NUM)
  • The month can be one from the set of patterns Jan, Feb, Mar, Apr, May, Jun, Jul, Aug, Sep, Oct, Nov, Dec. NUM indicates one of the characters 0-9 and !(NUM) means not one of the characters 0-9. In this case a bridge character is needed to link the date and time parts. A suitable bridge is the SPACE character after the year. The example is shown in FIG. 7.
  • Starting at point 89, a valid month 90 moves the search to point 91. From point 91 any character 92 takes the search to point 93. At point 93 any character loops the search back to point 93. At point 93 the SPACE character 95 takes the search to point 96. At point 96 any character 97 takes the search to point 98. At point 98 any character 99 loops the search back to point 98. At point 98 the sequence :NUMNUM!(NUM) 100 completes the search 101.
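  • By way of illustration only, the date-time search might be approximated with a regular expression standing in for the state machine of FIG. 7; the regular expression engine of the Python standard library is used here instead of an explicit machine. The name `find_date_time` is an assumption of this sketch, as are the shortest-match reading of the ‘any character’ loops and the acceptance of the end of the input in place of the !(NUM) character.

```python
import re

MONTHS = "Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec"

# Month label, one or more characters, the SPACE bridge, one or more characters,
# then the closing label ':NUMNUM' not followed by a further digit.
DATE_TIME = re.compile(r"(?:%s).+? .+?:[0-9][0-9](?![0-9])" % MONTHS)


def find_date_time(text):
    """Return the first section matching the date-time format, or None."""
    m = DATE_TIME.search(text)
    return m.group(0) if m else None
```

  • Because the intermediate transitions accept any character, the first occurrence of the closing label ends the match in this sketch: for an input containing 10:20:22 the match stops after the minutes, since :20 followed by ‘:’ already satisfies :NUMNUM!(NUM).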
  • The present invention allows sections of data to be identified and extracted. Although the examples have been described using hyperlinks and domain names, the invention can be applied to many other end user identifier types including email address identification; URI/URL identification; Session Initiation Protocol (SIP) URI identification; E.164 telephone number detection; tag detection in other data formats; IP addresses, port range, protocol and session identifier detection; xml data structures, xml objects; HTML structures and objects; and detection of content types and identification of content from packet payloads. The basic method can be improved to increase throughput and processing speed by use of an anchor structure, or looking for an ngram containing an anchor symbol.
  • The combination of separate encoded sequences represented by smaller state machines into a group of state machines to produce the full format of an end user identifier, or label, allows labels of arbitrary complexity to be detected. Further improvements in throughput arise from the use of parallel processing, exploiting machine word size to run several instances of a super machine in parallel.
  • FIGS. 8A-8D illustrate exemplary extraction and filtering arrangements for electronic mail in accordance with the present invention. In the arrangement of FIG. 8A a separate filtering and extraction server (e.g., SPAM filter) 805A is provided for performing the extraction and filtering described above. SPAM filter 805A can be executed by an application specific integrated circuit (ASIC), microprocessor executing computer code, field programmable gate array and/or the like to perform the extraction and filtering functions. The SPAM filter 805A is coupled to an e-mail server 810A, which in turn is coupled to a terminal 815A. Terminal 815A can be any type of terminal, including a desktop computer, laptop computer and/or a wireless computing device (e.g., a wireless telephone and/or e-mail device). Terminal 815A includes an e-mail client 820A for receiving the e-mails that pass from SPAM filter 805A through e-mail server 810A to terminal 815A. Terminal 815A can include an application specific integrated circuit (ASIC), microprocessor executing computer code, field programmable gate array and/or the like to execute the e-mail client. The e-mails can be output on printer 825, display 830 or any other type of output device. In particular, if an e-mail is filtered then it would not be provided to terminal 815A, whereas those that are not filtered would be provided to the terminal. In other words, the e-mails that are discarded by SPAM filter 805A are those that are passed to the terminal, whereas those that are output from lookup table 6 are filtered and not passed to the terminal.
  • In the arrangement of FIG. 8B SPAM filter 805B can be included in e-mail server 810B. In this arrangement SPAM filter 805B can be a separate program on the same hardware as the e-mail server 810B and/or can be a program executing within the e-mail server program.
  • In the arrangement of FIG. 8C SPAM filter 805C can be included in terminal 815C. In this arrangement SPAM filter 805C can be a program executing on terminal 815C.
  • In the arrangement of FIG. 8D SPAM filter 805D can be included in e-mail client 820D. In this arrangement SPAM filter 805D can be, for example, a plug-in for e-mail client 820D.
  • FIGS. 9A-9D illustrate exemplary extraction and filtering arrangements for URLs in accordance with the present invention. In the arrangement of FIG. 9A a separate server 905A is provided for performing the extraction and filtering described above. Server 905A can include an application specific integrated circuit (ASIC), microprocessor executing computer code, field programmable gate array and/or the like to perform the extraction and filtering functions. The extraction and filtering server 905A is coupled to a web server 910A, which in turn is coupled to a terminal 915A. Terminal 915A can be any type of terminal, including a desktop computer, laptop computer and/or a wireless computing device (e.g., a wireless telephone and/or e-mail device). Terminal 915A includes a browser client 920A for browsing web pages that pass from the extraction and filtering server 905A through web server 910A to terminal 915A. Terminal 915A can include an application specific integrated circuit (ASIC), microprocessor executing computer code, field programmable gate array and/or the like to execute the browser client. The web pages can be output on printer 925, display 930 or any other type of output device. In particular, if a web page is filtered then it would not be provided to terminal 915A, whereas those that are not filtered would be provided to the terminal. In other words, the web pages that are discarded by the extraction and filtering server 905A are those that are passed to the terminal, whereas those that are output from lookup table 6 are filtered and not passed to the terminal.
  • In the arrangement of FIG. 9B extraction and filtering server 905B can be included in web server 910B. In this arrangement server 905B can be a separate server executing on the same hardware as the web server 910B and/or can be a program executing within the web server program.
  • In the arrangement of FIG. 9C extraction and filtering server 905C can be included in terminal 915C. In this arrangement server 905C can be a separate server or can be a program executing on terminal 915C.
  • In the arrangement of FIG. 9D extraction and filtering server 905D can be included in browser client 920D. In this arrangement server 905D can be, for example, a plug-in for browser client 920D.
  • Although FIGS. 8A-8D and 9A-9D are described in connection with so-called blacklists, in which a match with the lookup table causes the email or web page to be excluded and not delivered to the terminal, the present invention can also be implemented with so-called whitelists. In this case a match with the lookup table allows the email or web page to be delivered to the terminal and a failure to match with the lookup table excludes the email or web page from being delivered.
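  • By way of illustration only, the blacklist and whitelist behaviour described above might be sketched as a single delivery decision. The function name `deliver`, the `mode` parameter and the sample lookup table are assumptions of this sketch.

```python
def deliver(identifiers, lookup_table, mode="blacklist"):
    """Decide whether an email or web page is passed to the terminal.

    identifiers: end point identifiers extracted from the item.
    lookup_table: specific examples held by the comparison stage.
    """
    matched = any(ident in lookup_table for ident in identifiers)
    if mode == "blacklist":
        return not matched  # a match excludes the item
    if mode == "whitelist":
        return matched      # only a match allows delivery
    raise ValueError("mode must be 'blacklist' or 'whitelist'")


# Hypothetical lookup table of specific end point identifiers.
table = {"spam.example.com"}
```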
  • It will be recognized from the discussion of FIGS. 8A-8D and 9A-9D above that the designation of the filtering and extraction element as being a server is used to cover a variety of different arrangements, including a physical server, a server program, a regular executable program and a plug-in program. The term server should be interpreted accordingly in connection with the claims.
  • The foregoing disclosure has been set forth merely to illustrate the invention and is not intended to be limiting. Since modifications of the disclosed embodiments incorporating the spirit and substance of the invention may occur to persons skilled in the art, the invention should be construed to include everything within the scope of the appended claims and equivalents thereof.

Claims (23)

1. A method of extracting sections of a data stream comprising end point identifiers, the sections comprising a set of sequences; wherein each sequence is encoded separately and coupled together to define the section; the method comprising:
determining a combination of at least two sequences of the set;
comparing the combination of sequences with sequences in the data stream; and
rejecting or accepting extraction of the section of the data stream based upon the result of the comparison,
wherein when the combination of sequences does not include a start and end marker for the section, a search for the start and end markers is carried out before the section is extracted.
2. A method according to claim 1, wherein extraction of the section is accepted when the combination of sequences in any order matches sequences in the section of the data stream.
3. A method according to claim 1, wherein extraction of the section is rejected when the combination of sequences does not match any of the sequences in the section of the data stream, and
wherein the search continues for further instances of the combination of sequences.
4. A method according to claim 1, wherein a sequence comprises a series of bits having a predetermined format, including an anchor or a bridge.
5. A method according to claim 4, wherein the anchor is a statistically rare, or low probability sequence in the data stream.
6. A method according to claim 5, wherein the probability of occurrence is less than 1%.
7. A method according to claim 1, wherein the combination of sequences comprises an anchor and a sequence adjacent to the anchor.
8. A method according to claim 1, wherein the combination of sequences comprises the first and last sequence of the section.
9. A method according to claim 1, wherein the combination of sequences comprises more than one sequence associated with an anchor,
wherein the combination of anchor and sequences to form the section is determined, and
wherein the section is only extracted when all sequences forming the section are present.
10. A method according to claim 1, wherein searches for combinations of sequences are carried out in parallel on different sections of the data stream.
11. A method according to claim 1, wherein each sequence comprises a series of bits of data, or multiple bytes of data.
12. A method according to claim 1, wherein the end point identifiers comprise one of a domain name, an email address, a uniform resource locator, a telephone number, or a date and time format.
13. A method according to claim 1, wherein each sequence is encoded in a separate state machine.
14. A method according to claim 13, wherein multiple state machines are combined to represent the section.
15. A method according to claim 14, wherein a bridge provides a transition between separate state machines representing the sequences of the section.
16. A method according to claim 1, the method further comprising:
filtering the extracted sections of the data stream, the filtering comprising determining a set of characters of interest;
testing each section of the data stream for the presence of one or more of the set of characters of interest; and
extracting sections in which at least one of the characters is present.
17. A method according to claim 16, comprising:
determining a further set of characters of interest;
testing for at least one character from the further set of characters in the portion of the data stream; and
extracting sections in which at least one of the characters from the further sets of characters is also present in the section.
18. A method according to claim 16, wherein the extracted sections are stored in a store.
19. A method according to claim 16, wherein the extracted sections are
input to a comparison stage;
compared with specific examples of end point identifiers; and
discarded when the section does not match one of the specific examples in the comparison stage.
20. The method of claim 19, the determining, testing and extraction are performed by an extraction and filtering server that is a stand-alone server, a component of another server, a component of a terminal or a component of a client.
21. The method of claim 20, comprising, wherein when extracted sections match one of the specific examples in the comparison stage then data associated with the extracted sections is not passed to a client and when extracted section do not match one of the specific examples in the comparison stage then data associated with the extracted section is passed to the client.
22. The method of claim 20, comprising, wherein when extracted sections does not match one of the specific examples in the comparison stage then data associated with the extracted sections is not passed to a client and when extracted section matches one of the specific examples in the comparison stage then data associated with the extracted section is passed to the client.
23. The method of claim 22, wherein the data associated with the extracted section is an email or a web page.
US12/505,147 2007-01-08 2009-07-17 Method of Extracting Sections of a Data Stream Abandoned US20090276427A1 (en)

Applications Claiming Priority (5)

Application Number Priority Date Filing Date Title
GB0700928.5 2007-01-18
GB0700926A GB2445763A (en) 2007-01-18 2007-01-18 Metadata filtering
GB0700926.9 2007-01-18
GB0700928A GB0700928D0 (en) 2007-01-18 2007-01-18 Method to process metadata
PCT/GB2008/000184 WO2008087438A1 (en) 2007-01-18 2008-01-18 A method of extracting sections of a data stream

Related Parent Applications (1)

Application Number Title Priority Date Filing Date
PCT/GB2008/000184 Continuation WO2008087438A1 (en) 2007-01-08 2008-01-18 A method of extracting sections of a data stream

Publications (1)

Publication Number Publication Date
US20090276427A1 true US20090276427A1 (en) 2009-11-05

Family

ID=39224886

Family Applications (2)

Application Number Title Priority Date Filing Date
US12/505,179 Expired - Fee Related US8380795B2 (en) 2007-01-18 2009-07-17 Method of filtering sections of a data stream
US12/505,147 Abandoned US20090276427A1 (en) 2007-01-08 2009-07-17 Method of Extracting Sections of a Data Stream

Family Applications Before (1)

Application Number Title Priority Date Filing Date
US12/505,179 Expired - Fee Related US8380795B2 (en) 2007-01-18 2009-07-17 Method of filtering sections of a data stream

Country Status (5)

Country Link
US (2) US8380795B2 (en)
EP (2) EP2122504B1 (en)
CA (2) CA2675756A1 (en)
DK (1) DK2122503T3 (en)
WO (2) WO2008087438A1 (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20110299515A1 (en) * 2006-12-15 2011-12-08 Research In Motion Limited Methods and apparatus for establishing wlan communications using an essid created based on a predetermined algorithm and a domain name
US20130111055A1 (en) * 2011-10-28 2013-05-02 Jichuan Chang Data stream operations
US10313348B2 (en) * 2016-09-19 2019-06-04 Fortinet, Inc. Document classification by a hybrid classifier

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103559244B (en) * 2013-10-28 2016-08-24 东软集团股份有限公司 The acquisition methods of message body based on mbx form and system

Citations (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP0971294A2 * 1995-07-19 2000-01-12 Lernout & Hauspie Speech Products N.V. Method and apparatus for automated search and retrieval processing
US20020087537A1 (en) * 2000-12-29 2002-07-04 Evans David J. Method and apparatus for searching a data stream for character patterns
US20020101877A1 (en) * 2001-01-30 2002-08-01 Altima Communications, Inc. Traffic policing of packet transfer in a dual speed hub
US20030009453A1 (en) * 2001-07-03 2003-01-09 International Business Machines Corporation Method and system for performing a pattern match search for text strings
US20030033531A1 (en) * 2001-07-17 2003-02-13 Hanner Brian D. System and method for string filtering
US20040008740A1 (en) * 1998-04-15 2004-01-15 Baker Scott L. Method and apparatus for interleaving a data stream
US20050010952A1 (en) * 2003-01-30 2005-01-13 Gleissner Michael J.G. System for learning language through embedded content on a single medium
US20050132070A1 (en) * 2000-11-13 2005-06-16 Redlich Ron M. Data security system and method with editor
WO2005074213A1 (en) * 2004-01-20 2005-08-11 Cloudmark, Inc. Method and system for url-based screening of electronic communications
US20060095262A1 (en) * 2004-10-28 2006-05-04 Microsoft Corporation Automatic censorship of audio data for broadcast
US20060268875A1 (en) * 2005-05-24 2006-11-30 The Boeing Company Method and apparatus for user identification in computer traffic
US7149780B2 (en) * 2001-12-14 2006-12-12 Pitney Bowes Inc. Method for determining e-mail address format rules
US20070118904A1 (en) * 2003-06-04 2007-05-24 Microsoft Corporation Origination/destination features and lists for spam prevention
US7305635B1 (en) * 2005-02-04 2007-12-04 Dafca, Inc. Serial implementation of assertion checking logic circuit
US20090019220A1 (en) * 2006-01-31 2009-01-15 Roke Manor Research Limited Method of Filtering High Data Rate Traffic

Family Cites Families (11)

Publication number Priority date Publication date Assignee Title
US3465299A (en) 1967-01-26 1969-09-02 Ibm Information translating data comparing systems
US5452451A (en) 1989-06-15 1995-09-19 Hitachi, Ltd. System for plural-string search with a parallel collation of a first partition of each string followed by finite automata matching of second partitions
EP0893779B1 (en) * 1997-07-22 2003-06-18 Sony International (Europe) GmbH Videoviewing device providing an automatic internet access
US6075550A (en) * 1997-12-23 2000-06-13 Lapierre; Diane Censoring assembly adapted for use with closed caption television
US6768991B2 (en) 2001-05-15 2004-07-27 Networks Associates Technology, Inc. Searching for sequences of character data
US7320020B2 (en) * 2003-04-17 2008-01-15 The Go Daddy Group, Inc. Mail server probability spam filter
US7155484B2 (en) * 2003-06-30 2006-12-26 Bellsouth Intellectual Property Corporation Filtering email messages corresponding to undesirable geographical regions
US7457823B2 (en) * 2004-05-02 2008-11-25 Markmonitor Inc. Methods and systems for analyzing data related to possible online fraud
US7500265B2 (en) * 2004-08-27 2009-03-03 International Business Machines Corporation Apparatus and method to identify SPAM emails
JP4880675B2 (en) * 2005-05-05 2012-02-22 シスコ アイアンポート システムズ エルエルシー Detection of unwanted email messages based on probabilistic analysis of reference resources
US8095602B1 (en) * 2006-05-30 2012-01-10 Avaya Inc. Spam whitelisting for recent sites

Patent Citations (15)

Publication number Priority date Publication date Assignee Title
EP0971294A2 (en) * 1995-07-19 2000-01-12 Lernout & Hauspie Speech Products N.V. Method and apparatus for automated search and retrieval processing
US20040008740A1 (en) * 1998-04-15 2004-01-15 Baker Scott L. Method and apparatus for interleaving a data stream
US20050132070A1 (en) * 2000-11-13 2005-06-16 Redlich Ron M. Data security system and method with editor
US20020087537A1 (en) * 2000-12-29 2002-07-04 Evans David J. Method and apparatus for searching a data stream for character patterns
US20020101877A1 (en) * 2001-01-30 2002-08-01 Altima Communications, Inc. Traffic policing of packet transfer in a dual speed hub
US20030009453A1 (en) * 2001-07-03 2003-01-09 International Business Machines Corporation Method and system for performing a pattern match search for text strings
US20030033531A1 (en) * 2001-07-17 2003-02-13 Hanner Brian D. System and method for string filtering
US7149780B2 (en) * 2001-12-14 2006-12-12 Pitney Bowes Inc. Method for determining e-mail address format rules
US20050010952A1 (en) * 2003-01-30 2005-01-13 Gleissner Michael J.G. System for learning language through embedded content on a single medium
US20070118904A1 (en) * 2003-06-04 2007-05-24 Microsoft Corporation Origination/destination features and lists for spam prevention
WO2005074213A1 (en) * 2004-01-20 2005-08-11 Cloudmark, Inc. Method and system for url-based screening of electronic communications
US20060095262A1 (en) * 2004-10-28 2006-05-04 Microsoft Corporation Automatic censorship of audio data for broadcast
US7305635B1 (en) * 2005-02-04 2007-12-04 Dafca, Inc. Serial implementation of assertion checking logic circuit
US20060268875A1 (en) * 2005-05-24 2006-11-30 The Boeing Company Method and apparatus for user identification in computer traffic
US20090019220A1 (en) * 2006-01-31 2009-01-15 Roke Manor Research Limited Method of Filtering High Data Rate Traffic

Cited By (5)

Publication number Priority date Publication date Assignee Title
US20110299515A1 (en) * 2006-12-15 2011-12-08 Research In Motion Limited Methods and apparatus for establishing WLAN communications using an ESSID created based on a predetermined algorithm and a domain name
US8488576B2 (en) * 2006-12-15 2013-07-16 Research In Motion Limited Methods and apparatus for establishing WLAN communications using an ESSID created based on a predetermined algorithm and a domain name
US20130111055A1 (en) * 2011-10-28 2013-05-02 Jichuan Chang Data stream operations
US8954599B2 (en) * 2011-10-28 2015-02-10 Hewlett-Packard Development Company, L.P. Data stream operations
US10313348B2 (en) * 2016-09-19 2019-06-04 Fortinet, Inc. Document classification by a hybrid classifier

Also Published As

Publication number Publication date
WO2008087438A1 (en) 2008-07-24
EP2122503A1 (en) 2009-11-25
DK2122503T3 (en) 2013-02-18
EP2122504A1 (en) 2009-11-25
CA2675820A1 (en) 2008-07-24
EP2122503B1 (en) 2012-11-14
WO2008087429A8 (en) 2008-10-30
US20090282119A1 (en) 2009-11-12
US8380795B2 (en) 2013-02-19
CA2675756A1 (en) 2008-07-24
EP2122504B1 (en) 2014-10-01
WO2008087429A1 (en) 2008-07-24

Similar Documents

Publication Publication Date Title
US10949641B2 (en) Fast signature scan
US8250016B2 (en) Variable-stride stream segmentation and multi-pattern matching
US8161381B2 (en) System and method for presenting message threads
US7054855B2 (en) Method and system for performing a pattern match search for text strings
US8015208B2 (en) Systems and methods for processing regular expressions
US20150095359A1 (en) Volume Reducing Classifier
US7895515B1 (en) Detecting indicators of misleading content in markup language coded documents using the formatting of the document
CN108021598B (en) Page extraction template matching method and device and server
US20090276427A1 (en) Method of Extracting Sections of a Data Stream
US20060268875A1 (en) Method and apparatus for user identification in computer traffic
CN112054992B (en) Malicious traffic identification method and device, electronic equipment and storage medium
CN111061972B (en) AC searching optimization method and device for URL path matching
KR20120078030A (en) System and method for detecting malicious content in a non-pe file
US20180121410A1 (en) Regular expression searching
WO2010047683A1 (en) Fast signature scan
KR100459379B1 (en) Method for producing basic data for determining whether or not each electronic document is similar and System therefor
US20050203934A1 (en) Compression of logs of language data
US20190207958A1 (en) Multi-pattern policy detection system and method
KR20070003488A (en) Regular expression representing method for efficient pattern matching in tcam and pattern matching method
JP4690232B2 (en) Information processing apparatus, software registration method, and program
Golan et al. Dynamic dictionary matching in the online model
Rowe Finding and rating personal names on drives for forensic needs
CN116522329A (en) Webshell file detection method and device
CN115270120A (en) Malicious URL blocking method
Li et al. TTSF: A Novel Two-Tier Spam Filter

Legal Events

Date Code Title Description
AS Assignment

Owner name: ROKE MANOR RESEARCH LIMITED, UNITED KINGDOM

Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNOR:DUXBURY, NEIL;REEL/FRAME:023177/0278

Effective date: 20090713

STCB Information on status: application discontinuation

Free format text: ABANDONED -- FAILURE TO RESPOND TO AN OFFICE ACTION