WO2008030568A2 - Feed crawling system and method and spam feed filter - Google Patents


Info

Publication number
WO2008030568A2
Authority
WO
WIPO (PCT)
Prior art keywords
crawler
urls
feed
spam
crawling
Prior art date
Application number
PCT/US2007/019558
Other languages
French (fr)
Other versions
WO2008030568A3 (en)
Inventor
James Ruga
Rebecca Berrigan
Original Assignee
Feedster, Inc.
Priority date
Filing date
Publication date
Application filed by Feedster, Inc.
Publication of WO2008030568A2
Publication of WO2008030568A3

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques

Definitions

  • This invention pertains generally to systems and methods for developing information from a network of computers such as from the Internet, and more particularly to systems and methods for feed crawling and for storing and analyzing the data or information from the crawled feeds.
  • RSS and Atom are content syndication specifications of XML feed formats that can be used in order to publish text, images, audio, video and combinations of these items. Collectively, in general these are content feeds that can be easily published and instantly syndicated out to millions of potential users on the Internet.
  • Crawling or spidering is a common method of gathering large bodies of information from the Internet. Crawlers generally have similar characteristics and crawler design is typically focused on gathering as much information as possible and storing that information for later cataloging and analysis and search.
  • A general crawling process typically includes a number of traits.
  • The process typically includes a method of sending website urls to the crawler.
  • The process typically also includes a method of sending requests from the crawler to the Internet and receiving the responses.
  • The process typically includes a method of parsing the results received from the Internet to gather data.
  • The process typically includes a method of storing the data for indexing, search and analysis.
  • Feeds are syndicated XML messages announcing updates (of any type of information) to a web site on the internet.
  • a feed search engine may collect internet feeds and index them to provide a searchable repository for the public.
  • the repository may include feeds that are deemed to masquerade as information updates yet are not original, intelligible or informative and one may refer to these feeds as Spam.
  • entire web sites are constructed to imitate a weblog with the primary objective being revenue generation for the owner of the weblog via web click based advertising and/or leading the user to a website where the user is exposed to advertising or enticed to purchase goods. These weblogs are referred to as 'Splogs' as they are considered spam, yet take the form of a weblog.
  • FIG. 1 illustrates an embodiment of a crawler system.
  • FIG. 2 illustrates another embodiment of a crawler system.
  • FIG. 3 illustrates an embodiment of a job server used in a crawler system.
  • FIG. 4 illustrates an embodiment of a process of dispatching jobs to crawlers.
  • FIG. 5 illustrates an embodiment of a process of crawling feeds.
  • FIG. 6 illustrates an embodiment of a process of determining how to reprioritize urls.
  • FIG. 7 illustrates an embodiment of a network which may be used in conjunction with embodiments of a feed crawling system.
  • FIG. 8 illustrates an embodiment of a machine which may be used in conjunction with embodiments of a feed crawling system.
  • FIG. 9 illustrates an embodiment of a feed spam filter.
  • FIG. 10 illustrates another embodiment of a feed spam filter.
  • FIG. 11 illustrates an embodiment of a feed spam filter with inputs and outputs.
  • FIG. 12 illustrates an embodiment of a system including a feed spam filter.
  • FIG. 13 illustrates an embodiment of a process of filtering spam feeds.
  • FIG. 14 illustrates another embodiment of a process of filtering spam feeds.
  • FIG. 15 illustrates an embodiment of a system or network in which a feed spam filter may operate.
  • FIG. 16 illustrates an embodiment of a system which may operate with a feed spam filter.
  • A feed crawling system, method, and computer program product. A spam filter and method for filtering.
  • A system and method for feed crawling with spam filtering. A spam filter and method for filtering.
  • a computer system for crawling content feeds comprising: at least one processor for executing at least one process; a database providing a storage for storing location information or universal reference locators (urls); a first process for prioritizing a list of urls to be crawled; a parallelized crawler process for crawling the urls and storing the results in the database; and an indexing process for indexing the database for a user to search.
  • a method for crawling content feeds comprising: at least one processor for executing at least one process; providing a database providing a storage for storing location information or universal reference locators (urls); executing a first process for prioritizing a list of urls to be crawled; executing a parallelized crawler process for crawling the urls and storing the results in the database; and executing an indexing process for indexing the database for a user to search.
  • a system, method and apparatus is provided for a feed crawling system.
  • the specific embodiments described in this document represent examples or embodiments of the present invention, and are illustrative in nature rather than restrictive.
  • numerous specific details are set forth in order to provide a thorough understanding of the invention. It will be apparent, however, to one skilled in the art that the invention can be practiced without these specific details. In other instances, structures and devices are shown in block diagram form in order to avoid obscuring the invention.
  • a feed search system may use a crawler system to collect the data from feeds and organize it into a searchable body of knowledge, which is convenient for an end user to search based on keywords representing topics.
  • the crawling process by its nature is typically random in the order in which feeds are crawled. This means that feeds are actually cataloged in a searchable database in the order in which they were crawled, and there is no predictable way to dictate when information on a particular topic will become available in the search results.
  • a user searching for information on a particular event of interest to them, whether that event is regional or global may or may not find information on that event simply due to the fact that a particular feed may or may not have been crawled on that topic. This may greatly impair the usability of the search engine.
  • the timeliness of information in the realm of feeds is important - and one reason for the existence of feeds.
  • the ease with which information can be syndicated means that more people can publish information on regional and global events faster than all of that information can be collected and cataloged by any crawling system. It is thus potentially important to prioritize the crawling process and intelligently collect data from feeds that are of more interest to search engine users, before crawling information that is of less interest.
  • XML feeds can update several times per minute, and can carry staggering amounts of information compared to HTML. The combination of these factors makes the needs of crawling XML feeds a specialized problem to solve.
  • a feed crawling system may differ from HTML crawlers in a number of ways.
  • HTML crawling involves a great deal of parsing (separating data from style information) and is prone to tainting the relevancy of the data (spamming).
  • Updates occur more frequently on feeds than on HTML only web sites; so feeds must be crawled frequently to index new data in a timely fashion.
  • Each feed overall must be prioritized based on amount of relevance to a particular topic for efficiency; while at the same time classifying each data item from the feed according to its relevancy to the same or similar list of topics. Since there is potentially several orders of magnitude more raw data from feeds than from HTML websites, only feeds that have topics of interest to users of the search engine should be crawled in some embodiments. In other embodiments, frequency of crawling is determined based on apparent relevance. Additionally, a crawler may distinguish possible spam feeds and posts from genuine feeds and posts on a particular topic to make search results more useful.
  • a feed crawling system uses a combination of components from a feed search engine to determine what feeds to crawl and at what priority. These search engine components include: a relevancy engine, a spam filter, an indexer and a search API.
  • the anatomy of the crawler itself includes a crawler job server where input from these components is synthesized to make the decisions necessary to control a parallelized distributed crawling network of machines.
  • this solution to crawling XML feeds incorporates much of the accepted methods of managing networks of crawlers and parallelization. It builds on these methodologies to make the crawling more efficient from a hardware usage perspective, and at the same time attempts to optimize the available hardware by attempting to optimize what the crawler spends time crawling.
  • The crawler job server helps make the whole system efficient and creates the statistical information used to make critical decisions about where to allocate crawling resources. This potentially provides an ability to adapt the crawls in order to respond to the need for results in searches on keyword terms. Such adaptability allows for crawling of feeds to occur in a timely manner based on what users currently are searching for, and thus are likely to continue to search for.
  • the parallelized distributed crawler is the part of the crawler that does the actual work in some embodiments. This over-emphasizes the importance of the parallelized distributed crawler, but the parallelized distributed crawler may perform a number of tasks in one embodiment.
  • the parallelized distributed crawler may determine if a particular url/domain has been blacklisted (spam) - this may be through use of a separate spam filter, for example.
  • the parallelized distributed crawler may also request the content from a url.
  • the parallelized distributed crawler may determine if the received content is in fact an XML feed.
  • The parallelized distributed crawler may determine what type of XML feed data has been received, such as atom, rss, rdf or opml, for example. If an HTML page is received instead of a valid XML feed, the parallelized distributed crawler may attempt to find a corresponding XML feed for a site. For an actual feed, the parallelized distributed crawler may prepare the feed for parsing (normalization to UTF-8, decoding characters, correcting well-formedness errors, for example). Additionally, the parallelized distributed crawler may parse the feed (e.g. extracting the data based on the feed type schema). Likewise, the parallelized distributed crawler may apply a spam filter to determine if the content (and/or feed) is spam. This may be separate, based on received content, rather than based on a url, for example.
  • the parallelized distributed crawler may analyze the received data to determine attributes (and/or metadata), for example. This may include type of content (text, images, audio, video, etc). It may also include overall language of the content (and of the feed by percentage, for example). It may further include relevant taxonomy of the content (and of the feed).
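  • The feed-type determination described above can be sketched as follows. This is a minimal illustration, not the patented implementation: it assumes the type can be read from the root element of the document, and the function names `local_name` and `detect_feed_type` are hypothetical.

```python
import xml.etree.ElementTree as ET


def local_name(tag: str) -> str:
    """Strip ElementTree's '{namespace}' prefix from a tag name."""
    return tag.rsplit("}", 1)[-1]


def detect_feed_type(content: str):
    """Return 'atom', 'rss', 'rdf' or 'opml', or None if no feed is recognised.

    Assumption: the root element alone identifies the feed type.
    """
    try:
        root = ET.fromstring(content)
    except ET.ParseError:
        # Not well-formed XML; this may be an HTML page rather than a feed.
        return None
    root_names = {"feed": "atom", "rss": "rss", "rdf": "rdf", "opml": "opml"}
    return root_names.get(local_name(root.tag).lower())
```

A `None` result would be the point at which a crawler could fall back to looking for a corresponding XML feed for the site.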
  • the parallelized distributed crawler may store the data into a database schema that is suitable for the indexer to work from. In one embodiment, the parallelized crawlers run on one computer consuming as much of its resources as is practical (or as is allowed). Parallelization may be simply accomplished by spawning off child crawler processes each with their own unique list of urls to process. Alternatively, this can be accomplished by threading on some platforms. On completion of the processing of the list of URLs, the child crawler reports back to the parent with the results of the crawl; and the parent is then free to spawn a new child with more urls to crawl.
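  • The parent/child arrangement described above can be sketched with threads, which the text names as an alternative to child processes. The names `split_into_batches`, `child_crawl` and `parent_crawl`, the batch size, and the injected `fetch` callable are assumptions made for illustration only.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_batches(urls, batch_size):
    """Give each child its own unique list of urls."""
    return [urls[i:i + batch_size] for i in range(0, len(urls), batch_size)]


def child_crawl(urls, fetch):
    """One child: crawl its own list and build a report for the parent."""
    report = {}
    for url in urls:
        try:
            report[url] = ("ok", fetch(url))
        except Exception as exc:  # a failed url must not stop the batch
            report[url] = ("error", str(exc))
    return report


def parent_crawl(urls, fetch, batch_size=2, max_children=4):
    """Parent: hand out batches and merge the reports as children finish."""
    merged = {}
    batches = split_into_batches(urls, batch_size)
    with ThreadPoolExecutor(max_workers=max_children) as pool:
        for report in pool.map(lambda batch: child_crawl(batch, fetch), batches):
            merged.update(report)
    return merged
```

Because the pool reuses a worker as soon as it finishes a batch, the parent is effectively free to start a new child with more urls once an earlier one reports back.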
  • FIG. 1 illustrates an embodiment of a crawler system.
  • Generic processes 110 include last indexed crawler 103, fast index crawler 106, discovery crawler 109, manual crawler 112, which receives urls from a batch process 115, and ping crawler 118.
  • Most of the generic processes draw data from a database 130 of urls and content to determine which urls should be crawled next, or to prioritize or reprioritize a list of urls for crawling.
  • Ping crawler 118 receives data from a pings database 140 of pings from feeds indicating a feed update.
  • Last indexed crawler 103 thus indicates what was least recently indexed, fast index crawler 106 indicates what url typically updates quickly and has not been crawled recently, discovery crawler 109 indicates what urls are likely to be new feeds, and manual crawler 112 indicates what urls a user has submitted for crawling, for example.
  • a generic parallelized crawler 120 is implemented through use of a number of child crawler processes 125. Results of the child crawler processes 125 are fed through finishing routines 135, which can parse feeds, filter for spam, retrieve feeds in some instances, and provide a database interface, for example.
  • Crawler 120, upon crawling a feed, provides a UDP broadcast to feed stream 170 and to the rest of the system. Feed stream 170 may provide information to a TCP client 175 indicating a feed was crawled, allowing for external notification.
  • a firewall protects the system from unauthorized external access.
  • the system also receives authorized information from internet 165 through feed mesh 160 (a series of hooks into feed subscriptions, for example) and through apache server 155 which may be interfaced with a ping api. Data may also be received through http/telnet interface 150. All of this received data provides inputs to pings database 140, and may provide information to directory and change file robot 145. Robot 145 may be expected to maintain a directory of information about feeds, and to detect when new feeds are introduced, feeds are no longer in existence, feed data formats change, or other changes occur.
  • this system is distributed among various machines or processors.
  • the crawlers can co-exist on multiple networked machines simultaneously, without interfering or overlapping with jobs being crawled by the other machines.
  • jobs may be distributed to crawler machines in the network that are available to handle the added work, such as through load-balancing techniques.
  • the generic processes are embodied as smaller bits of code that exist separate from the actual crawling code itself in some embodiments. Each generic process is specialized in some way to draw its input from various sources of potential feeds to crawl. These generic processes can perform various functions. For example, a process may select and compile the actual list of urls from the database that needs to be crawled. Similarly, a process may periodically check existing HTML web sites for the addition of, or replacement by XML feeds. Likewise, a process may select lists of feed urls to crawl by periodic interval or select lists of feed urls to crawl by overall age since last crawl. Moreover, a process may crawl web sites and feed urls on demand as submitted one at a time. Other processes may be implemented in various embodiments as well.
  • FIG. 2 illustrates another embodiment of a crawler system.
  • Generic configurators 210 generally use processes previously described with respect to FIG. 1.
  • adaptive crawler input 290 receives data from search terms 293 and topical events 296 to provide further options for urls to be crawled.
  • Sequential ping crawler 219 and probabilistic ping crawler 216 use ping data to determine which recently updated feeds to crawl next, with some chosen based on a first-in-first-out model and others chosen based on probability that the updated feed will be searched for.
  • Job server 285 determines which urls should be crawled next by parallelized crawler 120, and directs dispatch of child crawlers 125 in some embodiments. Also, data is stored in a distributed database 230. Thus, the embodiment of FIG. 2 may be adaptable to changing searches of feeds, and may produce more timely results than a crawler without adaptive features.
  • the crawler job server provides business intelligence and performs other functions as well.
  • the job server may accept the lists of urls from the generic processes and determine if urls are relevant to user searched topics.
  • the job server may also mine known XML feeds (and HTML websites to locate additional XML feeds) for additional relevant content.
  • the job server may prioritize the urls to be crawled.
  • the job server may break up the lists of urls into jobs to assign to the distributed crawler machines with available resources and throttle crawl frequency on feeds that have relevance to a popular topic.
  • the job server may delay jobs or stop jobs on the crawler machine network in favor of crawling more popular content on demand.
  • the job server may also record and map topical and statistical trends on topics for advertising targeting and report the statistics in a human readable log file or user interface, for example.
  • FIG. 3 illustrates an embodiment of a job server used in a crawler system.
  • System 300 includes primarily the job server and parallelized crawler, and surrounding components.
  • Job server 285 uses a pool of urls 305 with either an ordered list or tags indicating priority to determine which jobs to dispatch.
  • a url is first checked by spam filter 315 for spam characteristics. The url then is prioritized by prioritizing routines 325, determining if the url is relevant, currently popular, fits an emerging trend, or was selected by a user, for example, and then is passed to the network control of crawler 120.
  • a search api 335 used to search feeds may provide data on what is currently popular, for example.
  • Crawler 120 causes a child process 125 to crawl the url in question, determining what data is available through the url.
  • Crawl job results 365 are thus produced, and may feed back through throttle adjustment module 375 to alter operation of crawler 120.
  • Results 365 are also compiled into statistics through statistics generation module 355 and provided to database 130.
  • statistics generator 355 may also supply information to ad server 385 to allow for monetization or commercialization of the feed search engine and crawler through advertising.
  • the overall crawler relies on feedback from multiple systems. Each such system may have its own specifications with unique properties. Various portions of the crawler focus on how that information is used to make decisions to perform a more efficient crawl.
  • the crawler may use a sorting and prioritizing algorithm to determine what urls to crawl.
  • The crawler job server accepts urls from multiple inputs including lists of urls that it generates itself. These lists represent a possible pool of urls to crawl. From this pool, the urls with most relevance to each of the search terms entered by users of the search engine should be crawled first. This way posts from the feeds on these urls can be added to the search results as quickly as possible, making the search results timely.
  • Each url is assigned a relevancy value between 0 and 1, for example, which is calculated based on multiple components.
  • Such components may include semantic relevancy to keywords and/or groups of keywords, frequency of new postings on that url, popularity of the overall topic, and authority of a feed's content on a particular subject, for example.
  • a crawling priority can be assigned to each url as a function of the relevancy value and the demand for results on a topic (implied from frequency of searches on a particular term over time). At any given instant a snapshot of the pool of urls to be crawled may be sorted based on this priority.
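  • One way to picture the relevancy value and crawling priority described above is the sketch below. The text does not give a formula, so the weighted sum, the weights in `WEIGHTS`, the product used for priority, and all names are assumptions chosen purely for illustration.

```python
from dataclasses import dataclass


@dataclass
class UrlStats:
    """Per-url inputs, each assumed to be normalised to the range 0..1."""
    url: str
    semantic_relevancy: float
    post_frequency: float
    topic_popularity: float
    authority: float
    search_demand: float  # implied from frequency of searches over time


# Hypothetical weights; the source does not specify how components combine.
WEIGHTS = {
    "semantic_relevancy": 0.4,
    "post_frequency": 0.2,
    "topic_popularity": 0.2,
    "authority": 0.2,
}


def relevancy(stats: UrlStats) -> float:
    """Weighted sum of the components, clipped to the range 0..1."""
    total = sum(weight * getattr(stats, name) for name, weight in WEIGHTS.items())
    return max(0.0, min(1.0, total))


def crawl_priority(stats: UrlStats) -> float:
    """Priority as a function of relevancy and demand (assumed: their product)."""
    return relevancy(stats) * stats.search_demand


def sort_pool(pool):
    """Snapshot of the url pool, highest crawling priority first."""
    return sorted(pool, key=crawl_priority, reverse=True)
```

Sorting a fresh snapshot each time keeps the ordering current as demand changes, at the cost of recomputing priorities for the whole pool.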
  • FIG. 4 illustrates an embodiment of a process of dispatching jobs to crawlers.
  • Process 400 includes generating a list of urls, receiving updated information, receiving environmental information, reordering the list of urls, dispatching urls to crawlers, receiving responsive data from crawlers and integrating the responsive data into a database.
  • Process 400 and other processes of this document are implemented as a set of modules, which may be process modules or operations, software modules with associated functions or effects, hardware modules designed to fulfill the process operations, or some combination of the various types of modules, for example.
  • the modules of process 400 and other processes described herein may be rearranged, such as in a parallel or serial fashion, and may be reordered, combined, or subdivided in various embodiments.
  • Process 400 initiates with generation of a list of urls at module 410. This may involve looking up urls or retrieving a previously generated list of urls, and the list may be more in the form of a tagged set of urls rather than a traditional list.
  • update information about feeds is received, such as indications that some feeds have recently been updated, for example.
  • environmental information about searches conducted and external events is received.
  • the data from modules 420 and 430 is used at module 440 to reorder the list of urls or otherwise reprioritize urls for crawling purposes.
  • Urls are then dispatched to a crawler or crawlers to be retrieved at module 450. Responsive data is received from the crawlers at module 460, and the responsive data is integrated into a database or repository at module 470.
  • FIG. 5 illustrates an embodiment of a process of crawling feeds.
  • Process 500 includes receiving a url job, filtering for spam, requesting data at the url, determining if the content is usable, determining a type of feed or attempting to translate data, parsing a feed, filtering a feed, skipping a feed if necessary, analyzing a feed, and returning results.
  • Process 500 begins with receipt of a url job from a job dispatch process, for example, at module 510.
  • A determination is made as to whether the url is a spam url. If so, the url is skipped at module 590, and results are returned at module 580.
  • If the url is not spam, data at the url is requested at module 520. If the content is determined to not be usable at module 530, an attempt is made to translate the feed into something usable at module 535. Such translation may involve transformation of HTML data into XML data, for example. If the translation cannot work, the feed is skipped at module 590. If translation works, or is not needed, the type of feed is determined at module 540.
  • The feed data is then parsed. Additionally, at module 560, the feed data is filtered, for spam for example, and the feed may be skipped at module 590 if it appears to mainly contain spam.
  • The feed is analyzed, extracting information and categorizing that information based on an overall hierarchy, for example.
  • Results are then returned.
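  • The control flow of process 500 can be sketched as a single function. Every step is passed in as a callable, because the text does not specify how each module works; the function name `run_crawl_job`, the parameter names, and the shape of the result dictionary are all hypothetical.

```python
def run_crawl_job(url, *, is_spam_url, fetch, is_usable, translate,
                  detect_type, parse, is_spam_content, analyze):
    """Run one url job and always return a result record."""
    result = {"url": url, "status": "skipped", "reason": None}

    if is_spam_url(url):                    # spam url: skip (module 590)
        result["reason"] = "spam url"
        return result

    content = fetch(url)                    # request data (module 520)
    if not is_usable(content):              # usability check (module 530)
        content = translate(content)        # e.g. HTML to XML (module 535)
        if content is None:                 # translation failed: skip
            result["reason"] = "untranslatable"
            return result

    feed_type = detect_type(content)        # type of feed (module 540)
    items = parse(content, feed_type)       # parse the feed data

    if is_spam_content(items):              # content filter (module 560)
        result["reason"] = "spam content"
        return result

    result.update(status="crawled", feed_type=feed_type,
                  items=items, analysis=analyze(items))
    return result                           # results returned (module 580)
```

Returning a record on every path, including skips, mirrors the text's point that results are returned whether or not the feed was crawled.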
  • Many feeds ping subscribers upon updates to a feed. Feed publishers that wish to be found and indexed within a feed search engine can ping an external publicly available api when they publish updated information on their feeds. This can then trigger the crawler to enter the url to their feed into a queue which is then crawled in the order in which it was received (roughly). By employing the sorting and prioritizing algorithms above, the crawler job system can specifically select urls that are more important to crawl first from these external pings. This potentially greatly increases the speed at which new data appears in the search results.
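  • The difference between crawling pings in arrival order and crawling them by priority can be sketched with a heap. The class name `PingQueue` is hypothetical, and the priority passed to `push` is assumed to come from whatever sorting and prioritizing algorithm is in use.

```python
import heapq
import itertools


class PingQueue:
    """Queue of pinged urls that pops the highest priority first.

    Urls with equal priority come out in arrival order, so the queue
    degrades to first-in-first-out when all priorities are the same.
    """

    def __init__(self):
        self._heap = []
        self._arrival = itertools.count()

    def push(self, url, priority):
        # heapq is a min-heap, so the priority is negated.
        heapq.heappush(self._heap, (-priority, next(self._arrival), url))

    def pop(self):
        return heapq.heappop(self._heap)[2]

    def __len__(self):
        return len(self._heap)
```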
  • FIG. 6 illustrates an embodiment of a process of determining how to reprioritize urls.
  • Process 600 may represent a process or a set of modules implementing functions, for example.
  • Process 600 includes receiving external data from various sources, analyzing changes in such data, and determining which feeds are thereby affected.
  • Process 600 includes receiving feed update notifications at module 610, receiving search query data at module 620 and receiving actual updated data at module 630. All of these modules feed into analysis of changes in data at module 640, where a determination is made as to how searches are changing, what types of feeds are becoming more active (and thus potentially more interesting to searchers) and what types of data are changing. This analysis results in checks on feeds of at least three different types in one embodiment.
  • feeds are searched for and tagged based on relevance to external data changes — the direction external data appears to be moving in.
  • feeds with recent updates are tagged, potentially in conjunction with some indication of relevance.
  • feeds not recently crawled are tagged, again potentially in conjunction with some indication of relevance.
  • the tagged feeds may then be prioritized into a list or reprioritized if already part of a list.
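  • The three kinds of tagging described above can be sketched as follows. The feed record layout, the tag names, the numeric timestamps, and the rule of ordering by number of tags are assumptions; the text only says that tagged feeds are then prioritized.

```python
def tag_feeds(feeds, trending_terms, now, recent_update_window, stale_after):
    """Return {url: set of tags} for feeds that match at least one check.

    Each feed is assumed to be a dict with 'url', 'topics' (a set),
    'last_updated' and 'last_crawled' (timestamps in the same unit as now).
    """
    tagged = {}
    for feed in feeds:
        tags = set()
        if feed["topics"] & trending_terms:
            tags.add("relevant_to_changes")
        if now - feed["last_updated"] <= recent_update_window:
            tags.add("recently_updated")
        if now - feed["last_crawled"] >= stale_after:
            tags.add("not_recently_crawled")
        if tags:
            tagged[feed["url"]] = tags
    return tagged


def prioritize(tagged):
    """Order urls by number of tags (assumed rule), then alphabetically."""
    return sorted(tagged, key=lambda url: (-len(tagged[url]), url))
```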
  • Another potential benefit of this method is for removal of pings that lead to urls that have spam feeds. Essentially feeds that lead to spam would also have a low crawling priority and so they would either be delayed or simply not crawled at all (depending on their priority and degree of spam as calculated by the spam filter).
  • the crawler may also mine known XML feeds and HTML sites to locate additional XML feeds for additional relevant content. Variations in the results of searches on blogs or feeds are typically temporal in nature. Any particular set of search results on any given topic is theoretically outdated the instant any person on the Internet publishes something new about that topic. In these cases, the crawler may respond by immediately crawling the new content to add the new content to the search results. Often, for more obscure search terms, and groups of terms there is a limited amount of search results available. When this happens for search terms in relation to popular events (such as breaking news), the crawler may recognize both a change in the search trends (increase in popularity of that particular search) and the lack of content available for results.
  • the response to this is to begin crawls to discover possible new content from the known list of feeds that would add to the existing search results.
  • This mechanism depends to some degree on keeping historical references to a baseline frequency of searches for certain topics and an average expected number of results. Then, as part of the prioritizing process, the total results for any given popular search can be queried directly from the API. If the number of results for a particular topic is not sufficient based on some previously defined threshold, then the crawler job server may find more urls to add to the crawling pool in relation to that particular search.
  • The database is queried for the known feeds on a topic, according to a pre-defined taxonomy, to find feeds that (1) have a relatively high post rate and (2) have not been crawled recently.
  • The crawler selects these urls to crawl, sorting them based on the average post rate, in an attempt to discover additional content to add to the search results.
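  • The threshold check and feed selection described above can be sketched as follows. The function names, the feed record layout, and the cut-off parameters `min_post_rate` and `min_age` are hypothetical; the text does not define what counts as a high post rate or as recent.

```python
def needs_more_content(result_count, threshold):
    """True when a popular search returns fewer results than the threshold."""
    return result_count < threshold


def select_feeds_to_mine(feeds, topic, now, min_post_rate, min_age):
    """Pick known feeds on a topic to re-crawl, highest post rate first.

    Each feed is assumed to be a dict with 'url', 'topics' (a set),
    'avg_post_rate' and 'last_crawled' (a timestamp comparable to now).
    """
    candidates = [
        feed for feed in feeds
        if topic in feed["topics"]
        and feed["avg_post_rate"] >= min_post_rate
        and now - feed["last_crawled"] >= min_age
    ]
    candidates.sort(key=lambda feed: feed["avg_post_rate"], reverse=True)
    return [feed["url"] for feed in candidates]
```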
  • This essentially allows the crawler to mine the data in a list of known XML feeds. However, it depends on having the feeds categorized via a combination of the relevancy engine and a potentially human-defined and -reviewed taxonomy.
  • the crawler can also examine a list of HTML sites to see if XML feeds can be discovered from sites that actually contain search term keywords within the domain name of the url. This is potentially a "last attempt" to mine data within the database for additional content.
  • the HTML list may be composed of sites that have been submitted for indexing (via a feed search engine's ping API, for example) and failed on previous crawls to have a corresponding feed found. These sites are saved and re-examined periodically, but in the case of actually needing additional search results this mechanism may increase the frequency with which certain HTML sites are rechecked for new feeds.
  • Throttling crawl frequency in response to search topic popularity may be used to increase the effectiveness of crawling in some embodiments. Aside from simply triggering mining of the database for a quantity of search results, there may be times when a search topic has a sustained period of activity, and thus would warrant repeated frequent crawls on XML feeds that had high relevance to that search. An example of this would be when random events become popular topics or breaking news, such as Hurricane Katrina, a Mars probe, or an unexpected political transformation, to name a few examples.
  • The crawler job server detects the appearance of a new term or combination of terms from the search api that either has not been previously searched or has not been searched for an extended period of time. In this case the crawler job server analyzes the searches to see if a growth pattern has begun to emerge in the frequency of this search. In response to determining a clear growth pattern over a short period of time for a particular search, the crawler then compiles a list of sites that have high relevancy to that search and performs repeated crawls on these sites at regular intervals to gather new content as quickly as possible. Likewise, as these searches subside, the relevant sites would be crawled less frequently.
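  • A simple way to picture the growth-pattern check and the resulting throttle is sketched below. The text does not define what a clear growth pattern is, so the rule used here (strictly rising counts that reach a minimum growth factor), the interval values, and the function names are assumptions.

```python
def is_growing(counts, min_growth=1.5):
    """Decide whether search counts show a growth pattern.

    counts: searches per consecutive time interval, oldest first.
    Assumed rule: at least three intervals, each higher than the last,
    and the newest at least min_growth times the oldest.
    """
    if len(counts) < 3:
        return False
    rising = all(later > earlier for earlier, later in zip(counts, counts[1:]))
    return rising and counts[-1] >= min_growth * max(counts[0], 1)


def crawl_interval(counts, base_interval=3600, fast_interval=300):
    """Seconds between crawls of sites relevant to the search."""
    return fast_interval if is_growing(counts) else base_interval
```

Once the counts stop rising, `crawl_interval` falls back to the base interval, which matches the idea that relevant sites are crawled less often as searches subside.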
  • Crawler network control can further be used to improve crawler effectiveness in some embodiments.
  • All of the inputs to the crawler job server are "post search"; meaning the input is always received after the search is performed. This causes the entire crawling system to be responsive to emerging trends of search patterns rather than have a pre-defined sequence of events or timings it must follow. This is useful to the design especially when dealing with feeds.
  • Other crawling systems search outward from a seed site and attempt to rank based on the content in relation to the seed site. This idea is potentially flawed when cataloging feeds because the calculated rank based on the content would change as soon as a new post is made to the feed. This makes the relevancy ranking nearly impossible to calculate.
  • a feed crawler is thus typically less of a traditional crawler and more of a search tool because of the way it responds to the input from the search api and relevancy engines. It starts with sites that were last known to have content most like what is being searched for, and then populates or updates the database with as much appropriate content for the current trend in searches as it can find.
  • the crawler job server is able to communicate to the individual machines in the crawler network. It is able to interrupt crawls on lower priority urls in order to recruit crawling resources to crawl more relevant higher priority urls. This allows the crawler to speed the gathering of results for searches that show a clear increase in popularity.
  • the crawler job server "balances" the work of crawling over the network of crawler machines. This allows the network to be composed of heterogeneous hardware, such that a job composed of several urls to be crawled can be handed to any machine in the network and the next job would not be handed to that same machine until it had reported back to the crawler job server a successful job completion, for example.
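  • A minimal sketch of the balancing rule described above, in which a machine receives its next job only after reporting completion of the previous one and higher-priority jobs are handed out first. The class and method names are hypothetical, and interruption of running low-priority crawls is not modeled.

```python
import heapq


class JobServer:
    """Hypothetical crawler job server: one outstanding job per machine."""

    def __init__(self, machines):
        self.idle = list(machines)  # machines with no outstanding job
        self.busy = {}              # machine -> urls it is crawling
        self.queue = []             # min-heap of (-priority, seq, urls)
        self.seq = 0                # tie-breaker keeps submission order

    def submit(self, urls, priority):
        heapq.heappush(self.queue, (-priority, self.seq, urls))
        self.seq += 1

    def dispatch(self):
        """Hand the highest-priority queued jobs to idle machines."""
        assigned = []
        while self.idle and self.queue:
            _, _, urls = heapq.heappop(self.queue)
            machine = self.idle.pop(0)
            self.busy[machine] = urls
            assigned.append((machine, urls))
        return assigned

    def report_done(self, machine):
        """A machine becomes eligible again only after it reports success."""
        self.busy.pop(machine)
        self.idle.append(machine)
```

  • Because a job is only ever handed to an idle machine, slower hardware simply takes fewer jobs, which is one way a heterogeneous network could be accommodated.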
  • crawl notification streaming can be useful in some embodiments.
  • Search engine partners may want to know precisely what urls have been recently crawled for various purposes. Because of the complexities in choosing what urls get crawled and in which priority, this can be difficult.
  • the generic parallelized crawlers may broadcast to the internal network via UDP the results of successful crawls in an embodiment. These broadcasts can be picked up by a crawl-streaming server, which then assembles them as received into a stateless spool that can be listened to via a network port on a TCP connection.
  • a broadcast can also be used to maintain information about which urls are successfully crawled and which are not, thereby providing data about which urls are still active, for example.
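  • The UDP broadcast of crawl results might look like the sketch below. The JSON message layout, the port number and the function names are assumptions made for illustration; the text does not specify a wire format.

```python
import json
import socket


def encode_result(url, ok):
    """Serialize one crawl result as a small JSON datagram (assumed format)."""
    return json.dumps({"url": url, "ok": ok}).encode("utf-8")


def decode_result(datagram):
    """Inverse of encode_result, as a crawl-streaming server might use it."""
    msg = json.loads(datagram.decode("utf-8"))
    return msg["url"], msg["ok"]


def broadcast_result(url, ok, port=9999, address="255.255.255.255"):
    """Broadcast a crawl result on the internal network via UDP."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_BROADCAST, 1)
        sock.sendto(encode_result(url, ok), (address, port))
```

  • Including the success flag in each datagram lets a listener track both recently crawled urls and urls that failed, as described above.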
  • Mapping crawler statistical trends on topics for advertising targeting can also be a useful component of a crawler in some embodiments.
  • the crawler is the earliest possible single component that has the ability to calculate priority and changes in search topic trends over time.
  • it may be logical for the same data that is actively being calculated and stored by the crawler job server to also be available to the targeting system of an ad server. This can be accomplished very simply by either direct database interface or by UDP internal broadcast of data to the ad server. That data may then be used to select and cache groups of ads appropriate to deliver with weighted frequency on pages that display search results or directly within subscribed RSS feeds of search results.
  • FIG. 7 illustrates an embodiment of a network which may be used in conjunction with embodiments of a feed crawling system.
  • FIG. 8 illustrates an embodiment of a machine which may be used in conjunction with embodiments of a feed crawling system.
  • the following description of FIGs. 7-8 is intended to provide an overview of device hardware and other operating components suitable for performing the methods of the invention described above and hereafter, but is not intended to limit the applicable environments. Similarly, the hardware and other operating components may be suitable as part of the apparatuses described above.
  • the invention can be practiced with other system configurations, including personal computers, multiprocessor systems, microprocessor-based or programmable consumer electronics, network PCs, minicomputers, mainframe computers, and the like.
  • the invention can also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network.
  • FIG. 7 shows several computer systems that are coupled together through a network 705, such as the internet, along with a cellular or other wireless network and related cellular or other wireless devices.
  • the term "internet" as used herein refers to a network of networks which uses certain protocols, such as the TCP/IP protocol, and possibly other protocols such as the hypertext transfer protocol (HTTP) for hypertext markup language (HTML) documents that make up the world wide web (web).
  • the physical connections of the internet and the protocols and communication procedures of the internet are well known to those of skill in the art.
  • Access to the internet 705 is typically provided by internet service providers (ISP), such as the ISPs 710 and 715.
  • client computer systems 730, 750, and 760 obtain access to the internet through the internet service providers, such as ISPs 710 and 715.
  • Access to the internet allows users of the client computer systems to exchange information, receive and send e-mails, and view documents, such as documents which have been prepared in the HTML format.
  • These documents are often provided by web servers, such as web server 720 which is considered to be "on" the internet.
  • web servers are provided by the ISPs, such as ISP 710, although a computer system can be set up and connected to the internet without that system also being an ISP.
  • the web server 720 is typically at least one computer system which operates as a server computer system and is configured to operate with the protocols of the world wide web and is coupled to the internet.
  • the web server 720 can be part of an ISP which provides access to the internet for client systems.
  • the web server 720 is shown coupled to the server computer system 725 which itself is coupled to web content 795, which can be considered a form of a media database. While two computer systems 720 and 725 are shown in FIG. 7, the web server system 720 and the server computer system 725 can be one computer system having different software components providing the web server functionality and the server functionality provided by the server computer system 725 which will be described further below.
  • Cellular network interface 743 provides an interface between a cellular network and corresponding cellular devices 744, 746 and 748 on one side, and network 705 on the other side.
  • cellular devices 744, 746 and 748 which may be personal devices including cellular telephones, two-way pagers, personal digital assistants or other similar devices, may connect with network 705 and exchange information such as email, content, or HTTP-formatted data, for example.
  • Cellular network interface 743 is representative of wireless networking in general. In various embodiments, such an interface may also be implemented as a wireless interface such as a Bluetooth interface, IEEE 802.11 interface, or some other form of wireless network. Similarly, devices such as devices 744, 746 and 748 may be implemented to communicate via the Bluetooth or 802.11 protocols, for example. Other dedicated wireless networks may also be implemented in a similar fashion.
  • Cellular network interface 743 is coupled to computer 740, which communicates with network 705 through modem interface 745.
  • Computer 740 may be a personal computer, server computer or the like, and serves as a gateway. Thus, computer 740 may be similar to client computers 750 and 760 or to gateway computer 775, for example. Software or content may then be uploaded or downloaded through the connection provided by interface 743, computer 740 and modem 745.
  • Client computer systems 730, 750, and 760 can each, with the appropriate web browsing software, view HTML pages provided by the web server 720.
  • the ISP 710 provides internet connectivity to the client computer system 730 through the modem interface 735 which can be considered part of the client computer system 730.
  • the client computer system can be a personal computer system, a network computer, a web tv system, or other such computer system.
  • the ISP 715 provides internet connectivity for client systems 750 and 760, although as shown in FIG. 7, the connections are not the same as for more directly connected computer systems.
  • Client computer systems 750 and 760 are part of a LAN coupled through a gateway computer 775. While FIG.
  • each of these interfaces can be an analog modem, isdn modem, cable modem, satellite transmission interface (e.g. "direct PC"), or other interfaces for coupling a computer system to other computer systems.
  • Client computer systems 750 and 760 are coupled to a LAN 770 through network interfaces 755 and 765, which can be ethernet network or other network interfaces.
  • the LAN 770 is also coupled to a gateway computer system 775 which can provide firewall and other internet related services for the local area network.
  • This gateway computer system 775 is coupled to the ISP 715 to provide internet connectivity to the client computer systems 750 and 760.
  • the gateway computer system 775 can be a conventional server computer system.
  • the web server system 720 can be a conventional server computer system.
  • a server computer system 780 can be directly coupled to the LAN
  • FIG. 8 shows one example of a personal device that can be used as a cellular telephone (744, 746 or 748) or similar personal device, or may be used as a more conventional personal computer, as an embedded processor or local console, or as a PDA, for example.
  • a device can be used to perform many functions depending on implementation, such as monitoring functions, user interface functions, telephone communications, two-way pager communications, personal organizing, or similar functions.
  • the system 800 of FIG. 8 may also be used to implement other devices such as a personal computer, network computer, or other similar systems.
  • the computer system 800 interfaces to external systems through the communications interface 820.
  • this interface is typically a radio interface for communication with a cellular network, and may also include some form of cabled interface for use with an immediately available personal computer.
  • the communications interface 820 is typically a radio interface for communication with a data transmission network, but may similarly include a cabled or cradled interface as well.
  • communications interface 820 typically includes a cradled or cabled interface, and may also include some form of radio interface such as a Bluetooth or 802.11 interface, or a cellular radio interface for example.
  • the computer system 800 includes a processor 810, which can be a conventional microprocessor such as an Intel Pentium microprocessor or Motorola PowerPC microprocessor, a Texas Instruments digital signal processor, or some combination of the various types of processors.
  • Memory 840 is coupled to the processor 810 by a bus 870.
  • Memory 840 can be dynamic random access memory (DRAM) and can also include static RAM (SRAM), or may include FLASH EEPROM, too.
  • the bus 870 couples the processor 810 to the memory 840, also to non-volatile storage 850, to display controller 830, and to the input/output (I/O) controller 860. Note that the display controller 830 and I/O controller 860 may be integrated together, and the display may also provide input.
  • the display controller 830 controls in the conventional manner a display on a display device 835 which typically is a liquid crystal display (LCD) or similar flat-panel, small form factor display.
  • the input/output devices 855 can include a keyboard, or stylus and touchscreen, and may sometimes be extended to include disk drives, printers, a scanner, and other input and output devices, including a mouse or other pointing device.
  • the display controller 830 and the I/O controller 860 can be implemented with conventional well known technology.
  • a digital image input device 865 can be a digital camera which is coupled to an I/O controller 860 in order to allow images from the digital camera to be input into the device 800.
  • the non-volatile storage 850 is often a FLASH memory or read-only memory, or some combination of the two.
  • a magnetic hard disk, an optical disk, or another form of storage for large amounts of data may also be used in some embodiments, though the form factors for such devices typically preclude installation as a permanent component of the device 800. Rather, a mass storage device on another computer is typically used in conjunction with the more limited storage of the device 800. Some of this data is often written, by a direct memory access process, into memory 840 during execution of software in the device 800.
  • the term "machine-readable medium" or "computer-readable medium" includes any type of storage device that is accessible by the processor 810 and also encompasses a carrier wave that encodes a data signal.
  • the device 800 is one example of many possible devices which have different architectures.
  • devices based on an Intel microprocessor often have multiple buses, one of which can be an input/output (I/O) bus for the peripherals and one that directly connects the processor 810 and the memory 840 (often referred to as a memory bus).
  • the buses are connected together through bridge components that perform any necessary translation due to differing bus protocols.
  • the device 800 is controlled by operating system software which includes a file management system, such as a disk operating system, which is part of the operating system software.
  • one example of operating system software with its associated file management system software is the family of operating systems known as Windows CE® and Windows® from Microsoft Corporation of Redmond, Washington, and their associated file management systems.
  • Another example of an operating system software with its associated file management system software is the Palm® operating system and its associated file management system.
  • the file management system is typically stored in the non-volatile storage 850 and causes the processor 810 to execute the various acts required by the operating system to input and output data and to store data in memory, including storing files on the non-volatile storage 850.
  • Other operating systems may be provided by makers of devices, and those operating systems typically will have device-specific features which are not part of similar operating systems on similar devices.
  • WinCE® or Palm® operating systems may be adapted to specific devices for specific device capabilities.
  • Device 800 may be integrated onto a single chip or set of chips in some embodiments, and typically is fitted into a small form factor for use as a personal device. Thus, it is not uncommon for a processor, bus, onboard memory, and display/I-O controllers to all be integrated onto a single chip. Alternatively, functions may be split into several chips with point-to-point interconnection, causing the bus to be logically apparent but not physically obvious from inspection of either the actual device or related schematics.
  • Embodiments of the above system, method and computer program product may advantageously implement a spam filter.
  • the spam filter in the above described embodiments is not limited to any particular spam filter and a variety of alternatives known in the art or to be developed may be utilized.
  • a particular spam filter and method for spam filtering may be utilized, embodiments of which are described hereinafter.
  • a system, method and apparatus is provided for a feed spam filter.
  • the specific embodiments described in this document represent examples or embodiments of the present invention, and are illustrative in nature rather than restrictive.
  • the feed spam filter builds on the basic Bayesian filtering technique, yet it differs from current spam solutions as it incorporates a complex collection of features that identify spam especially in feeds.
  • the feed spam filter builds on well described approaches to minimize the integration into a search index of irrelevant and somewhat deceitful use of internet content feeds for public consumption.
  • the approach recognizes that feeds are richer in content than the emails targeted by similar filtering technologies, and utilizes the structure of the internet as well as the HTML that makes up a web site in order to identify feeds that are not desirable for a public search engine to be indexing.
  • the filter contains context based flexibilities, threshold based flexibilities and multi-environment flexibilities to make it useful and applicable to a range of tasks for a feed search engine or similar system.
  • the basic design of the spam filter involves a Bayesian Engine used to:
  • receive an XML based feed and feed origin web site information
  • the design can be enhanced to produce a Bayesian Net arrangement on the probes where the probes can be run:
  • the spam filter is designed to work in multiple environments.
  • the filter can be incorporated into the crawler architecture of a feed search engine, but can also run in a standalone mode. If environment-specific adapters are provided, the spam filter can take input from:
  • historical tables
  • the activities of the spam filter in multiple tasks may include: Crawler feed filtering, historical url filtering and providing diagnostics for blacklisting.
  • the filter may take information in a range of forms.
  • Filtering spam from feeds differs from filtering email because:
  • Feeds have many more marked up fields or tags to analyze in context (e.g. Author, title, publishDate, link).
  • Emails are not generally searched through on a public forum or visible to the internet viewing public as a whole
  • Email spam is generally targeted around a subject whereas Weblog based spam is about enticing the user to click on the site, not necessarily be engaged by the content on the site
  • Feed spam filter 1100 includes a basic Bayes evaluator, a feed word hash tokenizer, a crawler input, and a series of probes to be applied to a feed.
  • Crawler input 1110 is an input to the feed spam filter, providing new feed data.
  • Bayes evaluator 1120 evaluates results of probes 1150-1180, and can work with various different probes as needed - allowing for probes to be swapped in and out or activated and deactivated as necessary.
  • Feed word hash tokenizer 1130 tokenizes data from a feed for easier processing.
  • Probes 1150-1180 probe the feed data by performing various tests on the feed data or related information.
  • keyword probe 1150 may probe for keywords in the feed.
  • URL probe 1155 may probe a url provided as a source of the feed.
  • Feedster on probe 1160 may probe whether a search engine is operating — and thus provide an indication of whether a failure has occurred.
  • ZipF probe 1165 may probe whether the data in the feed fits a statistical model of other spam feeds.
  • Uncommon probe 1170 may similarly probe whether a feed uses uncommon words and thus may indicate spam.
  • Photoblog probe 1175 may probe whether a high proportion of the data in the feed is images rather than words, for example.
  • Reblogger probe 1180 may probe whether the data looks more like a blog or like advertising, for example. More detail is provided later on various types of feed probes, as well.
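  • One way to picture an evaluator with swappable probes is sketched below. It assumes each probe returns a spam probability and that probes are combined with a naive Bayes rule (independent probes, equal priors); the text does not state the combination rule, and the two probe functions, their return values and the hotlist are hypothetical.

```python
import math

HOT_TERMS = {"casino", "pills"}  # hypothetical hotlist


def keyword_probe(feed):
    """Toy keyword probe: high probability if any hotlist term appears."""
    words = feed["text"].lower().split()
    return 0.9 if any(w in HOT_TERMS for w in words) else 0.2


def photoblog_probe(feed):
    """Toy photoblog probe: more images than words looks suspicious."""
    return 0.7 if feed["images"] > len(feed["text"].split()) else 0.4


def combine(probabilities, eps=1e-6):
    """Naive Bayes combination, computed in log-odds for stability."""
    log_odds = 0.0
    for p in probabilities:
        p = min(max(p, eps), 1 - eps)  # keep log() finite
        log_odds += math.log(p) - math.log(1 - p)
    return 1.0 / (1.0 + math.exp(-log_odds))


class BayesEvaluator:
    """Holds a set of probes that can be activated or deactivated."""

    def __init__(self):
        self.probes = {}  # name -> callable(feed) -> probability

    def activate(self, name, probe):
        self.probes[name] = probe

    def deactivate(self, name):
        self.probes.pop(name, None)

    def score(self, feed):
        return combine([probe(feed) for probe in self.probes.values()])
```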
  • FIG. 10 illustrates another embodiment of a feed spam filter.
  • Feed spam filter 1200 includes sources of data, a Bayes evaluator, a tokenizer, a language library, and a network of probes.
  • Crawler 1205 and delta indexer 1210 provide data sources.
  • Crawler 1205 provides updated feed data from crawling the web.
  • Delta indexer 1210 provides updated data from feed updates received based on a difference (delta) between old and new feed data.
  • Bayes evaluator 1225 provides the feed data to tokenizer 1220 for processing and provides the tokenized data to probes along with data from language libraries 1215 to allow for proper processing of feed data.
  • Absolute spam test 1235 may be a simple test for basic spam signals under industry standards, for example.
  • the probe network may then branch out.
  • reblogger test 1240 and zipf probe 1230 may be applied.
  • photoblog test 1245 and feedster on test 1275 may be applied.
  • uncommon word test 1270 may also be applied. These tests may feed data to other tests and affect other tests and probes as illustrated within the network.
  • additional test 1250 and additional test 1255 represent expansion or optional tests which may be incorporated, either to adapt to new spam techniques or to accommodate specific types of feeds, for example.
  • the spam filter looks for 'unnatural' distributions of words in a feed. Results are generally more reliable for larger feeds (more words) so a weighting formula may be applied to revenue generating terms and hotlist terms.
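  • A check for 'unnatural' word distributions could, for instance, compare a feed's rank-frequency curve with Zipf's law, under which the log of word frequency falls roughly linearly with the log of rank (slope near -1 for large natural-language samples). The sketch below fits that slope by least squares; the expected slope and tolerance are assumptions, and the weighting for small feeds mentioned above is not modeled.

```python
import math
from collections import Counter


def zipf_slope(text):
    """Least-squares slope of log(frequency) against log(rank)."""
    counts = sorted(Counter(text.lower().split()).values(), reverse=True)
    if len(counts) < 2:
        return 0.0  # a slope needs at least two distinct words
    xs = [math.log(rank) for rank in range(1, len(counts) + 1)]
    ys = [math.log(c) for c in counts]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    return cov / var


def looks_unnatural(text, expected=-1.0, tolerance=0.5):
    """Flag text whose slope is far from the assumed natural value."""
    return abs(zipf_slope(text) - expected) > tolerance
```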
  • the spam filter pays attention to terms that are used by spammers to generate revenue.
  • the terms are weighted according to currently published rates that search engines will pay a site when an advertisement on the site, placed there by the search engine, is followed by the browser user.
  • a list of terms specified by a user can also be used to trip the spam filter. These terms can be altered dynamically.
  • URLs for spam tend to follow the 'term loaded' pattern that highlights subdomains matching document folders. This arises because spammers generate these sites rather than creating them manually, so that a central key term is distributed throughout the site and the url for that site.
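  • As an illustration of the 'term loaded' pattern, the sketch below reports terms that appear both in a url's host name and in its path. The splitting rules and the ignore list are assumptions; a real probe would presumably feed this signal into the evaluator rather than use it alone.

```python
import re
from urllib.parse import urlparse

IGNORED = {"www", "com", "net", "org", ""}  # hypothetical ignore list


def term_loaded_terms(url):
    """Return terms repeated between the host (subdomains) and the path."""
    parsed = urlparse(url)
    host_terms = set(re.split(r"[.\-]", (parsed.hostname or "").lower()))
    path_terms = {t for t in re.split(r"[/\-_.]", parsed.path.lower()) if t}
    return sorted((host_terms - IGNORED) & path_terms)


print(term_loaded_terms("http://cheap-loans.example.com/loans/cheap/index.html"))
print(term_loaded_terms("http://blog.example.org/2007/09/post.html"))
```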
  • Another form of analysis is reblogging analysis. A common way to generate many spam weblogs en masse is to capture feeds from other sites and redisplay the information as their own. This activity may be referred to as ReBlogging.
  • the spam filter uses a search engine to establish if the majority of posts on a blog are in fact owned by the author — or come from another feed or site.
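  • The reblogging check can be pictured as computing the fraction of a feed's posts that a search engine also finds on other sites. In the sketch below, `find_sources` stands in for that search engine lookup and is purely hypothetical, as is the simplification that any other site counts as the original source.

```python
def reblogged_fraction(posts, feed_site, find_sources):
    """Fraction of posts that also appear on a site other than feed_site.

    `find_sources(post)` is a hypothetical lookup returning the sites on
    which a search engine found the post's content.
    """
    if not posts:
        return 0.0
    copied = sum(
        1 for post in posts
        if any(site != feed_site for site in find_sources(post))
    )
    return copied / len(posts)


def mostly_reblogged(posts, feed_site, find_sources):
    """True when the majority of posts appear to come from elsewhere."""
    return reblogged_fraction(posts, feed_site, find_sources) > 0.5
```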
  • feed origin website analysis is another form of analysis.
  • the layout of the original site of a feed can provide more clues in terms of spam than the feed can.
  • the feed itself is syndicated content from the site, whereas the site itself, with its layout, advertising content and link structure providing navigation to other sites, is a rich source of information. Combining key term results with site structure, layout and markup can provide powerful analysis not available in the email sphere.
  • Tag-based weighting is another form of analysis. Certain feed fields provide places where spammers can load inappropriate terms into the feed so that search engines choose these feeds for display.
  • the spam filter examines information from each feed field and places extra emphasis on certain fields that are important in the context of the analysis.
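  • Tag-based weighting can be sketched as a per-field multiplier applied to key term hits. The field names echo the tags mentioned earlier, but the weights themselves are invented for illustration.

```python
# Hypothetical per-field weights: terms in the title count three times
# as much as terms in the description.
FIELD_WEIGHTS = {"title": 3.0, "author": 2.0, "description": 1.0}


def weighted_term_score(feed_fields, terms):
    """Sum key term hits across feed fields, scaled by each field's weight."""
    total = 0.0
    for field, text in feed_fields.items():
        weight = FIELD_WEIGHTS.get(field, 1.0)  # unknown fields count once
        hits = sum(1 for word in text.lower().split() if word in terms)
        total += weight * hits
    return total


fields = {"title": "Casino casino bonus", "description": "visit the casino"}
print(weighted_term_score(fields, {"casino"}))
```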
  • feed weightings may also provide analysis opportunities.
  • feeds fall short in word count compared to the standard corpus of literature that mainstream natural language processing algorithms typically use.
  • results are typically unreliable for key term analysis.
  • feeds of more than around 300 words are unreliable for 'paying key term analysis' as previously mentioned.
  • This formula is potentially open to training, meaning it may be tweaked to match human based feedback or software feedback on spam.
  • Post activity may also allow for analysis of feeds. Spammers in the feed world post often and often change the content of the feed on the fly (alternating title text, for example) to confuse spam filters. Random monitoring of feed activity that integrates with the Bayesian net filter will potentially flag sites where this type of activity occurs.
  • Client-based spam thresholding may also be used to adjust the spam filter: a score may be compared against a flexible limit rather than a hard-coded threshold to indicate that a feed is spam. As a Bayesian filter, the spam filter returns a raw score that usually, but not necessarily, falls within the 0-1 range. Hence the threshold at which a feed is considered spam is configurable. Client specific and context specific threshold setting is possible based on what the feed is about, how popular it is, topical world events, topical blogosphere events and user feedback. This allows for user-based customization, among other options.
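  • A configurable threshold of this kind might be looked up by client and context, falling back to a context-wide value and then to a default. The table contents, the client name "acme" and the choice that a score at or above the threshold means spam are all assumptions of this sketch.

```python
DEFAULT_THRESHOLD = 0.5

# Hypothetical overrides keyed by (client, context); None means any client.
THRESHOLDS = {
    ("acme", "real_estate"): 0.8,
    (None, "real_estate"): 0.7,
}


def threshold_for(client=None, context=None):
    """Most specific match wins: client+context, then context, then default."""
    for key in ((client, context), (None, context)):
        if key in THRESHOLDS:
            return THRESHOLDS[key]
    return DEFAULT_THRESHOLD


def is_spam(score, client=None, context=None):
    """Compare a raw filter score against the applicable threshold."""
    return score >= threshold_for(client, context)
```

  • Because the raw score is compared rather than clamped, scores that fall outside the 0-1 range are still handled.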
  • the spam filter is set to run the standard set of available probes against incoming and historical feeds.
  • the design of the filter is such that the context of the feed could provide the setting for a pre-designated range of probes to run.
  • the filter can be 'smart' in knowing which types of feeds are prone to particular types of spam.
  • Additional analysis may be based on phrase or semantic analysis. This may be implemented in a separate stage of spam filtering. Alternatively, this may be implemented as a probe or set of probes, along with other types of analysis. In such an instance, the probes may then have relationships with other probes and types of analysis previously discussed.
  • a spam filter may thus be understood to accept some basic inputs and provide an indication of whether a feed is spam.
  • FIG. 11 illustrates an embodiment of a feed spam filter with inputs and outputs.
  • System 1300 includes a spam filter 1310 and associated inputs and outputs.
  • Filter 1310 accepts as input a feed url, a feed type hash, and plain text or other feed data, for example.
  • Filter 1310 then may provide a score (scaled between 0 and 1 in some embodiments) indicating the level of spam in a feed or in a binary sense whether a feed is or is not spam. Thus, the score may provide an indication of whether to exclude the feed or not.
  • a feed may provoke such extreme reactions from various probes that the score produced is outside the expected range.
  • the Bayesian filter may be expected to sum results from various probes in a predetermined way, but need not be constrained to exactly meet an expected score range. One may expect that the Bayesian filter implemented may adapt to inputs and results over time, through machine-learning techniques for example, along with external feedback.
  • a system using a spam filter may involve a crawler to gather feed data, a spam filter, a user interface, and a repository, among other components.
  • FIG. 12 illustrates an embodiment of a system including a feed spam filter.
  • System 1400 includes a crawler, spam filter, spam filter user interface, crawler spam marker, database cleanser, spam blacklister, and a database or repository.
  • Crawler 1430 may crawl the world wide web seeking feeds and updated feeds.
  • Spam filter 1460 may receive feed data from crawler 1430 and provide an indication of whether a feed is spam or not.
  • Spam management user interface 1470 may be used by a user to provide feedback on whether identified spam is actually spam, with spam marked or unmarked, filter data presented, and blacklisting facilities provided, for example.
  • Crawler adaptor 1420 may mark spam within a database 1410 based on results from spam filter 1460 and spam management user interface 1470.
  • Database cleanser 1450 may then cleanse database 1410 of marked spam, and may also cleanse database 1410 of spam based on results from spam filter 1460, whether the indicated spam is marked or not.
  • blacklister 1440 may query database 1410 for data fitting blacklist parameters, and may also compare data of database 1410 with known blacklisted data to present blacklist candidates in the user interface 1470.
  • Process 1500 includes receiving a feed update, processing feed data through a spam filter, evaluating results of the spam filter, determining if the data is ok, either passing the data along or flagging the feed as spam, and feeding results back.
  • Process 1500 and other processes of this document are implemented as a set of modules, which may be process modules or operations, software modules with associated functions or effects, hardware modules designed to fulfill the process operations, or some combination of the various types of modules, for example.
  • Process 1500 initiates with receipt of a feed update at module 1510. This may be a true update, or may be the first data received for a feed, for example.
  • the data is processed through the spam filter at module 1520.
  • results of the spam filter are evaluated. This may include looking up a spam threshold score for a type of feed, and comparing the threshold to the actual score, for example. If the data is ok (the score compares favorably to the threshold), then at module
  • the data is passed along for storage in a repository, searching and display. If the score compares unfavorably, then the data is flagged as spam at module 1560. Either way, the results are fed back into the system at module 1570 and the process repeats. Note that such feedback may include automatic feedback based on the process results, and user feedback such as modifications to results, review of test results, and addition or subtraction from a black list of spam feeds, for example.
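  • The flow of process 1500 can be summarized in a few lines. In this sketch the spam filter is any callable returning a score, the repository and feedback log are plain lists, and a score below the threshold is taken to "compare favorably"; these are simplifying assumptions, not details from the text.

```python
def process_update(feed, spam_filter, threshold, repository, feedback_log):
    """Run one feed update through filtering, evaluation and feedback."""
    score = spam_filter(feed)            # process data through the filter
    ok = score < threshold               # evaluate the filter result
    if ok:
        repository.append(feed)          # pass along for storage and search
    else:
        feed["spam"] = True              # flag the feed as spam
    feedback_log.append((feed["url"], score, ok))  # feed results back
    return ok


repository, feedback_log = [], []
good = {"url": "http://a.example/feed"}
bad = {"url": "http://b.example/feed"}
process_update(good, lambda f: 0.2, 0.5, repository, feedback_log)
process_update(bad, lambda f: 0.9, 0.5, repository, feedback_log)
print(len(repository), bad["spam"], len(feedback_log))
```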
  • FIG. 14 illustrates another embodiment of a process of filtering spam feeds.
  • Process 1600 includes receiving a feed update, determining a type of feed, running probes against the feed, scoring results, receiving feedback, and adjusting probes.
  • the spam filter may be in a more or less continuous process of evaluating and adjusting, while feeds update on an asynchronous basis.
  • Data is received at module 1610, such as when a feed updates and either is found with new data or pushes new data, for example.
  • the type of feed is evaluated at module 1620, such that appropriate probes may be run against the feed or appropriate tests may be performed.
  • the actual probes are run, potentially with internal probe adjustments for the type of feed as well, and also with any potential probe interchange occurring.
  • results of the probes are scored through a Bayesian evaluator.
  • feedback is received based on the evaluation and the probe results. This may be optional - there may be no feedback. Such feedback may be automatic within the system or user-generated. Such feedback may then result in adjustments to the probes (or scoring methodology) at module 1660.
  • feedback at module 1660 of process 1600 or at module 1570 of process 1500 can take on a variety of forms, and allows the system to implement a machine-learning approach to improvement, among other adaptations.
  • feedback may result in changes in the values of equations discussed within this document, tuning the approach of various probes in the process.
  • the structure of the probes and linkages may change as a result of feedback, such that non-linear structures may be achieved, and relationships between probes may become more complicated over time.
  • the feedback is likely to make changes over time, but to continue to provide a filter which attempts to evaluate factors of a complex, non-linear relationship between word count, content, and form of a feed to determine the presence of spam.
  • an example of a feed spam filter may be illuminating.
  • a feed is received by a feed search engine and the content is determined to be a Real Estate feed.
  • the spam filter would know through an ontology hierarchy that Real Estate feeds tend to have lots of pictures, very little text and particular key term loadings. Hence probes that compensate for these characteristics should be used, and perhaps the threshold for spam may differ from that for other contexts (e.g. a literary weblog).
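  • Selecting probes through an ontology hierarchy could work by walking from a feed's type up through its parent categories until a configured probe set is found. The hierarchy, category names and probe lists below are invented for the sketch.

```python
# Hypothetical ontology: each feed type points to its parent category.
PARENT = {
    "real_estate": "commerce",
    "auto_sales": "commerce",
    "commerce": None,
    "poetry": "literature",
    "literature": None,
}

# Hypothetical probe sets; real estate omits the photoblog probe because
# picture-heavy feeds are expected in that context.
PROBE_SETS = {
    "real_estate": ["url", "reblogger", "keyword"],
    "commerce": ["url", "keyword", "zipf"],
}
DEFAULT_PROBES = ["keyword", "url", "zipf", "uncommon", "photoblog", "reblogger"]


def probes_for(feed_type):
    """Return the probe set of the nearest ancestor that defines one."""
    node = feed_type
    while node is not None:
        if node in PROBE_SETS:
            return PROBE_SETS[node]
        node = PARENT.get(node)
    return DEFAULT_PROBES
```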
  • Specialist client specific probes could be built and run for specific feed search engine products that deal with contexts and feeds specific to the client's realm of interest.
  • the spam filter will involve some form of machine learning, potentially based on human user and ontology based training, for example.
  • the weightings and structure of the spam filter may (and probably should) be adjusted to concur with real world experience.
  • the factors that drive all the features described above are able to be adjusted based on verification feedback that is provided to the spam filter.
  • This means that the weightings and constants involved in formulae must be stored in a configurable datastore so these figures can change dynamically based on verification feedback from both human and software driven sources.
  • FIGs. 15-16 are intended to provide an overview of device hardware and other operating components suitable for performing the methods of the invention described above and hereafter, but are not intended to limit the applicable environments. Similarly, the hardware and other operating components may be suitable as part of the apparatuses described above.
  • the invention can be practiced with other system configurations, including personal computers, multiprocessor systems, microprocessor-based or programmable consumer electronics, network PCs, minicomputers, mainframe computers, and the like.
  • the invention can also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network.
  • FIG. 15 illustrates an embodiment of a system or network in which a feed spam filter may operate.
  • FIG. 16 illustrates an embodiment of a system which may operate with a feed spam filter.
  • FIG. 15 shows several computer systems that are coupled together through a network 1705, such as the internet, along with a cellular or other wireless network and related cellular or other wireless devices.
  • the term "internet" as used herein refers to a network of networks which uses certain protocols, such as the TCP/IP protocol, and possibly other protocols such as the hypertext transfer protocol (HTTP) for hypertext markup language (HTML) documents that make up the world wide web (web).
  • client systems such as client computer systems 1730, 1750, and 1760 obtain access to the internet through the internet service providers, such as ISPs
  • the web server 1720 is typically at least one computer system which operates as a server computer system and is configured to operate with the protocols of the world wide web and is coupled to the internet.
  • the web server 1720 can be part of an ISP which provides access to the internet for client systems.
  • the web server 1720 is shown coupled to the server computer system 1725 which itself is coupled to web content 1795, which can be considered a form of a media database. While two computer systems 1720 and 1725 are shown in FIG. 15, the web server system 1720 and the server computer system 1725 can be one computer system having different software components providing the web server functionality and the server functionality provided by the server computer system 1725 which will be described further below.
  • Cellular network interface 1743 provides an interface between a cellular network and corresponding cellular devices 1744, 1746 and 1748 on one side, and network 1705 on the other side.
  • cellular devices 1744, 1746 and 1748, which may be personal devices including cellular telephones, two-way pagers, personal digital assistants or other similar devices, may connect with network 1705 and exchange information such as email, content, or HTTP-formatted data, for example.
  • Cellular network interface 1743 is representative of wireless networking in general. In various embodiments, such an interface may also be implemented as a wireless interface such as a Bluetooth interface, IEEE 802.11 interface, or some other form of wireless network. Similarly, devices such as devices 1744, 1746 and 1748 may be implemented to communicate via the Bluetooth or 802.11 protocols, for example. Other dedicated wireless networks may also be implemented in a similar fashion.
  • Cellular network interface 1743 is coupled to computer 1740, which communicates with network 1705 through modem interface 1745.
  • Computer 1740 may be a personal computer, server computer or the like, and serves as a gateway. Thus, computer 1740 may be similar to client computers 1750 and 1760 or to gateway computer 1775, for example. Software or content may then be uploaded or downloaded through the connection provided by interface 1743, computer 1740 and modem 1745.
  • Client computer systems 1730, 1750, and 1760 can each, with the appropriate web browsing software, view HTML pages provided by the web server 1720.
  • the ISP 1710 provides internet connectivity to the client computer system 1730 through the modem interface 1735 which can be considered part of the client computer system 1730.
  • the client computer system can be a personal computer system, a network computer, a web tv system, or other such computer system.
  • the ISP 1715 provides internet connectivity for client systems 1750 and 1760, although as shown in FIG. 15, the connections are not the same as for more directly connected computer systems.
  • Client computer systems 1750 and 1760 are part of a LAN coupled through a gateway computer 1775.
  • Although FIG. 15 shows the interfaces 1735 and 1745 generically as a "modem," each of these interfaces can be an analog modem, ISDN modem, cable modem, satellite transmission interface (e.g. "direct PC"), or other interface for coupling a computer system to other computer systems.
  • Client computer systems 1750 and 1760 are coupled to a LAN 1770 through network interfaces 1755 and 1765, which can be ethernet network or other network interfaces.
  • the LAN 1770 is also coupled to a gateway computer system 1775 which can provide firewall and other internet related services for the local area network.
  • This gateway computer system 1775 is coupled to the ISP 1715 to provide internet connectivity to the client computer systems 1750 and 1760.
  • the gateway computer system 1775 can be a conventional server computer system.
  • the web server system 1720 can be a conventional server computer system.
  • FIG. 16 shows one example of a personal device that can be used as a cellular telephone (1744, 1746 or 1748) or similar personal device, or may be used as a more conventional personal computer, as an embedded processor or local console, or as a PDA, for example.
  • a personal device can be used to perform many functions depending on implementation, such as monitoring functions, user interface functions, telephone communications, two-way pager communications, personal organizing, or similar functions.
  • the computer system 1800 interfaces to external systems through the communications interface 1820.
  • this interface is typically a radio interface for communication with a cellular network, and may also include some form of cabled interface for use with an immediately available personal computer.
  • the communications interface 1820 is typically a radio interface for communication with a data transmission network, but may similarly include a cabled or cradled interface as well.
  • communications interface 1820 typically includes a cradled or cabled interface, and may also include some form of radio interface such as a Bluetooth or 802.11 interface, or a cellular radio interface for example.
  • the computer system 1800 includes a processor 1810, which can be a conventional microprocessor such as an Intel Pentium microprocessor or Motorola PowerPC microprocessor, a Texas Instruments digital signal processor, or some combination of the various types of processors.
  • Memory 1840 is coupled to the processor 1810 by a bus 1870.
  • Memory 1840 can be dynamic random access memory (DRAM) and can also include static RAM (SRAM), or may include FLASH EEPROM, too.
  • the bus 1870 couples the processor 1810 to the memory 1840, also to non-volatile storage 1850, to display controller 1830, and to the input/output (I/O) controller 1860. Note that the display controller 1830 and I/O controller 1860 may be integrated together, and the display may also provide input.
  • the display controller 1830 controls in the conventional manner a display on a display device 1835 which typically is a liquid crystal display (LCD) or similar flat-panel, small form factor display.
  • the input/output devices 1855 can include a keyboard, or stylus and touchscreen, and may sometimes be extended to include disk drives, printers, a scanner, and other input and output devices, including a mouse or other pointing device.
  • the display controller 1830 and the I/O controller 1860 can be implemented with conventional well known technology.
  • a digital image input device 1865 can be a digital camera which is coupled to an I/O controller
  • the non-volatile storage 1850 is often a FLASH memory or read-only memory, or some combination of the two.
  • a magnetic hard disk, an optical disk, or another form of storage for large amounts of data may also be used in some embodiments, though the form factors for such devices typically preclude installation as a permanent component of the device 1800.
  • the term "machine-readable medium" or "computer-readable medium" includes any type of storage device that is accessible by the processor 1810 and also encompasses a carrier wave that encodes a data signal.
  • the device 1800 is one example of many possible devices which have different architectures.
  • devices based on an Intel microprocessor often have multiple buses, one of which can be an input/output (I/O) bus for the peripherals and one that directly connects the processor 1810 and the memory 1840 (often referred to as a memory bus).
  • the buses are connected together through bridge components that perform any necessary translation due to differing bus protocols.
  • the device 1800 is controlled by operating system software which includes a file management system, such as a disk operating system, which is part of the operating system software.
  • One example of an operating system software with its associated file management system software is the family of operating systems known as Windows CE® and Windows® from Microsoft Corporation of Redmond, Washington, and their associated file management systems.
  • Another example of an operating system software with its associated file management system software is the Palm® operating system and its associated file management system.
  • the file management system is typically stored in the non-volatile storage 1850 and causes the processor 1810 to execute the various acts required by the operating system to input and output data and to store data in memory, including storing files on the non-volatile storage 1850.
  • Other operating systems may be provided by makers of devices, and those operating systems typically will have device-specific features which are not part of similar operating systems on similar devices.
  • WinCE® or Palm® operating systems may be adapted to specific devices for specific device capabilities.
  • Device 1800 may be integrated onto a single chip or set of chips in some embodiments, and typically is fitted into a small form factor for use as a personal device. Thus, it is not uncommon for a processor, bus, onboard memory, and display and I/O controllers to all be integrated onto a single chip. Alternatively, functions may be split into several chips with point-to-point interconnection, causing the bus to be logically apparent but not physically obvious from inspection of either the actual device or related schematics.
  • the afore described computer system may advantageously provide a particular spam filter system that provides a method for filtering spam in a feed, the method using a Bayesian filtering technique and characterized in that: it incorporates a complex collection of features that identify spam in feeds and minimizes the integration into a search index of irrelevant and somewhat deceitful use of internet content feeds for public consumption by recognizing that feeds are more rich in content than similar email filtering technologies, and utilizes the structure of the internet as well as the HTML that makes up a web site in order to identify feeds that are not desirable for a public search engine to be indexing, and the filter and filtering method contains context based flexibilities, threshold based flexibilities and multi-environment flexibilities to make the filter and filtering method useful and applicable to a range of tasks for a feed search engine or similar system.
  • aspects of the invention may also independently provide a method for filtering spam in a feed, the method using a Bayesian filtering technique and characterized in that: it incorporates a complex collection of features that identify spam in feeds and minimizes the integration into a search index of irrelevant and somewhat deceitful use of internet content feeds for public consumption by recognizing that feeds are more rich in content than similar email filtering technologies, and utilizes the structure of the internet as well as the HTML that makes up a web site in order to identify feeds that are not desirable for a public search engine to be indexing, and the filter and filtering method contains context based flexibilities, threshold based flexibilities and multi-environment flexibilities to make the filter and filtering method useful and applicable to a range of tasks for a feed search engine or similar system.
  • Embodiments of the invention may also provide a spam feed filter for filtering spam in a feed, the filter using a Bayesian filtering technique and characterized in that: it incorporates a complex collection of features that identify spam in feeds and minimizes the integration into a search index of irrelevant and somewhat deceitful use of internet content feeds for public consumption by recognizing that feeds are more rich in content than similar email filtering technologies, and utilizes the structure of the internet as well as the HTML that makes up a web site in order to identify feeds that are not desirable for a public search engine to be indexing, and the filter and filtering method contains context based flexibilities, threshold based flexibilities and multi-environment flexibilities to make the filter and filtering method useful and applicable to a range of tasks for a feed search engine or similar system.
  • references to processes are understood to be performed in a computer having a processor and a memory coupled to the processor. These computers may be considered to be systems or subsystems or functional blocks depending upon the architecture of the overall system and for example, the distribution of functional responsibilities within the overall system.
  • the feed crawling may occur in or be performed by a network server or servers adapted to perform the feed crawling method.
  • the present invention also relates to apparatus for performing the operations herein.
  • This apparatus may be specially constructed for the required purposes, or it may comprise a general purpose computer selectively activated or reconfigured by a computer program stored in the computer.
  • a computer program may be stored in a computer readable storage medium, such as, but not limited to, any type of disk including floppy disks, optical disks, CD-ROMs, and magneto-optical disks, read-only memories (ROMs), random access memories (RAMs), EPROMs, EEPROMs, magnetic or optical cards, or any type of media suitable for storing electronic instructions, each coupled to a computer system bus.
  • the algorithms and displays presented herein are not inherently

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Transfer Between Computers (AREA)
  • Computer And Data Communications (AREA)

Abstract

A feed crawling system, method, and computer program product. A spam filter and method for filtering. A system and method for feed crawling with spam filtering. A computer system and associated method and computer program product for crawling content feeds, the computer system comprising: at least one processor for executing at least one process; a database providing a storage for storing location information or universal reference locators (urls); a first process for prioritizing a list of urls to be crawled; a parallelized crawler process for crawling the urls and storing the results in the database; and an indexing process for indexing the database for a user to search.

Description

FEED CRAWLING SYSTEM AND METHOD AND SPAM FEED FILTER
CROSS-REFERENCE TO RELATED APPLICATIONS
This application claims the benefit of priority under 35 U.S.C. 119 to U.S. Provisional Patent Application Serial No. 60/825,114 entitled Feed Crawling System filed 08 September 2006 and naming James Ruga as inventor; to U.S. Provisional Patent Application Serial No. 60/824,903 entitled Spam Feed Filter filed 07 September 2006 and naming Rebecca Berrigan as inventor; to U.S. Patent Application Serial No. 11/850,592 entitled Feed Crawling System and Method and Spam Feed Filter filed 05 September 2007 and naming James Ruga and Rebecca Berrigan as inventors; and to U.S. Patent Application Serial No. 11/850,577 entitled Feed Crawling System and Method filed 05 September 2007 and naming James Ruga as inventor.
FIELD OF THE INVENTION
This invention pertains generally to systems and methods for developing information from a network of computers such as from the Internet, and more particularly to systems and methods for feed crawling and for storing and analyzing the data or information from the crawled feeds.
BACKGROUND
RSS and Atom are content syndication specifications of XML feed formats that can be used in order to publish text, images, audio, video and combinations of these items. Collectively, in general these are content feeds that can be easily published and instantly syndicated out to millions of potential users on the Internet. Crawling (or spidering) is a common method of gathering large bodies of information from the Internet. Crawlers generally have similar characteristics and crawler design is typically focused on gathering as much information as possible and storing that information for later cataloging and analysis and search.
A general crawling process typically includes a number of traits. The process typically includes a method of sending website urls to the crawler. The process typically also includes a method of sending/receiving requests from the crawler to the Internet. Further, the process typically includes a method of parsing the results received from the Internet to gather data. Moreover, the process typically includes a method of storing the data for indexing, search and analysis.
There has been previous work by several people examining various ways to improve crawler efficiency by relevancy for HTML pages, but the focus of this work has been link analysis, navigation structure, text size, text emphasis, and the like, in order to guide a crawler in determining what pages and content are relevant to a particular topic. In particular, hyperlinks and page navigation are common elements in almost all research on this topic. Much of the previous work in this area was done in early 1999 - 2000, and more recent work has not revisited this issue in light of the changes to web document publishing formats, which have changed significantly with the introduction of XML feeds.
Feeds are syndicated XML messages announcing updates (of any type of information) to a web site on the internet. A feed search engine may collect internet feeds and index them to provide a searchable repository for the public. The repository may include feeds that are deemed to masquerade as information updates yet are not original, intelligible or informative and one may refer to these feeds as Spam. Likewise, entire web sites are constructed to imitate a weblog with the primary objective being revenue generation for the owner of the weblog via web click based advertising and/or leading the user to a website where the user is exposed to advertising or enticed to purchase goods. These weblogs are referred to as 'Splogs' as they are considered spam, yet take the form of a weblog.
Accurate statistics on the occurrence of spam in the feed realm are not available, as the definition of spam is subjective. Initial estimates based on the inventor's research suggest that on the order of 40% of all feeds encountered by feed crawlers represent spam. This means that around 40% of the feeds in a searchable index of feeds are undesirable for a user to see in search results, which poses a tangible compromise in the perceived quality of a feed search engine.
Spam is not new to the internet realm. Currently much effort has been directed towards filtering spam in email. Email spam filtering often uses various techniques to identify spam. Weblog discussions of how to use a Bayesian filter to filter 'Weblog Spam' or 'Splogs' from the Blogosphere have occurred. However, the current focus is more targeted towards eliminating spam being posted as comments on the blogger's site, rather than the wider focus of feeds in general.
BRIEF DESCRIPTION OF THE DRAWINGS
The present invention is illustrated by way of example in the accompanying drawings. The drawings should be understood as illustrative rather than limiting.
FIG. 1 illustrates an embodiment of a crawler system.
FIG. 2 illustrates another embodiment of a crawler system.
FIG. 3 illustrates an embodiment of a job server used in a crawler system.
FIG. 4 illustrates an embodiment of a process of dispatching jobs to crawlers.
FIG. 5 illustrates an embodiment of a process of crawling feeds.
FIG. 6 illustrates an embodiment of a process of determining how to reprioritize urls.
FIG. 7 illustrates an embodiment of a network which may be used in conjunction with embodiments of a feed crawling system.
FIG. 8 illustrates an embodiment of a machine which may be used in conjunction with embodiments of a feed crawling system.
FIG. 9 illustrates an embodiment of a feed spam filter.
FIG. 10 illustrates another embodiment of a feed spam filter.
FIG. 11 illustrates an embodiment of a feed spam filter with inputs and outputs.
FIG. 12 illustrates an embodiment of a system including a feed spam filter.
FIG. 13 illustrates an embodiment of a process of filtering spam feeds.
FIG. 14 illustrates another embodiment of a process of filtering spam feeds.
FIG. 15 illustrates an embodiment of a system or network in which a feed spam filter may operate.
FIG. 16 illustrates an embodiment of a system which may operate with a feed spam filter.
SUMMARY
A feed crawling system, method, and computer program product. A spam filter and method for filtering. A system and method for feed crawling with spam filtering.
A computer system for crawling content feeds, the computer system comprising: at least one processor for executing at least one process; a database providing a storage for storing location information or universal reference locators (urls); a first process for prioritizing a list of urls to be crawled; a parallelized crawler process for crawling the urls and storing the results in the database; and an indexing process for indexing the database for a user to search.
A method for crawling content feeds, the method comprising: at least one processor for executing at least one process; providing a database providing a storage for storing location information or universal reference locators (urls); executing a first process for prioritizing a list of urls to be crawled; executing a parallelized crawler process for crawling the urls and storing the results in the database; and executing an indexing process for indexing the database for a user to search.
DETAILED DESCRIPTION OF EXEMPLARY EMBODIMENTS
A system, method and apparatus is provided for a feed crawling system. The specific embodiments described in this document represent examples or embodiments of the present invention, and are illustrative in nature rather than restrictive. In the following description, for purposes of explanation, numerous specific details are set forth in order to provide a thorough understanding of the invention. It will be apparent, however, to one skilled in the art that the invention can be practiced without these specific details. In other instances, structures and devices are shown in block diagram form in order to avoid obscuring the invention.
Reference in the specification to "one embodiment" or "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment is included in at least one embodiment of the invention. The appearances of the phrase "in one embodiment" in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. Features and aspects of various embodiments may be integrated into other embodiments, and embodiments illustrated in this document may be implemented without all of the features or aspects illustrated or described.
A feed search system may use a crawler system to collect the data from feeds and organize it into a searchable body of knowledge, which is convenient for an end user to search based on keywords representing topics. However, the crawling process by its nature is typically random in the order in which feeds are crawled. This means that feeds are actually cataloged in a searchable database in the order in which they were crawled, and there is no predictable way to dictate when information on a particular topic will become available in the search results. In other words, a user searching for information on a particular event of interest to them, whether that event is regional or global, may or may not find information on that event simply due to the fact that a particular feed may or may not have been crawled on that topic. This may greatly impair the usability of the search engine.
The timeliness of information in the realm of feeds is important - and one reason for the existence of feeds. The ease with which information can be syndicated means that more people can publish information on regional and global events faster than all of that information can be collected and cataloged by any crawling system. It is thus potentially important to prioritize the crawling process and intelligently collect data from feeds that are of more interest to search engine users, before crawling information that is of less interest. XML feeds can update several times per minute, and can carry staggering amounts of information compared to HTML. The combination of these factors makes the needs of crawling XML feeds a specialized problem to solve. Thus, a feed crawling system may differ from HTML crawlers in a number of ways. A single XML feed can carry large amounts of information (data) and very little style information, hence there is relatively little work to do in parsing the data. HTML crawling involves a great deal of parsing (separating data from style information) and is prone to tainting the relevancy of the data (spamming).
Updates occur more frequently on feeds than on HTML-only web sites, so feeds must be crawled frequently to index new data in a timely fashion. Each feed overall must be prioritized based on its degree of relevance to a particular topic for efficiency, while at the same time each data item from the feed is classified according to its relevancy to the same or a similar list of topics. Since there are potentially several orders of magnitude more raw data from feeds than from HTML websites, only feeds that have topics of interest to users of the search engine should be crawled in some embodiments. In other embodiments, frequency of crawling is determined based on apparent relevance. Additionally, a crawler may distinguish possible spam feeds and posts from genuine feeds and posts on a particular topic to make search results more useful. HTML crawlers are focused on crawling everything and then sorting out the information post-crawl, whereas feed crawlers may lose valuable time by waiting to determine whether a feed is relevant. In one embodiment, a feed crawling system uses a combination of components from a feed search engine to determine what feeds to crawl and at what priority. These search engine components include: a relevancy engine, a spam filter, an indexer and a search API. The anatomy of the crawler itself includes a crawler job server where input from these components is synthesized to make the decisions necessary to control a parallelized distributed crawling network of machines.
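The prioritization described above can be sketched as a priority queue. This is a minimal illustration under stated assumptions, not the method of the embodiments: the `score` formula, the `CrawlJob` type, and the input tuple layout are all hypothetical, and the relevance and spam inputs stand in for whatever the relevancy engine and spam filter would supply.

```python
import heapq
from dataclasses import dataclass, field


@dataclass(order=True)
class CrawlJob:
    # heapq is a min-heap, so the sort key is the negated priority score.
    sort_key: float
    url: str = field(compare=False)


def score(relevance: float, minutes_since_crawl: float, updates_per_hour: float) -> float:
    """Hypothetical priority: relevant, fast-updating, stale feeds rank highest."""
    expected_new_items = updates_per_hour * minutes_since_crawl / 60.0
    return relevance * (1.0 + expected_new_items)


def build_queue(feeds):
    """feeds: iterable of (url, relevance, minutes_since_crawl, updates_per_hour, is_spam)."""
    heap = []
    for url, relevance, minutes, rate, is_spam in feeds:
        if is_spam:
            continue  # feeds flagged as spam are never scheduled
        heapq.heappush(heap, CrawlJob(-score(relevance, minutes, rate), url))
    return heap


def next_urls(heap, n):
    """Pop up to n URLs in descending priority order."""
    return [heapq.heappop(heap).url for _ in range(min(n, len(heap)))]
```

A job server along these lines would call `next_urls` each time a crawler becomes free, so that the most relevant and most overdue feeds are dispatched first.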
In one embodiment, this solution to crawling XML feeds incorporates many of the accepted methods of managing networks of crawlers and parallelization. It builds on these methodologies to make the crawling more efficient from a hardware usage perspective, and at the same time attempts to optimize the available hardware by attempting to optimize what the crawler spends time crawling. The crawler job server helps make the whole system efficient and creates the statistical information used to make critical decisions about where to allocate crawling resources. This potentially provides an ability to adapt the crawls in order to respond to the need for results in searches on keyword terms. Such adaptability allows for crawling of feeds to occur in a timely manner based on what users currently are searching for, and thus are likely to continue to search for.
One embodiment of a crawling system is shown in FIG. 1. Another embodiment of the system is shown in FIG. 2. Various parts of the system may be expected to perform various tasks. The parallelized distributed crawler is the part of the crawler that does the actual work in some embodiments. This over-emphasizes the importance of the parallelized distributed crawler, but the parallelized distributed crawler may perform a number of tasks in one embodiment. The parallelized distributed crawler may determine if a particular url/domain has been blacklisted (spam) - this may be through use of a separate spam filter, for example. The parallelized distributed crawler may also request the content from a url. Moreover, the parallelized distributed crawler may determine if the received content is in fact an XML feed. Also, the parallelized distributed crawler may determine what type of XML feed data has been received, such as atom, rss, rdf or opml, for example. If an HTML page is received instead of a valid XML feed, the parallelized distributed crawler may attempt to find a corresponding XML feed for a site. For an actual feed, the parallelized distributed crawler may prepare the feed for parsing (normalization to UTF8, decoding characters, correcting well-formedness errors, for example). Additionally, the parallelized distributed crawler may parse the feed (e.g. extracting the data based on the feed type schema). Likewise, the parallelized distributed crawler may apply a spam filter to determine if the content (and/or feed) is spam. This may be separate, based on received content, rather than based on a url, for example.
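The step of deciding whether received content is an XML feed, and of which type, might look like the following sketch. It assumes that inspecting the root element name is sufficient, which is a simplification; the function names are hypothetical, and only Python's standard `xml.etree.ElementTree` parser is used.

```python
import xml.etree.ElementTree as ET

# Root element local names mapped to the feed types named in the text.
ROOT_TO_TYPE = {"feed": "atom", "rss": "rss", "rdf": "rdf", "opml": "opml"}


def local_name(tag: str) -> str:
    """Strip an XML namespace: '{http://www.w3.org/2005/Atom}feed' -> 'feed'."""
    return tag.rsplit("}", 1)[-1].lower()


def detect_feed_type(content: bytes):
    """Return 'atom', 'rss', 'rdf', 'opml', or None if no feed is recognized."""
    try:
        root = ET.fromstring(content)
    except ET.ParseError:
        return None  # not well-formed XML, as is common for HTML pages
    return ROOT_TO_TYPE.get(local_name(root.tag))
```

A `None` result is the point at which a crawler could fall back to looking for a corresponding feed for the site, or attempt to repair well-formedness errors before parsing again.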
The parallelized distributed crawler may analyze the received data to determine attributes (and/or metadata), for example. This may include type of content (text, images, audio, video, etc). It may also include overall language of the content (and of the feed by percentage, for example). It may further include relevant taxonomy of the content (and of the feed). In some embodiments, the parallelized distributed crawler may store the data into a database schema that is suitable for the indexer to work from. In one embodiment, the parallelized crawlers run on one computer consuming as much of its resources as is practical (or as is allowed). Parallelization may be simply accomplished by spawning off child crawler processes each with their own unique list of urls to process. Alternatively, this can be accomplished by threading on some platforms. On completion of the processing of the list of URLs, the child crawler reports back to the parent with the results of the crawl; and the parent is then free to spawn a new child with more urls to crawl.
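The parent and child arrangement described above may be sketched as follows, using the threading alternative mentioned in the text rather than separate processes. All names (`split_into_batches`, `child_crawl`, `parent_crawl`) and the `fetch` callable are assumptions made for illustration only.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_batches(urls, batch_size):
    """Give each child its own unique, non-overlapping list of urls."""
    return [urls[i:i + batch_size] for i in range(0, len(urls), batch_size)]


def child_crawl(batch, fetch):
    """Crawl one batch and report a per-url result back to the parent."""
    report = {}
    for url in batch:
        try:
            report[url] = ("ok", fetch(url))
        except Exception as exc:  # a failed url must not stop the batch
            report[url] = ("error", str(exc))
    return report


def parent_crawl(urls, fetch, batch_size=2, max_children=4):
    """Run child crawls concurrently and merge their reports."""
    merged = {}
    batches = split_into_batches(urls, batch_size)
    with ThreadPoolExecutor(max_workers=max_children) as pool:
        for report in pool.map(lambda b: child_crawl(b, fetch), batches):
            merged.update(report)
    return merged
```

In this sketch a worker that finishes a batch is immediately reused for the next one, which loosely mirrors the parent being free to spawn a new child with more urls.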
Turning to the figures, FIG. 1 illustrates an embodiment of a crawler system. Generic processes 110 include last indexed crawler 103, fast index crawler 106, discovery crawler 109, manual crawler 112 which receives urls from a batch process 115 and ping crawler 118. Most of the generic processes draw data from a database 130 of urls and content to determine which urls should be crawled next, or to prioritize or reprioritize a list of urls for crawling. Ping crawler 118 receives data from a pings database 140 of pings from feeds indicating a feed update. Last indexed crawler 103 thus indicates what was least recently indexed, fast index crawler 106 indicates what url typically updates quickly and has not been crawled recently, discovery crawler 109 indicates what urls are likely to be new feeds, and manual crawler 112 indicates what urls a user has submitted for crawling, for example. A generic parallelized crawler 120 is implemented through use of a number of child crawler processes 125. Results of the child crawler processes 125 are fed through finishing routines 135, which can parse feeds, filter for spam, retrieve feeds in some instances, and provide a database interface, for example. Crawler 120, upon crawling a feed, provides a UDP broadcast to feed stream 170 and to the rest of the system. Feed stream 170 may provide information to a TCP client 175 indicating a feed was crawled, allowing for external notification. A firewall protects the system from unauthorized external access.
The system also receives authorized information from internet 165 through feed mesh 160 (a series of hooks into feed subscriptions, for example) and through apache server 155 which may be interfaced with a ping api. Data may also be received through http/telnet interface 150. All of this received data provides inputs to pings database 140, and may provide information to directory and change file robot 145. Robot 145 may be expected to maintain a directory of information about feeds, and to detect when new feeds are introduced, feeds are no longer in existence, feed data formats change, or other changes occur.
In another embodiment, this system is distributed among various machines or processors. The crawlers can co-exist on multiple networked machines simultaneously, without interfering or overlapping with jobs being crawled by the other machines. As the number of incoming jobs to the system increases, jobs may be distributed to crawler machines in the network that are available to handle the added work, such as through load-balancing techniques.
It may also be useful to allow the distributed system to recruit idle machines allocated to other tasks within a network, but this need not be implemented to provide the crawler.
The generic processes (front end job interfaces) are embodied as smaller bits of code that exist separate from the actual crawling code itself in some embodiments. Each generic process is specialized in some way to draw its input from various sources of potential feeds to crawl. These generic processes can perform various functions. For example, a process may select and compile the actual list of urls from the database that needs to be crawled. Similarly, a process may periodically check existing HTML web sites for the addition of, or replacement by XML feeds. Likewise, a process may select lists of feed urls to crawl by periodic interval or select lists of feed urls to crawl by overall age since last crawl. Moreover, a process may crawl web sites and feed urls on demand as submitted one at a time. Other processes may be implemented in various embodiments as well.
Another embodiment of a system may be illustrative of alternative implementation opportunities. FIG. 2 illustrates another embodiment of a crawler system. Generic configurators 210 generally use processes previously described with respect to FIG. 1. However, adaptive crawler input 290 receives data from search terms 293 and topical events 296 to provide further options for urls to be crawled. Similarly, sequential ping crawler 219 and probabilistic ping crawler 216 use ping data to determine which recently updated feeds to crawl next, with some chosen based on a first-in-first-out model and others chosen based on probability that the updated feed will be searched for.
Job server 285 determines which urls should be crawled next by parallelized crawler 120, and directs dispatch of child crawlers 125 in some embodiments. Also, data is stored in a distributed database 230. Thus, the embodiment of FIG. 2 may be adaptable to changing searches of feeds, and may produce more timely results than a crawler without adaptive features.
The crawler job server provides business intelligence and performs other functions as well. The job server may accept the lists of urls from the generic processes and determine if urls are relevant to user searched topics. The job server may also mine known XML feeds (and HTML websites to locate additional XML feeds) for additional relevant content. Moreover, the job server may prioritize the urls to be crawled. Likewise, the job server may break up the lists of urls into jobs to assign to the distributed crawler machines with available resources and throttle crawl frequency on feeds that have relevance to a popular topic. Additionally, the job server may delay jobs or stop jobs on the crawler machine network in favor of crawling more popular content on demand. The job server may also record and map topical and statistical trends on topics for advertising targeting and report the statistics in a human readable log file or user interface, for example.
A further illustration of the job server used in some embodiments may also provide additional insight. FIG. 3 illustrates an embodiment of a job server used in a crawler system. System 300 includes primarily the job server and parallelized crawler, and surrounding components. Job server 285 uses a pool of urls 305 with either an ordered list or tags indicating priority to determine which jobs to dispatch. A url is first checked by spam filter 315 for spam characteristics. The url then is prioritized by prioritizing routines 325, determining if the url is relevant, currently popular, fits an emerging trend, or was selected by a user, for example, and then is passed to the network control of crawler 120. A search api 335 used to search feeds may provide data on what is currently popular, for example.
Crawler 120 causes a child process 125 to crawl the url in question, determining what data is available through the url. Crawl job results 365 are thus produced, and may feed back through throttle adjustment module 375 to alter operation of crawler 120. Results 365 are also compiled into statistics through statistics generation module 355 and provided to database 130. Moreover, statistics generator 355 may also supply information to ad server 385 to allow for monetization or commercialization of the feed search engine and crawler through advertising.
In some embodiments, the overall crawler relies on feedback from multiple systems. Each such system may have its own specifications with unique properties. Various portions of the crawler focus on how that information is used to make decisions to perform a more efficient crawl.
The crawler may use a sorting and prioritizing algorithm to determine what urls to crawl. The crawler job server accepts urls from multiple inputs including lists of urls that it generates itself. These lists represent a possible pool of urls to crawl. From this pool, the urls with most relevance to each of the search terms entered by users of the search engine should be crawled first. This way posts from the feeds on these urls can be added to the search results as quickly as possible, making the search results timely.
This is accomplished in one embodiment by assigning each url a relevancy value between 0 and 1, for example, which is calculated based on multiple components. (Such components may include semantic relevancy to keywords and/or groups of keywords, frequency of new postings on that url, popularity of the overall topic, and authority of a feed's content on a particular subject, for example.) Given a list of search terms along with a frequency of use in user searches for each term or group of terms, a crawling priority can be assigned to each url as a function of the relevancy value and the demand for results on a topic (implied from frequency of searches on a particular term over time). At any given instant a snapshot of the pool of urls to be crawled may be sorted based on this priority.
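The text above leaves the priority function unspecified. The following is a minimal sketch under one assumed choice: priority is the largest product of a url's relevancy to a term (between 0 and 1) and that term's share of all recent searches. The function names and the multiplicative form are assumptions for illustration, not the described method.

```python
def crawl_priority(relevancy_by_term, search_counts):
    """Assumed priority: best (relevancy * share of searches) over all terms."""
    total = sum(search_counts.values())
    if total == 0:
        return 0.0
    return max(
        (relevancy_by_term.get(term, 0.0) * count / total
         for term, count in search_counts.items()),
        default=0.0,
    )


def sort_pool(pool, search_counts):
    """Return a snapshot of the url pool ordered by descending priority.

    `pool` maps each url to a dict of {term: relevancy}.
    """
    return sorted(
        pool,
        key=lambda url: crawl_priority(pool[url], search_counts),
        reverse=True,
    )
```

Other combinations, such as a weighted sum that also accounts for posting frequency or authority, would fit the description equally well.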
Various processes may be used to implement crawlers and job dispatch functions. FIG. 4 illustrates an embodiment of a process of dispatching jobs to crawlers. Process 400 includes generating a list of urls, receiving updated information, receiving environmental information, reordering the list of urls, dispatching urls to crawlers, receiving responsive data from crawlers and integrating the responsive data into a database. Process 400 and other processes of this document are implemented as a set of modules, which may be process modules or operations, software modules with associated functions or effects, hardware modules designed to fulfill the process operations, or some combination of the various types of modules, for example. The modules of process 400 and other processes described herein may be rearranged, such as in a parallel or serial fashion, and may be reordered, combined, or subdivided in various embodiments.
Process 400 initiates with generation of a list of urls at module 410. This may involve looking up urls or retrieving a previously generated list of urls, and the list may be more in the form of a tagged set of urls rather than a traditional list. At module 420, update information about feeds is received, such as indications that some feeds have recently been updated, for example. Similarly, at module 430, environmental information about searches conducted and external events is received. The data from modules 420 and 430 is used at module 440 to reorder the list of urls or otherwise reprioritize urls for crawling purposes. Urls are then dispatched to a crawler or crawlers to be retrieved at module 450. Responsive data is received from the crawlers at module 460, and the responsive data is integrated into a database or repository at module 470.
Further insight into the process may also be obtained from an understanding of a crawling process. FIG. 5 illustrates an embodiment of a process of crawling feeds. Process 500 includes receiving a url job, filtering for spam, requesting data at the url, determining if the content is usable, determining a type of feed or attempting to translate data, parsing a feed, filtering a feed, skipping a feed if necessary, analyzing a feed, and returning results.
Process 500 begins with receipt of a url job from a job dispatch process, for example, at module 510. At module 515, a determination is made as to whether the url is a spam url. If so, the url is skipped at module 590, and results are returned at module 580.
If the url is not spam, data at the url is requested at module 520. If the content is determined to not be usable at module 530, an attempt is made to translate the feed into something usable at module 535. Such translation may involve transformation of HTML data into XML data, for example. If the translation cannot work, the feed is skipped at module 590. If translation works, or is not needed, the type of feed is determined at module 540. At module 550, the feed data is parsed. Additionally, at module 560, the feed data is filtered, for spam for example, and the feed may be skipped at module 590 if it appears to mainly contain spam. At module 570, the feed is analyzed, extracting information and categorizing that information based on an overall hierarchy, for example. At module 580, results are then returned.
Many feeds ping subscribers upon updates to a feed. Feed publishers that wish to be found and indexed within a feed search engine can ping an external publicly available api when they publish updated information on their feeds. This can then trigger the crawler to enter the url to their feed into a queue which is then crawled in the order in which it was received (roughly). By employing the sorting and prioritizing algorithms above, the crawler job system can specifically select urls that are more important to crawl first from these external pings. This potentially greatly increases the speed at which new data appears in the search results.
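The control flow of process 500 (modules 510 through 590) may be sketched as follows. Each step is passed in as a callable so that the sketch stays independent of any particular fetcher, parser or spam filter; the function name `run_crawl_job`, the parameter names and the shape of the returned dictionary are all hypothetical.

```python
def run_crawl_job(url, *, is_spam_url, fetch, is_usable, translate,
                  detect_type, parse, is_spam_feed, analyze):
    """Walk one url through the modules of process 500 and return a result."""

    def skipped(reason):
        # Module 590 (skip) followed by module 580 (return results).
        return {"url": url, "status": "skipped", "reason": reason}

    if is_spam_url(url):                      # module 515
        return skipped("spam url")
    content = fetch(url)                      # module 520
    if not is_usable(content):                # module 530
        content = translate(content)          # module 535, None on failure
        if content is None:
            return skipped("no usable feed")
    feed_type = detect_type(content)          # module 540
    entries = parse(content, feed_type)       # module 550
    if is_spam_feed(entries):                 # module 560
        return skipped("spam content")
    return {                                  # modules 570 and 580
        "url": url,
        "status": "crawled",
        "type": feed_type,
        "analysis": analyze(entries),
    }
```

Because the modules may be reordered or combined in various embodiments, this ordering is only one possible arrangement.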
Adapting prioritization of urls provides much power to the system, potentially allowing for such improved search results. FIG. 6 illustrates an embodiment of a process of determining how to reprioritize urls. Process 600 may represent a process or a set of modules implementing functions, for example. Process 600 includes receiving external data from various sources, analyzing changes in such data, and determining which feeds are thereby affected.
Process 600 includes receiving feed update notifications at module 610, receiving search query data at module 620 and receiving actual updated data at module 630. All of these modules feed into analysis of changes in data at module 640, where a determination is made as to how searches are changing, what types of feeds are becoming more active (and thus potentially more interesting to searchers) and what types of data are changing. This analysis results in checks on feeds of at least three different types in one embodiment. At module 650, feeds are searched for and tagged based on relevance to external data changes, that is, the direction in which external data appears to be moving. At module 660, feeds with recent updates are tagged, potentially in conjunction with some indication of relevance. At module 670, feeds not recently crawled are tagged, again potentially in conjunction with some indication of relevance. The tagged feeds (or corresponding urls) may then be prioritized into a list or reprioritized if already part of a list.
Another potential benefit of this method is removal of pings that lead to urls that have spam feeds. Essentially, feeds that lead to spam would also have a low crawling priority and so they would either be delayed or simply not crawled at all (depending on their priority and degree of spam as calculated by the spam filter).
The crawler may also mine known XML feeds and HTML sites to locate additional XML feeds for additional relevant content. Variations in the results of searches on blogs or feeds are typically temporal in nature. Any particular set of search results on any given topic is theoretically outdated the instant any person on the Internet publishes something new about that topic. In these cases, the crawler may respond by immediately crawling the new content to add the new content to the search results. Often, for more obscure search terms and groups of terms, there is a limited number of search results available. When this happens for search terms in relation to popular events (such as breaking news), the crawler may recognize both a change in the search trends (increase in popularity of that particular search) and the lack of content available for results. The response to this is to begin crawls to discover possible new content from the known list of feeds that would add to the existing search results. This mechanism depends to some degree on keeping historical references to a baseline frequency of searches for certain topics and an average expected number of results. Then, as part of the prioritizing process, the total results for any given popular search can be queried directly from the API. If the number of results for a particular topic is not sufficient based on some previously defined threshold, then the crawler job server may find more urls to add to the crawling pool in relation to that particular search. To satisfy this quota for content on a search term or terms, the database is queried for the known feeds on a topic according to a pre-defined taxonomy to find feeds that (1) have a relatively high post rate and (2) have not been crawled recently. The crawler then selects these urls to crawl, sorting them based on the average post rate, in an attempt to discover additional content to add to the search results.
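The selection described above may be sketched as follows. The record fields (`topic`, `post_rate`, `last_crawled`), the function name and all threshold parameters are assumptions for illustration; in the described system the feeds would come from a database query against the taxonomy rather than from an in-memory list.

```python
from datetime import datetime, timedelta


def select_feeds_to_mine(feeds, topic, result_count, min_results,
                         min_post_rate, stale_after, now):
    """Pick known feeds to crawl when a search has too few results.

    Returns urls of feeds on `topic` that post often and have not been
    crawled within `stale_after`, ordered by descending post rate.
    """
    if result_count >= min_results:
        return []  # enough results already; no extra crawling needed
    cutoff = now - stale_after
    candidates = [
        feed for feed in feeds
        if feed["topic"] == topic
        and feed["post_rate"] >= min_post_rate
        and feed["last_crawled"] < cutoff
    ]
    candidates.sort(key=lambda feed: feed["post_rate"], reverse=True)
    return [feed["url"] for feed in candidates]
```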
This essentially allows the crawler to mine the data in a list of known XML feeds. However, it depends on having the feeds categorized via a combination of the relevancy engine and a potentially human-defined and -reviewed taxonomy.
For cases of exceptionally low numbers of search results, the crawler can also examine a list of HTML sites to see if XML feeds can be discovered from sites that actually contain search term keywords within the domain name of the url. This is potentially a "last attempt" to mine data within the database for additional content. The HTML list may be composed of sites that have been submitted for indexing (via a feed search engine's ping API, for example) and failed on previous crawls to have a corresponding feed found. These sites are saved and re-examined periodically, but in the case of actually needing additional search results this mechanism may increase the frequency with which certain HTML sites are rechecked for new feeds.
Throttling crawl frequency in response to search topic popularity may be used to increase the effectiveness of crawling in some embodiments. Aside from simply triggering mining of the database for a quantity of search results, there may be times when a search topic has a sustained period of activity, and thus would warrant repeated frequent crawls on XML feeds that had high relevance to that search. An example of this would be when random events become popular topics or breaking news, such as Hurricane Katrina, a Mars probe, or an unexpected political transformation, to name a few examples.
The crawler job server detects the appearance of a new term or combination of terms from the search api that either has not been previously searched or has not been searched for an extended period of time. In this case the crawler job server analyzes the searches to see if a growth pattern has begun to emerge in the search frequencies of this search. In response to determining a clear growth pattern over a short period of time for a particular search, the crawler then compiles a list of sites that have high relevancy to that search and performs repeated crawls on these sites at regular intervals to gather new content as quickly as possible. Likewise, as these searches subside, the relevant sites are crawled less frequently.
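The text does not define what counts as a clear growth pattern. The following minimal sketch assumes one simple test: search counts rise strictly over the most recent periods and the latest count is at least a given multiple of the earliest count in that window. The function names, the window size, the factor and the interval values are all assumptions.

```python
def shows_growth(counts, window=3, factor=2.0):
    """Assumed growth test over the last `window` per-period search counts."""
    recent = counts[-window:]
    if len(recent) < window:
        return False  # not enough history to judge a trend
    increasing = all(a < b for a, b in zip(recent, recent[1:]))
    # max(..., 1) lets a term that starts from zero searches still qualify.
    return increasing and recent[-1] >= factor * max(recent[0], 1)


def crawl_interval_minutes(counts, base=60, fast=5):
    """Crawl relevant sites more often while a search is growing."""
    return fast if shows_growth(counts) else base
```

Once the counts stop growing, the interval falls back to the base value, which corresponds to crawling the relevant sites less frequently as searches subside.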
Crawler network control can further be used to improve crawler effectiveness in some embodiments. All of the inputs to the crawler job server are "post search", meaning the input is always received after the search is performed. This causes the entire crawling system to be responsive to emerging trends of search patterns rather than have a pre-defined sequence of events or timings it must follow. This is useful to the design especially when dealing with feeds. Other crawling systems search outward from a seed site and attempt to rank based on the content in relation to the seed site. This idea is potentially flawed when cataloging feeds because the calculated rank based on the content would change as soon as a new post is made to the feed. This makes the relevancy ranking nearly impossible to calculate. A feed crawler is thus typically less of a traditional crawler and more of a search tool because of the way it responds to the input from the search api and relevancy engines. It starts with sites that were last known to have content most like what is being searched for, and then populates or updates the database with as much appropriate content for the current trend in searches as it can find.
The distributed aspects of the crawling system overall are potentially helpful to this goal. The crawler job server is able to communicate to the individual machines in the crawler network. It is able to interrupt crawls on lower priority urls in order to recruit crawling resources to crawl more relevant higher priority urls. This allows the crawler to speed the gathering of results for searches that show a clear increase in popularity.
Additionally, the crawler job server "balances" the work of crawling over the network of crawler machines. This allows the network to be composed of heterogeneous hardware, such that a job composed of several urls to be crawled can be handed to any machine in the network and the next job would not be handed to that same machine until it had reported back to the crawler job server a successful job completion, for example.
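The balancing rule described above, in which a machine receives no further job until it reports completion, may be sketched as follows. The class name `JobBalancer` and its methods are hypothetical, and the sketch keeps all state in memory rather than communicating over a network.

```python
from collections import deque


class JobBalancer:
    """Hand each job to an idle machine; a busy machine gets nothing new."""

    def __init__(self, machines):
        self.idle = deque(machines)   # machines ready for work
        self.busy = {}                # machine -> job in progress
        self.pending = deque()        # jobs waiting for a machine

    def submit(self, job):
        """Queue a job and return any (machine, job) assignments made."""
        self.pending.append(job)
        return self._dispatch()

    def report_done(self, machine):
        """Record a completed job and hand the machine its next job, if any."""
        del self.busy[machine]
        self.idle.append(machine)
        return self._dispatch()

    def _dispatch(self):
        assigned = []
        while self.pending and self.idle:
            machine = self.idle.popleft()
            job = self.pending.popleft()
            self.busy[machine] = job
            assigned.append((machine, job))
        return assigned
```

Because assignment depends only on which machines are idle, faster machines naturally complete more jobs, which is one way heterogeneous hardware can share the work.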
In terms of commercial operation, crawl notification streaming can be useful in some embodiments. Search engine partners may want to know precisely what urls have been recently crawled for various purposes. Because of the complexities in choosing what urls get crawled and in which priority, this can be difficult. To address this problem the generic parallelized crawlers may broadcast to the internal network via UDP the results of successful crawls in an embodiment. These broadcasts can be picked up by a crawl-streaming server, which then assembles them as received into a stateless spool that can be listened to via a network port on a TCP connection. Moreover, such a broadcast can also be used to maintain information about which urls are successfully crawled and which are not, thereby providing data about which urls are still active, for example.
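A crawl notification of the kind described above may be sketched as follows. The JSON message layout, the function names, the port number and the broadcast address are assumptions for illustration; the text specifies only that results are broadcast internally via UDP.

```python
import json
import socket


def encode_crawl_event(url, status, crawled_at):
    """Serialize one crawl result into a datagram payload (assumed layout)."""
    event = {"url": url, "status": status, "crawled_at": crawled_at}
    return json.dumps(event, sort_keys=True).encode("utf-8")


def decode_crawl_event(datagram):
    """Inverse of encode_crawl_event, as a streaming server might use it."""
    return json.loads(datagram.decode("utf-8"))


def broadcast_crawl_event(payload, port=9999, address="255.255.255.255"):
    """Send the payload as a UDP broadcast on the internal network."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_BROADCAST, 1)
        sock.sendto(payload, (address, port))
```

Since UDP delivery is not guaranteed, a listener built this way should tolerate missing notifications.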
Mapping crawler statistical trends on topics for advertising targeting can also be a useful component of a crawler in some embodiments. The same data used to allow the crawler to monitor searches and changes in search frequencies over time, as well as the same information that allows the crawler to determine the relevance of one url or feed to any particular topic, also can be used to target advertising to search results. In one overall search engine architecture, the crawler is the earliest possible single component that has the ability to calculate priority and changes in search topic trends over time.
Thus, it may be logical to have the same data that is actively being calculated and stored by the crawler job server to also be available to the targeting system of an ad server. This can be accomplished very simply by either direct database interface or by UDP internal broadcast of data to the ad server. That data may then be used to select and cache groups of ads appropriate to deliver with weighted frequency on pages that display search results or directly within subscribed rss feeds of search results.
While various systems and processes have been described, different overall networks and machines may be expected to operate with or as part of such systems and processes. FIG. 7 illustrates an embodiment of a network which may be used in conjunction with embodiments of a feed crawling system. FIG. 8 illustrates an embodiment of a machine which may be used in conjunction with embodiments of a feed crawling system. The following description of FIGs. 7-8 is intended to provide an overview of device hardware and other operating components suitable for performing the methods of the invention described above and hereafter, but is not intended to limit the applicable environments. Similarly, the hardware and other operating components may be suitable as part of the apparatuses described above. The invention can be practiced with other system configurations, including personal computers, multiprocessor systems, microprocessor-based or programmable consumer electronics, network PCs, minicomputers, mainframe computers, and the like. The invention can also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network.
FIG. 7 shows several computer systems that are coupled together through a network 705, such as the internet, along with a cellular or other wireless network and related cellular or other wireless devices. The term "internet" as used herein refers to a network of networks which uses certain protocols, such as the TCP/IP protocol, and possibly other protocols such as the hypertext transfer protocol (HTTP) for hypertext markup language (HTML) documents that make up the world wide web (web). The physical connections of the internet and the protocols and communication procedures of the internet are well known to those of skill in the art. Access to the internet 705 is typically provided by internet service providers (ISP), such as the ISPs 710 and 715. Users on client systems, such as client computer systems 730, 750, and 760 obtain access to the internet through the internet service providers, such as ISPs 710 and 715. Access to the internet allows users of the client computer systems to exchange information, receive and send e-mails, and view documents, such as documents which have been prepared in the HTML format. These documents are often provided by web servers, such as web server 720 which is considered to be "on" the internet. Often these web servers are provided by the ISPs, such as ISP 710, although a computer system can be set up and connected to the internet without that system also being an ISP.
The web server 720 is typically at least one computer system which operates as a server computer system and is configured to operate with the protocols of the world wide web and is coupled to the internet. Optionally, the web server 720 can be part of an ISP which provides access to the internet for client systems. The web server 720 is shown coupled to the server computer system 725 which itself is coupled to web content 795, which can be considered a form of a media database. While two computer systems 720 and 725 are shown in FIG. 7, the web server system 720 and the server computer system 725 can be one computer system having different software components providing the web server functionality and the server functionality provided by the server computer system 725 which will be described further below. Cellular network interface 743 provides an interface between a cellular network and corresponding cellular devices 744, 746 and 748 on one side, and network 705 on the other side. Thus cellular devices 744, 746 and 748, which may be personal devices including cellular telephones, two-way pagers, personal digital assistants or other similar devices, may connect with network 705 and exchange information such as email, content, or HTTP-formatted data, for example.
Cellular network interface 743 is representative of wireless networking in general. In various embodiments, such an interface may also be implemented as a wireless interface such as a Bluetooth interface, IEEE 802.11 interface, or some other form of wireless network. Similarly, devices such as devices 744, 746 and 748 may be implemented to communicate via the Bluetooth or 802.11 protocols, for example. Other dedicated wireless networks may also be implemented in a similar fashion. Cellular network interface 743 is coupled to computer 740, which communicates with network 705 through modem interface 745. Computer 740 may be a personal computer, server computer or the like, and serves as a gateway. Thus, computer 740 may be similar to client computers 750 and 760 or to gateway computer 775, for example. Software or content may then be uploaded or downloaded through the connection provided by interface 743, computer 740 and modem 745.
Client computer systems 730, 750, and 760 can each, with the appropriate web browsing software, view HTML pages provided by the web server 720. The ISP 710 provides internet connectivity to the client computer system 730 through the modem interface 735 which can be considered part of the client computer system 730. The client computer system can be a personal computer system, a network computer, a web tv system, or other such computer system. Similarly, the ISP 715 provides internet connectivity for client systems 750 and 760, although as shown in FIG. 7, the connections are not the same as for more directly connected computer systems. Client computer systems 750 and 760 are part of a LAN coupled through a gateway computer 775. While FIG. 7 shows the interfaces 735 and 745 generically as a "modem," each of these interfaces can be an analog modem, isdn modem, cable modem, satellite transmission interface (e.g. "direct PC"), or other interfaces for coupling a computer system to other computer systems.
Client computer systems 750 and 760 are coupled to a LAN 770 through network interfaces 755 and 765, which can be ethernet network or other network interfaces. The LAN 770 is also coupled to a gateway computer system 775 which can provide firewall and other internet related services for the local area network. This gateway computer system 775 is coupled to the ISP 715 to provide internet connectivity to the client computer systems 750 and 760. The gateway computer system 775 can be a conventional server computer system. Also, the web server system 720 can be a conventional server computer system. Alternatively, a server computer system 780 can be directly coupled to the LAN 770 through a network interface 785 to provide files 790 and other services to the clients 750, 760, without the need to connect to the internet through the gateway system 775.
FIG. 8 shows one example of a personal device that can be used as a cellular telephone (744, 746 or 748) or similar personal device, or may be used as a more conventional personal computer, as an embedded processor or local console, or as a PDA, for example. Such a device can be used to perform many functions depending on implementation, such as monitoring functions, user interface functions, telephone communications, two-way pager communications, personal organizing, or similar functions. The system 800 of FIG. 8 may also be used to implement other devices such as a personal computer, network computer, or other similar systems. The computer system 800 interfaces to external systems through the communications interface 820. In a cellular telephone, this interface is typically a radio interface for communication with a cellular network, and may also include some form of cabled interface for use with an immediately available personal computer. In a two-way pager, the communications interface 820 is typically a radio interface for communication with a data transmission network, but may similarly include a cabled or cradled interface as well. In a personal digital assistant, communications interface 820 typically includes a cradled or cabled interface, and may also include some form of radio interface such as a Bluetooth or 802.11 interface, or a cellular radio interface for example.
The computer system 800 includes a processor 810, which can be a conventional microprocessor such as an Intel Pentium microprocessor or Motorola PowerPC microprocessor, a Texas Instruments digital signal processor, or some combination of the various types of processors. Memory 840 is coupled to the processor 810 by a bus 870. Memory 840 can be dynamic random access memory (DRAM) and can also include static RAM (SRAM), or may include FLASH EEPROM, too. The bus 870 couples the processor 810 to the memory 840, also to non-volatile storage 850, to display controller 830, and to the input/output (I/O) controller 860. Note that the display controller 830 and I/O controller 860 may be integrated together, and the display may also provide input.
The display controller 830 controls in the conventional manner a display on a display device 835 which typically is a liquid crystal display (LCD) or similar flat-panel, small form factor display. The input/output devices 855 can include a keyboard, or stylus and touchscreen, and may sometimes be extended to include disk drives, printers, a scanner, and other input and output devices, including a mouse or other pointing device. The display controller 830 and the I/O controller 860 can be implemented with conventional well known technology. A digital image input device 865 can be a digital camera which is coupled to an I/O controller 860 in order to allow images from the digital camera to be input into the device 800.
The non-volatile storage 850 is often a FLASH memory or read-only memory, or some combination of the two. A magnetic hard disk, an optical disk, or another form of storage for large amounts of data may also be used in some embodiments, though the form factors for such devices typically preclude installation as a permanent component of the device 800. Rather, a mass storage device on another computer is typically used in conjunction with the more limited storage of the device 800. Some of this data is often written, by a direct memory access process, into memory 840 during execution of software in the device 800. One of skill in the art will immediately recognize that the terms "machine-readable medium" and "computer-readable medium" include any type of storage device that is accessible by the processor 810 and also encompass a carrier wave that encodes a data signal.
The device 800 is one example of many possible devices which have different architectures. For example, devices based on an Intel microprocessor often have multiple buses, one of which can be an input/output (I/O) bus for the peripherals and one that directly connects the processor 810 and the memory 840 (often referred to as a memory bus). The buses are connected together through bridge components that perform any necessary translation due to differing bus protocols.
In addition, the device 800 is controlled by operating system software which includes a file management system, such as a disk operating system, which is part of the operating system software. One example of an operating system software with its associated file management system software is the family of operating systems known as Windows CE® and Windows® from Microsoft Corporation of Redmond, Washington, and their associated file management systems. Another example of an operating system software with its associated file management system software is the Palm® operating system and its associated file management system. The file management system is typically stored in the non-volatile storage 850 and causes the processor 810 to execute the various acts required by the operating system to input and output data and to store data in memory, including storing files on the non-volatile storage 850. Other operating systems may be provided by makers of devices, and those operating systems typically will have device-specific features which are not part of similar operating systems on similar devices. Similarly, WinCE® or Palm® operating systems may be adapted to specific devices for specific device capabilities.
Device 800 may be integrated onto a single chip or set of chips in some embodiments, and typically is fitted into a small form factor for use as a personal device. Thus, it is not uncommon for a processor, bus, onboard memory, and display/I-O controllers to all be integrated onto a single chip. Alternatively, functions may be split into several chips with point-to-point interconnection, causing the bus to be logically apparent but not physically obvious from inspection of either the actual device or related schematics.
Embodiments of the above system, method and computer program product may advantageously implement a spam filter. The spam filter in the above described embodiments is not limited to any particular spam filter and a variety of alternatives known in the art or to be developed may be utilized.
Advantageously, but optionally, a particular spam filter and method for spam filtering may be utilized, embodiments of which are described hereinafter.
A system, method and apparatus is provided for a feed spam filter. The specific embodiments described in this document represent examples or embodiments of the present invention, and are illustrative in nature rather than restrictive.
In the following description, for purposes of explanation, numerous specific details are set forth in order to provide a thorough understanding of the invention. It will be apparent, however, to one skilled in the art that the invention can be practiced without these specific details. In other instances, structures and devices are shown in block diagram form in order to avoid obscuring the invention.
Reference in the specification to "one embodiment" or "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment is included in at least one embodiment of the invention. The appearances of the phrase "in one embodiment" in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. Features and aspects of various embodiments may be integrated into other embodiments, and embodiments illustrated in this document may be implemented without all of the features or aspects illustrated or described.
The feed spam filter builds on the basic Bayesian filtering technique, yet it differs from current spam solutions as it incorporates a complex collection of features that identify spam especially in feeds. The feed spam filter builds on well described approaches to minimize the integration into a search index of irrelevant and somewhat deceitful use of internet content feeds for public consumption. The approach recognizes that feeds are more rich in content than similar email filtering technologies, and utilizes the structure of the internet as well as the HTML that makes up a web site in order to identify feeds that are not desirable for a public search engine to be indexing. The filter contains context based flexibilities, threshold based flexibilities and multi-environment flexibilities to make it useful and applicable to a range of tasks for a feed search engine or similar system.
The basic design of the spam filter involves a Bayesian Engine used to:
• receive an XML based feed and feed origin web site information
• initialize a range of probes that run specific tests on the information received
• run each of the probes in turn, collecting a probability score based on analysis and weightings
• apply feed weightings
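The basic design listed above can be sketched in a few lines. This is a minimal illustration only: the `Probe` type, the two example probes and the weighted-average combination are assumptions standing in for the Bayesian engine, not details taken from this description.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Probe:
    """A hypothetical probe: a named test returning a spam probability in [0, 1]."""
    name: str
    test: Callable[[Dict[str, str]], float]
    weight: float = 1.0


def run_probes(feed: Dict[str, str], probes: List[Probe]) -> float:
    """Run each probe in turn and combine the scores as a weighted average.

    The weighted average is an assumed stand-in for the Bayes evaluator.
    """
    total_weight = sum(p.weight for p in probes)
    if total_weight == 0:
        return 0.0
    return sum(p.weight * p.test(feed) for p in probes) / total_weight


# Two illustrative probes with made-up rules (not from the source).
keyword_probe = Probe(
    "keyword", lambda f: 1.0 if "casino" in f["text"].lower() else 0.0, weight=2.0
)
url_probe = Probe("url", lambda f: 1.0 if f["url"].count("-") > 3 else 0.0)

score = run_probes(
    {"url": "http://example.com/feed.xml", "text": "Best casino bonus"},
    [keyword_probe, url_probe],
)
```

Because the probes are plain values in a list, they can be swapped in and out or reweighted without changing the engine itself.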
The design can be enhanced to produce a Bayesian Net arrangement of the probes, in which:
• probes can be run in a particular order
• probes can be run in hierarchical fashion
• the result of one probe can influence the analysis of another probe
• results of previous feed analysis can affect present analysis
• historical results can affect current results
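One way to picture such an arrangement is to run the probes in a fixed order and hand each probe the results gathered so far. The sketch below assumes this simple ordered form; the probe rules, the 0.5 trigger level and the 0.3 bump are hypothetical values chosen for illustration.

```python
from typing import Callable, Dict, List, Tuple

# A probe takes the feed plus the results of earlier probes and returns a score.
ProbeFn = Callable[[Dict[str, str], Dict[str, float]], float]


def run_ordered(
    feed: Dict[str, str], probes: List[Tuple[str, ProbeFn]]
) -> Dict[str, float]:
    """Run probes in list order, passing earlier results to later probes."""
    results: Dict[str, float] = {}
    for name, fn in probes:
        results[name] = fn(feed, results)
    return results


def keyword_probe(feed: Dict[str, str], earlier: Dict[str, float]) -> float:
    return 0.8 if "cheap" in feed["text"].lower() else 0.1


def url_probe(feed: Dict[str, str], earlier: Dict[str, float]) -> float:
    base = 0.5 if "cheap" in feed["url"].lower() else 0.1
    # Hypothetical dependency: a high keyword score raises suspicion of the URL.
    if earlier.get("keyword", 0.0) > 0.5:
        return min(1.0, base + 0.3)
    return base


results = run_ordered(
    {"url": "http://cheap.example.com/cheap/", "text": "Cheap pills"},
    [("keyword", keyword_probe), ("url", url_probe)],
)
```

Historical results could be fed in the same way, by seeding the `results` dictionary before the first probe runs.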
The spam filter is designed to work in multiple environments. The filter can be incorporated into the crawler architecture of a feed search engine, but can also run in a standalone mode. If environment-specific adapters are provided, the spam filter can take input from:
• historical tables
• feed streams
• 3rd party data stores and other similar sources of data, for example.
The activities of the spam filter in multiple tasks may include crawler feed filtering, historical URL filtering and providing diagnostics for blacklisting. The filter may take information in a range of forms.
Filtering spam from feeds differs from filtering email because:
• Feeds have many more marked up fields or tags to analyze in context (e.g. Author, title, publishDate, link).
• Feeds are linked to the same web site which produces them, but their content changes often.
• The originating web site is the key to the feed, whereas email spam generators are not website owners and are trying to spoof identity to obtain recipients' details.
• Emails are not generally searched through on a public forum or visible to the internet viewing public as a whole.
• Email spam is generally targeted around a subject, whereas weblog based spam is about enticing the user to click on the site, not necessarily to be engaged by the content on the site.
One example of a feed spam filter applies a number of probes to a feed to determine if spam is present. FIG. 9 illustrates an embodiment of a feed spam filter. Feed spam filter 1100 includes a basic Bayes evaluator, a feed word hash tokenizer, a crawler input, and a series of probes to be applied to a feed. Crawler input 1110 is an input to the feed spam filter, providing new feed data. Bayes evaluator 1120 evaluates results of probes 1150-1180, and can work with various different probes as needed - allowing for probes to be swapped in and out or activated and deactivated as necessary. Feed word hash tokenizer 1130 tokenizes data from a feed for easier processing.
Probes 1150-1180 probe the feed data by performing various tests on the feed data or related information. Thus, keyword probe 1150 may probe for keywords in the feed. URL probe 1155 may probe a URL provided as a source of the feed. Feedster on probe 1160 may probe whether a search engine is operating, and thus provide an indication of whether a failure has occurred. Zipf probe 1165 may probe whether the data in the feed fits a statistical model of other spam feeds. Uncommon probe 1170 may similarly probe whether a feed uses uncommon words and thus may indicate spam. Photoblog probe 1175 may probe whether a high proportion of the data in the feed is images rather than words, for example. Reblogger probe 1180 may probe whether the data looks more like a blog or like advertising, for example. More detail is provided later on various types of feed probes, as well.
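As an illustration of how small an individual probe can be, the following sketch estimates the proportion of images to words in a fragment of feed HTML, in the spirit of the photoblog probe. The function name and the regular-expression approach are assumptions; a production probe would more likely use a proper HTML parser.

```python
import re


def image_ratio(html: str) -> float:
    """Return images / (images + words) for an HTML fragment."""
    # Count opening <img> tags, ignoring case.
    images = len(re.findall(r"<img\b", html, flags=re.IGNORECASE))
    # Strip all tags, then count the remaining words.
    text = re.sub(r"<[^>]+>", " ", html)
    words = len(re.findall(r"\w+", text))
    total = images + words
    return images / total if total else 0.0


ratio = image_ratio('<p>Hello world</p><img src="a.jpg"><img src="b.jpg">')
```

A ratio near 1 indicates a feed that is mostly images, which the evaluator can weigh according to the type of feed.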
While a straightforward application of a number of probes may be appropriate in some instances, a more flexible approach may be useful, too. FIG. 10 illustrates another embodiment of a feed spam filter. Feed spam filter 1200 includes sources of data, a Bayes evaluator, a tokenizer, a language library, and a network of probes. Crawler 1205 and delta indexer 1210 provide data sources. Crawler 1205 provides updated feed data from crawling the web. Delta indexer 1210 provides updated data from feed updates received based on a difference (delta) between old and new feed data. Bayes evaluator 1225 provides the feed data to tokenizer 1220 for processing and provides the tokenized data to probes along with data from language libraries 1215 to allow for proper processing of feed data.
Various probes are applied, and results from one probe may feed into another or trigger another probe. Many of these probes have been described in a similar embodiment in FIG. 9, and further discussion on these probes is provided below. Thus, URL probe 1265, absolute spam test 1235 and keyword probe 1260 are applied immediately. Absolute spam test 1235 may be a simple test for basic spam signals under industry standards, for example.
The probe network may then branch out. Thus, reblogger test 1240 and zipf probe 1230 may be applied. Similarly, photoblog test 1245 and feedster on test 1275 may be applied. Moreover, uncommon word test 1270 may also be applied. These tests may feed data to other tests and affect other tests and probes as illustrated within the network. Additionally, additional test 1250 and additional test 1255 represent expansion or optional tests which may be incorporated, either to adapt to new spam techniques or to accommodate specific types of feeds, for example.
One method of analysis is key term analysis. The spam filter looks for 'unnatural' distributions of words in a feed. Results are generally more reliable for larger feeds (more words) so a weighting formula may be applied to revenue generating terms and hotlist terms.
The spam filter pays attention to terms that are used by spammers to generate revenue. The terms are weighted according to currently published rates that search engines will pay a site when an advertisement on the site, placed there by the search engine, is followed by the browser user. A list of terms specified by a user can also be used to trip the spam filter. These terms can be altered dynamically.
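The rate-based weighting of paying terms might look like the sketch below. The term list and per-click rates are invented for illustration, and the normalization (each term's rate divided by the highest rate, averaged over all words) is an assumption rather than the weighting formula of any embodiment.

```python
import re
from typing import Dict

# Hypothetical per-click rates (in dollars) for revenue-generating terms.
PAYING_TERMS: Dict[str, float] = {"mortgage": 5.0, "insurance": 4.0, "casino": 3.0}


def paying_term_score(text: str, rates: Dict[str, float] = PAYING_TERMS) -> float:
    """Return a rate-weighted density of paying terms, in [0, 1]."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    if not words or not rates:
        return 0.0
    max_rate = max(rates.values())
    if max_rate <= 0:
        return 0.0
    # Each word contributes its rate relative to the best-paying term.
    weighted = sum(rates.get(word, 0.0) / max_rate for word in words)
    return weighted / len(words)


score = paying_term_score("mortgage mortgage news today")
```

Passing a different `rates` dictionary at call time is one simple way to let the term list be altered dynamically, including a user-specified list.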
Another method of analysis is URL analysis. URLs for spam tend to follow the 'term loaded' pattern that highlights subdomains matching document folders. This arises because spammers generate these sites rather than creating them manually, so that a central key term is distributed throughout the site and the URL for that site.
Another form of analysis is reblogging analysis. A common way to generate many spam weblogs en masse is to capture feeds from other sites and redisplay the information as the spammer's own. This activity may be referred to as ReBlogging. The spam filter uses a search engine to establish whether the majority of posts on a blog are in fact owned by the author or come from another feed or site.
Yet another form of analysis is feed origin website analysis. The layout of the original site of a feed can provide more clues in terms of spam than the feed can. The feed itself is syndicated content from the site, whereas the site itself, with its layout, advertising content and link structure providing navigation to other sites, is a rich source of information. Combining key term results with site structure, layout and markup can provide powerful analysis not available in the email sphere.
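Returning to the URL analysis described above, a check for the 'term loaded' pattern can be sketched as a comparison between subdomain labels and path folders. The function name and the rule of treating every host label except the last two as a subdomain are simplifying assumptions.

```python
from urllib.parse import urlparse


def term_loaded(url: str) -> bool:
    """Return True if any subdomain label also appears as a path folder."""
    parsed = urlparse(url)
    host = parsed.hostname or ""
    # Treat every label except the last two as a subdomain (a simplification
    # that ignores multi-part suffixes such as "co.uk").
    labels = host.lower().split(".")[:-2]
    subdomains = {label for label in labels if label != "www"}
    folders = {part.lower() for part in parsed.path.split("/") if part}
    return bool(subdomains & folders)


suspicious = term_loaded("http://cheap-pills.example.com/cheap-pills/index.html")
ordinary = term_loaded("http://www.example.com/blog/feed.xml")
```

A match alone is weak evidence, so in practice the result would be one input among several to the evaluator rather than a verdict.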
Yet another analysis option is tag based weighting. Certain feed fields provide places where spammers can load inappropriate terms into the feed so that search engines choose these feeds for display. The spam filter examines information from each feed field and places extra emphasis on certain fields that are important in the context of the analysis.
Overall feed weightings may also provide analysis opportunities. When considering natural language analysis, feeds fall short in word count compared to the standard corpus of literature that mainstream natural language processing algorithms typically use. In one embodiment, for feeds with fewer than 100 words, results are typically unreliable for key term analysis, whereas feeds of more than around 300 words are unreliable for 'paying key term analysis' as previously mentioned. There are weightings in each probe to deal with this, but overall the most reliable size for a feed to return reliable spam results is about 250 words in such an embodiment. The formula used to adjust the final spam score in such an embodiment favors feeds with more words, as per the following formula, which applies to feeds over 75 words:
$\text{adjustment} = -15 / \text{wordcount}^{1.3}$
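The adjustment can be computed directly. The sketch below reads the formula as -15 divided by the word count raised to the power 1.3, which is an interpretation of the text; returning no adjustment for feeds of 75 words or fewer is likewise an assumption, since the formula is stated only for feeds over 75 words.

```python
def word_count_adjustment(word_count: int) -> float:
    """Return the score adjustment -15 / word_count**1.3 for feeds over 75 words."""
    if word_count <= 75:
        # Assumption: the source gives no adjustment for shorter feeds.
        return 0.0
    return -15.0 / (word_count ** 1.3)


small = word_count_adjustment(100)
large = word_count_adjustment(1000)
```

Under this reading the adjustment is always negative and shrinks toward zero as the word count grows, so its effect on the final score is largest just above the 75-word cutoff.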
This formula is potentially open to training, meaning it may be tweaked to match human based feedback or software feedback on spam.
Post activity may also allow for analysis of feeds. Spammers in the feed world post often and often change the content of the feed on the fly (alternating title text, for example) to confuse spam filters. Random monitoring of feed activity that integrates with the Bayesian net filter will potentially flag sites where this type of activity occurs.
Client-based spam thresholding may also be used to adjust the spam filter — a score may meet a flexible limit rather than a hard-coded threshold to indicate a feed is spam. As a Bayesian filter the spam filter returns a raw score, usually ranging between, but not limited to the 0-1 range. Hence the threshold at which a feed is considered spam is configurable. Client specific and context specific threshold setting is possible based on what the feed is about, how popular it is, topical world events, topical blogosphere events and user feedback. This allows for user-based customization, among other options.
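A configurable threshold of this kind can be sketched as a lookup keyed by context. The context names, the default of 0.5 and the "greater than or equal" comparison are assumptions made for illustration.

```python
from typing import Dict

DEFAULT_THRESHOLD = 0.5  # Hypothetical default; the source fixes no value.


def is_spam(score: float, context: str, thresholds: Dict[str, float]) -> bool:
    """Compare a raw score with the threshold configured for this context."""
    return score >= thresholds.get(context, DEFAULT_THRESHOLD)


# Example: a more tolerant threshold for an assumed "real_estate" context.
thresholds = {"real_estate": 0.8}
lenient = is_spam(0.7, "real_estate", thresholds)
strict = is_spam(0.7, "literary", thresholds)
```

Because the raw score is not clamped, a score above 1 still compares correctly against any configured threshold.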
Similarly, context-based flexibility in scoring may be employed, such that the context of a feed may result in a higher or lower threshold being applied. In one embodiment, the spam filter is set to run the standard set of available probes against incoming and historical feeds. The design of the filter is such that the context of the feed could provide the setting for a pre-designated range of probes to run. Hence the filter can be 'smart' in knowing which types of feeds are prone to particular types of spam.
Additional analysis may be based on phrase or semantic analysis. This may be implemented in a separate stage of spam filtering. Alternatively, this may be implemented as a probe or set of probes, along with other types of analysis. In such an instance, the probes may then have relationships with other probes and types of analysis previously discussed.
A spam filter may thus be understood to accept some basic inputs and provide an indication of whether a feed is spam. FIG. 11 illustrates an embodiment of a feed spam filter with inputs and outputs. System 1300 includes a spam filter 1310 and associated inputs and outputs. Filter 1310 accepts as input a feed url, a feed type hash, and plain text or other feed data, for example. Filter 1310 then may provide a score (scaled between 0 and 1 in some embodiments) indicating the level of spam in a feed or in a binary sense whether a feed is or is not spam. Thus, the score may provide an indication of whether to exclude the feed or not. In some embodiments, or in some circumstances, a feed may provoke such extreme reactions from various probes that the score produced is outside the expected range. The Bayesian filter may be expected to sum results from various probes in a predetermined way, but need not be constrained to exactly meet an expected score range. One may expect that the Bayesian filter implemented may adapt to inputs and results over time, through machine-learning techniques for example, along with external feedback.
A system using a spam filter may involve a crawler to gather feed data, a spam filter, a user interface, and a repository, among other components. FIG. 12 illustrates an embodiment of a system including a feed spam filter. System 1400 includes a crawler, spam filter, spam filter user interface, crawler spam marker, database cleanser, spam blacklister, and a database or repository.
Crawler 1430 may crawl the world wide web seeking feeds and updated feeds. Spam filter 1460 may receive feed data from crawler 1430 and provide an indication of whether a feed is spam or not. Spam management user interface 1470 may be used by a user to provide feedback on whether identified spam is actually spam, with spam marked or unmarked, filter data presented, and blacklisting facilities provided, for example.
Crawler adaptor 1420 may mark spam within a database 1410 based on results from spam filter 1460 and spam management user interface 1470. Database cleanser 1450 may then cleanse database 1410 of marked spam, and may also cleanse database 1410 of spam based on results from spam filter 1460, whether the indicated spam is marked or not. Moreover, blacklister 1440 may query database 1410 for data fitting blacklist parameters, and may also compare data of database 1410 with known blacklisted data to present blacklist candidates in the user interface 1470.
The basic process of a spam filter in a feed search engine or similar facility for processing feed data may be useful to understand. FIG. 13 illustrates an embodiment of a process of filtering spam feeds. Process 1500 includes receiving a feed update, processing feed data through a spam filter, evaluating results of the spam filter, determining if the data is ok, either passing the data along or flagging the feed as spam, and feeding results back. Process 1500 and other processes of this document are implemented as a set of modules, which may be process modules or operations, software modules with associated functions or effects, hardware modules designed to fulfill the process operations, or some combination of the various types of modules, for example. The modules of process 1500 and other processes described herein may be rearranged, such as in a parallel or serial fashion, and may be reordered, combined, or subdivided in various embodiments.
Process 1500 initiates with receipt of a feed update at module 1510. This may be a true update, or may be the first data received for a feed, for example. The data is processed through the spam filter at module 1520. At module 1530, results of the spam filter are evaluated. This may include looking up a spam threshold score for a type of feed, and comparing the threshold to the actual score, for example. If the data is ok (the score compares favorably to the threshold), then at module 1550 the data is passed along for storage in a repository, searching and display. If the score compares unfavorably, then the data is flagged as spam at module 1560. Either way, the results are fed back into the system at module 1570 and the process repeats. Note that such feedback may include automatic feedback based on the process results, and user feedback such as modifications to results, review of test results, and addition or subtraction from a black list of spam feeds, for example.
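The flow of process 1500 can be sketched as a single function handling one feed update. The function signature, the use of plain lists as repository, flag store and feedback log, and the rule that a score below the threshold counts as favorable are all assumptions for illustration.

```python
from typing import Callable, Dict, List


def process_update(
    feed: Dict[str, str],
    spam_filter: Callable[[Dict[str, str]], float],
    threshold: float,
    repository: List[Dict[str, str]],
    flagged: List[Dict[str, str]],
    feedback_log: List[Dict[str, object]],
) -> bool:
    """Handle one feed update; return True if the feed was passed along."""
    score = spam_filter(feed)        # module 1520: run the spam filter
    ok = score < threshold           # module 1530: evaluate against threshold
    if ok:
        repository.append(feed)      # module 1550: store for search and display
    else:
        flagged.append(feed)         # module 1560: flag the feed as spam
    # module 1570: record the outcome so it can be fed back into the system
    feedback_log.append({"url": feed["url"], "score": score, "ok": ok})
    return ok


repo, spam, log = [], [], []
passed = process_update(
    {"url": "http://example.com/feed.xml", "text": "hello"},
    lambda f: 0.2, 0.5, repo, spam, log,
)
```

Calling the function once per received update mirrors the way the process repeats after each round of feedback.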
Another illustration of the process of the actual spam filter may provide further understanding. FIG. 14 illustrates another embodiment of a process of filtering spam feeds. Process 1600 includes receiving a feed update, determining a type of feed, running probes against the feed, scoring results, receiving feedback, and adjusting probes. Thus, the spam filter may be in a more or less continuous process of evaluating and adjusting, while feeds update on an asynchronous basis.
Data is received at module 1610, such as when a feed updates and either is found with new data or pushes new data, for example. The type of feed is evaluated at module 1620, such that appropriate probes may be run against the feed or appropriate tests may be performed. At module 1630, the actual probes are run, potentially with internal probe adjustments for the type of feed as well, and also with any potential probe interchange occurring. At module 1640, results of the probes are scored through a Bayesian evaluator. At module 1650, feedback is received based on the evaluation and the probe results. This may be optional - there may be no feedback. Such feedback may be automatic within the system or user-generated. Such feedback may then result in adjustments to the probes (or scoring methodology) at module 1660.
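The adjustment step at module 1660 could take many forms. As one purely hypothetical example, a probe's weight might be nudged up when the probe agreed with feedback and down when it did not; the 0.5 cutoff and the 10% rate below are invented values, not parameters from this description.

```python
def adjust_weight(
    weight: float, probe_score: float, is_spam: bool, rate: float = 0.1
) -> float:
    """Nudge a probe weight up if the probe agreed with feedback, else down."""
    # The probe "votes" spam when its score reaches the assumed 0.5 cutoff.
    agreed = (probe_score >= 0.5) == is_spam
    new_weight = weight * (1.0 + rate) if agreed else weight * (1.0 - rate)
    return max(new_weight, 0.0)


rewarded = adjust_weight(1.0, 0.9, is_spam=True)
penalized = adjust_weight(1.0, 0.9, is_spam=False)
```

Storing the resulting weights outside the code, in a configurable datastore, would let them change as feedback accumulates.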
Note that feedback at module 1660 of process 1600 or at module 1570 of process 1500 can take on a variety of forms, and allows the system to implement a machine-learning approach to improvement, among other adaptations. Thus, feedback may result in changes in the values of equations discussed within this document, tuning the approach of various probes in the process. Moreover, the structure of the probes and linkages may change as a result of feedback, such that non-linear structures may be achieved, and relationships between probes may become more complicated over time. Ultimately, the feedback is likely to make changes over time, but to continue to provide a filter which attempts to evaluate factors of a complex, non-linear relationship between word count, content, and form of a feed to determine the presence of spam.
Another example of operation of a feed spam filter may be illuminating. A feed is received by a feed search engine and the content is determined to be a Real Estate feed. This could be achieved in a number of ways and does not necessarily pertain to the spam filter. The spam filter would know through an ontology hierarchy that Real Estate feeds tend to have lots of pictures, very little text and particular key term loadings. Hence probes that compensate for these characteristics should be used, and perhaps the threshold for spam may differ from that for other contexts (e.g. a literary weblog). Specialist client specific probes could be built and run for specific feed search engine products that deal with contexts and feeds specific to the client's realm of interest.
One may expect that the spam filter will involve some form of machine learning, potentially based on human user and ontology based training, for example. The weightings and structure of the spam filter may (and probably should) be adjusted to concur with real world experience. Hence it is potentially appropriate that the factors that drive all the features described above are able to be adjusted based on verification feedback that is provided to the spam filter. This means that the weightings and constants involved in formulae must be stored in a configurable datastore so these figures can change dynamically based on verification feedback from both human and software driven sources.
The following description of FIGs. 15-16 is intended to provide an overview of device hardware and other operating components suitable for performing the methods of the invention described above and hereafter, but is not intended to limit the applicable environments. Similarly, the hardware and other operating components may be suitable as part of the apparatuses described above. The invention can be practiced with other system configurations, including personal computers, multiprocessor systems, microprocessor-based or programmable consumer electronics, network PCs, minicomputers, mainframe computers, and the like. The invention can also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network.
FIG. 15 illustrates an embodiment of a system or network in which a feed spam filter may operate. FIG. 16 illustrates an embodiment of a system which may operate with a feed spam filter.
FIG. 15 shows several computer systems that are coupled together through a network 1705, such as the internet, along with a cellular or other wireless network and related cellular or other wireless devices. The term "internet" as used herein refers to a network of networks which uses certain protocols, such as the TCP/IP protocol, and possibly other protocols such as the hypertext transfer protocol (HTTP) for hypertext markup language (HTML) documents that make up the world wide web (web). The physical connections of the internet and the protocols and communication procedures of the internet are well known to those of skill in the art.
Access to the internet 1705 is typically provided by internet service providers (ISP), such as the ISPs 1710 and 1715. Users on client systems, such as client computer systems 1730, 1750, and 1760 obtain access to the internet through the internet service providers, such as ISPs 1710 and 1715. Access to the internet allows users of the client computer systems to exchange information, receive and send e-mails, and view documents, such as documents which have been prepared in the HTML format. These documents are often provided by web servers, such as web server 1720 which is considered to be "on" the internet. Often these web servers are provided by the ISPs, such as ISP 1710, although a computer system can be set up and connected to the internet without that system also being an ISP.
The web server 1720 is typically at least one computer system which operates as a server computer system and is configured to operate with the protocols of the world wide web and is coupled to the internet. Optionally, the web server 1720 can be part of an ISP which provides access to the internet for client systems. The web server 1720 is shown coupled to the server computer system 1725 which itself is coupled to web content 1795, which can be considered a form of a media database. While two computer systems 1720 and 1725 are shown in FIG. 15, the web server system 1720 and the server computer system 1725 can be one computer system having different software components providing the web server functionality and the server functionality provided by the server computer system 1725 which will be described further below.
Cellular network interface 1743 provides an interface between a cellular network and corresponding cellular devices 1744, 1746 and 1748 on one side, and network 1705 on the other side. Thus cellular devices 1744, 1746 and 1748, which may be personal devices including cellular telephones, two-way pagers, personal digital assistants or other similar devices, may connect with network 1705 and exchange information such as email, content, or HTTP-formatted data, for example.
Cellular network interface 1743 is representative of wireless networking in general. In various embodiments, such an interface may also be implemented as a wireless interface such as a Bluetooth interface, IEEE 802.11 interface, or some other form of wireless network. Similarly, devices such as devices 1744, 1746 and 1748 may be implemented to communicate via the Bluetooth or 802.11 protocols, for example. Other dedicated wireless networks may also be implemented in a similar fashion. Cellular network interface 1743 is coupled to computer 1740, which communicates with network 1705 through modem interface 1745. Computer 1740 may be a personal computer, server computer or the like, and serves as a gateway. Thus, computer 1740 may be similar to client computers 1750 and 1760 or to gateway computer 1775, for example. Software or content may then be uploaded or downloaded through the connection provided by interface 1743, computer 1740 and modem 1745.
Client computer systems 1730, 1750, and 1760 can each, with the appropriate web browsing software, view HTML pages provided by the web server 1720. The ISP 1710 provides internet connectivity to the client computer system 1730 through the modem interface 1735 which can be considered part of the client computer system 1730. The client computer system can be a personal computer system, a network computer, a web tv system, or other such computer system.
Similarly, the ISP 1715 provides internet connectivity for client systems 1750 and 1760, although as shown in FIG. 15, the connections are not the same as for more directly connected computer systems. Client computer systems 1750 and 1760 are part of a LAN coupled through a gateway computer 1775. While FIG. 15 shows the interfaces 1735 and 1745 generically as a "modem," each of these interfaces can be an analog modem, ISDN modem, cable modem, satellite transmission interface (e.g. "direct PC"), or other interfaces for coupling a computer system to other computer systems.
Client computer systems 1750 and 1760 are coupled to a LAN 1770 through network interfaces 1755 and 1765, which can be ethernet network or other network interfaces. The LAN 1770 is also coupled to a gateway computer system 1775 which can provide firewall and other internet related services for the local area network. This gateway computer system 1775 is coupled to the ISP 1715 to provide internet connectivity to the client computer systems 1750 and 1760. The gateway computer system 1775 can be a conventional server computer system. Also, the web server system 1720 can be a conventional server computer system.
Alternatively, a server computer system 1780 can be directly coupled to the LAN 1770 through a network interface 1785 to provide files 1790 and other services to the clients 1750, 1760, without the need to connect to the internet through the gateway system 1775. FIG. 16 shows one example of a personal device that can be used as a cellular telephone (1744, 1746 or 1748) or similar personal device, or may be used as a more conventional personal computer, as an embedded processor or local console, or as a PDA, for example. Such a device can be used to perform many functions depending on implementation, such as monitoring functions, user interface functions, telephone communications, two-way pager communications, personal organizing, or similar functions. The system 1800 of FIG. 16 may also be used to implement other devices such as a personal computer, network computer, or other similar systems. The computer system 1800 interfaces to external systems through the communications interface 1820. In a cellular telephone, this interface is typically a radio interface for communication with a cellular network, and may also include some form of cabled interface for use with an immediately available personal computer. In a two-way pager, the communications interface 820 is typically a radio interface for communication with a data transmission network, but may similarly include a cabled or cradled interface as well. In a personal digital assistant, communications interface 1820 typically includes a cradled or cabled interface, and may also include some form of radio interface such as a Bluetooth or 802.11 interface, or a cellular radio interface for example. The computer system 1800 includes a processor 1810, which can be a conventional microprocessor such as an Intel Pentium microprocessor or Motorola power PC microprocessor, a Texas Instruments digital signal processor, or some combination of the various types or processors. 
Memory 1840 is coupled to the processor 1810 by a bus 1870. Memory 1840 can be dynamic random access memory (DRAM) and can also include static RAM (SRAM), and may also include FLASH EEPROM. The bus 1870 couples the processor 1810 to the memory 1840, to non-volatile storage 1850, to display controller 1830, and to the input/output (I/O) controller 1860. Note that the display controller 1830 and I/O controller 1860 may be integrated together, and the display may also provide input.
The display controller 1830 controls in the conventional manner a display on a display device 1835 which typically is a liquid crystal display (LCD) or similar flat-panel, small form factor display. The input/output devices 1855 can include a keyboard, or stylus and touchscreen, and may sometimes be extended to include disk drives, printers, a scanner, and other input and output devices, including a mouse or other pointing device. The display controller 1830 and the I/O controller 1860 can be implemented with conventional well known technology. A digital image input device 1865 can be a digital camera which is coupled to an I/O controller 1860 in order to allow images from the digital camera to be input into the device 1800.
The non-volatile storage 1850 is often a FLASH memory or read-only memory, or some combination of the two. A magnetic hard disk, an optical disk, or another form of storage for large amounts of data may also be used in some embodiments, though the form factors for such devices typically preclude installation as a permanent component of the device 1800.
Rather, a mass storage device on another computer is typically used in conjunction with the more limited storage of the device 1800. Some of this data is often written, by a direct memory access process, into memory 1840 during execution of software in the device 1800. One of skill in the art will immediately recognize that the terms "machine-readable medium" or "computer-readable medium" include any type of storage device that is accessible by the processor 1810 and also encompass a carrier wave that encodes a data signal.
The device 1800 is one example of many possible devices which have different architectures. For example, devices based on an Intel microprocessor often have multiple buses, one of which can be an input/output (I/O) bus for the peripherals and one that directly connects the processor 1810 and the memory 1840 (often referred to as a memory bus). The buses are connected together through bridge components that perform any necessary translation due to differing bus protocols.
In addition, the device 1800 is controlled by operating system software which includes a file management system, such as a disk operating system, which is part of the operating system software. One example of an operating system software with its associated file management system software is the family of operating systems known as Windows CE® and Windows® from Microsoft Corporation of Redmond, Washington, and their associated file management systems. Another example of an operating system software with its associated file management system software is the Palm® operating system and its associated file management system. The file management system is typically stored in the non-volatile storage 1850 and causes the processor 1810 to execute the various acts required by the operating system to input and output data and to store data in memory, including storing files on the non-volatile storage 1850. Other operating systems may be provided by makers of devices, and those operating systems typically will have device-specific features which are not part of similar operating systems on similar devices. Similarly, WinCE® or Palm® operating systems may be adapted to specific devices for specific device capabilities.
Device 1800 may be integrated onto a single chip or set of chips in some embodiments, and typically is fitted into a small form factor for use as a personal device. Thus, it is not uncommon for a processor, bus, onboard memory, and display/I/O controllers to all be integrated onto a single chip. Alternatively, functions may be split into several chips with point-to-point interconnection, causing the bus to be logically apparent but not physically obvious from inspection of either the actual device or related schematics.
It may be appreciated in light of the description provided here that the above-described computer system may advantageously provide a particular spam filter system that provides a method for filtering spam in a feed, the method using a Bayesian filtering technique and characterized in that: it incorporates a complex collection of features that identify spam in feeds and minimizes the integration into a search index of irrelevant and somewhat deceitful use of internet content feeds for public consumption by recognizing that feeds are more rich in content than similar email filtering technologies, and utilizes the structure of the internet as well as the HTML that makes up a web site in order to identify feeds that are not desirable for a public search engine to be indexing, and the filter and filtering method contains context based flexibilities, threshold based flexibilities and multi-environment flexibilities to make the filter and filtering method useful and applicable to a range of tasks for a feed search engine or similar system.
Aspects of the invention may also independently provide a method for filtering spam in a feed, the method using a Bayesian filtering technique and characterized in that: it incorporates a complex collection of features that identify spam in feeds and minimizes the integration into a search index of irrelevant and somewhat deceitful use of internet content feeds for public consumption by recognizing that feeds are more rich in content than similar email filtering technologies, and utilizes the structure of the internet as well as the HTML that makes up a web site in order to identify feeds that are not desirable for a public search engine to be indexing, and the filter and filtering method contains context based flexibilities, threshold based flexibilities and multi-environment flexibilities to make the filter and filtering method useful and applicable to a range of tasks for a feed search engine or similar system.
Embodiments of the invention may also provide a spam feed filter for filtering spam in a feed, the filter using a Bayesian filtering technique and characterized in that: it incorporates a complex collection of features that identify spam in feeds and minimizes the integration into a search index of irrelevant and somewhat deceitful use of internet content feeds for public consumption by recognizing that feeds are more rich in content than similar email filtering technologies, and utilizes the structure of the internet as well as the HTML that makes up a web site in order to identify feeds that are not desirable for a public search engine to be indexing, and the filter and filtering method contains context based flexibilities, threshold based flexibilities and multi-environment flexibilities to make the filter and filtering method useful and applicable to a range of tasks for a feed search engine or similar system.
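By way of illustration only, a Bayesian filter of the general kind described above, with an adjustable threshold and features drawn from both the text and the HTML structure of a feed entry, might be sketched as follows. This is a simplified naive Bayes sketch and not the filter of any particular embodiment: the class name `FeedSpamFilter`, the link-count feature, the tokenization, and the default threshold of 0.9 are all assumptions introduced for the example, and only features present in an entry are scored.

```python
import math
import re
from collections import Counter


class FeedSpamFilter:
    """Toy naive Bayes classifier over feed-entry features (hypothetical names)."""

    def __init__(self, threshold=0.9):
        # Spam probability at or above which an entry is rejected (assumed default).
        self.threshold = threshold
        self.counts = {"spam": Counter(), "ham": Counter()}
        self.docs = {"spam": 0, "ham": 0}

    @staticmethod
    def features(html):
        # One structural feature from the markup plus word features from the text.
        links = len(re.findall(r"<a\s", html, flags=re.I))
        text = re.sub(r"<[^>]+>", " ", html).lower()
        feats = set(re.findall(r"[a-z0-9']+", text))
        feats.add("__many_links__" if links >= 3 else "__few_links__")
        return feats

    def train(self, html, label):
        # label is "spam" or "ham"; each feature is counted once per entry.
        self.docs[label] += 1
        self.counts[label].update(self.features(html))

    def spam_probability(self, html):
        total = self.docs["spam"] + self.docs["ham"]
        scores = {}
        for label in ("spam", "ham"):
            # Laplace-smoothed log prior for the class.
            score = math.log((self.docs[label] + 1) / (total + 2))
            for feat in self.features(html):
                # Laplace-smoothed likelihood of the feature given the class.
                score += math.log(
                    (self.counts[label][feat] + 1) / (self.docs[label] + 2)
                )
            scores[label] = score
        # Normalize the two log scores into a probability of spam.
        top = max(scores.values())
        spam = math.exp(scores["spam"] - top)
        ham = math.exp(scores["ham"] - top)
        return spam / (spam + ham)

    def is_spam(self, html):
        return self.spam_probability(html) >= self.threshold
```

In such a sketch, the threshold based flexibility corresponds to constructing the filter with a different `threshold` for different tasks, for example a stricter value when deciding what enters a public index than when merely flagging entries for review.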
Some portions of the detailed description are presented in terms of algorithms and symbolic representations of operations on data bits within a computer memory. These algorithmic descriptions and representations are the means used by those skilled in the data processing arts to most effectively convey the substance of their work to others skilled in the art. An algorithm is here, and generally, conceived to be a self-consistent sequence of operations leading to a desired result. The operations are those requiring physical manipulations of physical quantities. Usually, though not necessarily, these quantities take the form of electrical or magnetic signals capable of being stored, transferred, combined, compared, and otherwise manipulated. It has proven convenient at times, principally for reasons of common usage, to refer to these signals as bits, values, elements, symbols, characters, terms, numbers, or the like.
It should be borne in mind, however, that all of these and similar terms are to be associated with the appropriate physical quantities and are merely convenient labels applied to these quantities. Unless specifically stated otherwise as apparent from the following discussion, it is appreciated that throughout the description, discussions utilizing terms such as "processing" or "computing" or "calculating" or "determining" or "displaying" or the like, refer to the action and processes of a computer system, or similar electronic computing device, that manipulates and transforms data represented as physical (electronic) quantities within the computer system's registers and memories into other data similarly represented as physical quantities within the computer system memories or registers or other such information storage, transmission or display devices.
References to processes are understood to be performed in a computer having a processor and a memory coupled to the processor. These computers may be considered to be systems or subsystems or functional blocks depending upon the architecture of the overall system and for example, the distribution of functional responsibilities within the overall system. In some embodiments, the feed crawling may occur in or be performed by a network server or servers adapted to perform the feed crawling method.
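As a rough illustration of how such processes might cooperate on a server, the following sketch prioritizes a list of URLs, crawls them in parallel threads, and builds a searchable index from the results. It is a minimal sketch under stated assumptions: the function names, the least-recently-crawled ordering policy, the in-memory dictionary standing in for the database, and the injected `fetch` callable are all hypothetical and are not taken from the description.

```python
import re
from concurrent.futures import ThreadPoolExecutor


def prioritize(last_crawled):
    """Order URLs so the least recently crawled come first (one possible policy).

    last_crawled maps url -> timestamp of the last crawl (assumed input shape).
    """
    return sorted(last_crawled, key=lambda url: (last_crawled[url], url))


def crawl(urls, fetch, workers=4):
    """Fetch URLs in parallel threads; the returned dict stands in for the database."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves input order, so results line up with urls.
        return dict(zip(urls, pool.map(fetch, urls)))


def build_index(store):
    """Build a simple inverted index: word -> sorted list of URLs containing it."""
    index = {}
    for url, text in store.items():
        for word in set(re.findall(r"[a-z0-9]+", text.lower())):
            index.setdefault(word, []).append(url)
    return {word: sorted(found) for word, found in index.items()}
```

Passing `fetch` as a parameter keeps the sketch independent of any particular network library; a real crawler would supply a function that requests the feed at each URL, while a test can supply a lookup into canned content.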
The present invention, in some embodiments, also relates to apparatus for performing the operations herein. This apparatus may be specially constructed for the required purposes, or it may comprise a general purpose computer selectively activated or reconfigured by a computer program stored in the computer. Such a computer program may be stored in a computer readable storage medium, such as, but not limited to, any type of disk including floppy disks, optical disks, CD-ROMs, and magneto-optical disks, read-only memories (ROMs), random access memories (RAMs), EPROMs, EEPROMs, magnetic or optical cards, or any type of media suitable for storing electronic instructions, and each coupled to a computer system bus. The algorithms and displays presented herein are not inherently related to any particular computer or other apparatus. Various general purpose systems may be used with programs in accordance with the teachings herein, or it may prove convenient to construct more specialized apparatus to perform the required method steps. The required structure for a variety of these systems will appear from the description below. In addition, the present invention is not described with reference to any particular programming language, and various embodiments may thus be implemented using a variety of programming languages.
One skilled in the art will appreciate that although specific examples and embodiments of the system and methods have been described for purposes of illustration, various modifications can be made without deviating from the present invention. For example, embodiments of the present invention may be applied to many different types of databases, systems and application programs. Moreover, features of one embodiment may be incorporated into other embodiments, even where those features are not described together in a single embodiment within the present document.

Claims

What is claimed is:
1. A computer system for crawling content feeds, the computer system comprising: at least one processor for executing at least one process; a database providing a storage for storing location information or universal reference locators (urls); a first process for prioritizing a list of urls to be crawled; a parallelized crawler process for crawling the urls and storing the results in the database; and an indexing process for indexing the database for a user to search.
2. A computer system as in claim 1, wherein the first process, the parallelized process, and the indexing processes are performed on or within the same processor.
3. A computer system as in claim 1, wherein the at least one processor comprises a plurality of processors; and the first process, the parallelized process, and the indexing processes are performed on or within different ones of the plurality of processors.
4. A computer system as in claim 3, wherein the first process, the parallelized process, and the indexing process are each performed in a different process executed within a separate subsystem.
5. A computer system as in claim 1, wherein the parallelized crawler process comprises a plurality of child crawler processes.
6. A computer system as in claim 1, wherein the parallelized crawler process uses threading.
7. A computer system as in claim 1, wherein the first process is executed within a first process subsystem, and further comprises: a last indexed crawler functional block to identify the least recently indexed url; a fast index crawler functional block to identify the urls updated and not crawled; and a discovery crawler functional block to identify the urls not yet crawled.
8. A computer system as in claim 1, wherein the first process subsystem further comprises: a manual crawler functional block to identify the urls a user has submitted for crawling; a batch process functional block to output urls to the manual crawler functional block; and a ping crawler functional block to output urls to the manual crawler functional block.
9. A system for implementing a feed crawling system, the system comprising: a distributed database system for providing storage; a generic configurators system for creating a list of urls to be crawled; a job server system for prioritizing the list of urls to be crawled; a parallelized crawler system for crawling the urls and storing the results in the distributed database system; and an indexing system for indexing the database of urls for a user to search.
10. A computer system as in claim 9, wherein the parallelized crawler system comprises a plurality of child crawler processes.
11. A computer system as in claim 9, wherein the parallelized crawler system uses threading.
12. A computer system as in claim 9, wherein the generic configurators system further comprises: a last indexed crawler system to indicate the least recently indexed url; a fast index crawler system to indicate the urls updated and not crawled; and a discovery crawler system to indicate the urls not yet crawled.
13. A computer system as in claim 5, wherein the generic configurators system further comprises: a manual crawler system to indicate the urls a user has submitted for crawling; and a batch process system to output urls to the manual crawler system.
14. A computer system as in claim 5, wherein the generic configurators system further comprises: a search term system wherein a user inputs the user's search terms; a topical event system wherein the user's search topic is determined; an adaptive crawler system to provide urls to be crawled based on an output received from the search term system, and the topical event system.
15. A computer system as in claim 9, wherein the generic configurators system further comprises: a sequential ping crawler system to choose crawling priority based on a first-in-first-out model.
16. A computer system as in claim 9, wherein the generic configurators system further comprises: a probabilistic ping crawler system to choose crawling priority based on probability that a url will be searched for.
17. A computer system as in claim 9, wherein the job server system breaks up the lists of urls to be crawled into jobs to assign to the parallelized crawler system.
18. A computer system as in claim 9, wherein the job server system throttles the crawl frequency on a feed.
19. A computer system as in claim 9, wherein the job server system delays or stops crawling jobs based on popularity.
20. A computer system as in claim 9, wherein the job server system records statistical trends on search topics for advertising targeting.
21. A computer system as in claim 9, wherein the job server system further comprises: a spam filter system.
22. A computer system as in claim 5, wherein the job server system further comprises: a prioritizing system for determining the crawling order of a list of crawl jobs; the prioritizing system further comprising: a relevancy factor system for determining whether a url is relevant to a user's search term; a popular search term system for determining users' popular search terms; a trend analysis system for determining the emerging search trends; and a throttle adjustment system for adjusting crawl frequency on a feed.
23. A computer system as in claim 22, wherein the relevancy factor system assigns a relevancy value to each url in the list of urls to be crawled.
24. A computer system as in claim 9, wherein the job server system further comprises: a crawl job results system to collect crawl job results; a throttle adjustment system to adjust the frequency of crawling based on output from the crawl job results; a crawl statistics generation system to compile statistics information based on output from the crawl job results, and store the statistics in the distributed database system.
25. A computer system as in claim 24, wherein the crawl statistics generation system outputs crawl statistics to an ad server to provide advertising.
26. A computer system as in claim 9, wherein the job server system prioritizes crawl jobs based on data received from updated urls, and dispatches crawling jobs to the parallelized crawler system based on the priority list.
27. A computer system as in claim 9, wherein the job server system prioritizes crawl jobs based on environmental information received from searches conducted; dispatches crawling jobs to the parallelized crawler system based on the priority list; and receives responsive data from the parallelized crawler system, and stores the data in the distributed database system.
28. A computer system as in claim 9, wherein the parallelized crawler system: receives a url job, determines whether it is a spam (yes) or not a spam (no); if yes (spam), it skips the url, and returns the crawling results; if no (not spam), it requests data at the url, and determines whether the data is usable, if not usable, attempts to translate or transform the data into usable data; if unsuccessful at translating or transforming the data into a usable form, it skips the url, and returns crawling results; if the data is usable, it determines what type the url is, parses the data; filters the data for spam; categorizes the data based on a hierarchy; and returns the crawling results.
29. A computer system of claim 9, wherein the job server system prioritizes the list of urls to be crawled by receiving updated data from urls, or an update notification, or a user's search query.
30. A computer system of claim 9, wherein the job server system prioritizes the list of urls to be crawled based on frequency of searches for a topic and average expected number of crawled urls.
31. A computer system of claim 30, wherein if the average expected number of crawls is not crawled, the job server system finds more urls to be crawled.
32. A method for crawling content feeds, the method comprising: at least one processor for executing at least one process; providing a database providing a storage for storing location information or universal reference locators (urls); executing a first process for prioritizing a list of urls to be crawled; executing a parallelized crawler process for crawling the urls and storing the results in the database; and executing an indexing process for indexing the database for a user to search.
33. A computer program product stored on a computer readable media for causing a computer to execute a method for crawling content feeds, the method comprising: at least one processor for executing at least one process; providing a database providing a storage for storing location information or universal reference locators (urls); executing a first process for prioritizing a list of urls to be crawled; executing a parallelized crawler process for crawling the urls and storing the results in the database; and executing an indexing process for indexing the database for a user to search.
34. A computer system as in claim 21, wherein the spam filter system provides a method for filtering spam in a feed, the method using a Bayesian filtering technique and characterized in that: it incorporates a complex collection of features that identify spam in feeds and minimizes the integration into a search index of irrelevant and somewhat deceitful use of internet content feeds for public consumption by recognizing that feeds are more rich in content than similar email filtering technologies, and utilizes the structure of the internet as well as the HTML that makes up a web site in order to identify feeds that are not desirable for a public search engine to be indexing, and the filter and filtering method contains context based flexibilities, threshold based flexibilities and multi-environment flexibilities to make the filter and filtering method useful and applicable to a range of tasks for a feed search engine or similar system.
35. A method for filtering spam in a feed, the method using a Bayesian filtering technique and characterized in that: it incorporates a complex collection of features that identify spam in feeds and minimizes the integration into a search index of irrelevant and somewhat deceitful use of internet content feeds for public consumption by recognizing that feeds are more rich in content than similar email filtering technologies, and utilizes the structure of the internet as well as the HTML that makes up a web site in order to identify feeds that are not desirable for a public search engine to be indexing, and the filter and filtering method contains context based flexibilities, threshold based flexibilities and multi-environment flexibilities to make the filter and filtering method useful and applicable to a range of tasks for a feed search engine or similar system.
36. A spam feed filter for filtering spam in a feed, the filter using a Bayesian filtering technique and characterized in that: it incorporates a complex collection of features that identify spam in feeds and minimizes the integration into a search index of irrelevant and somewhat deceitful use of internet content feeds for public consumption by recognizing that feeds are more rich in content than similar email filtering technologies, and utilizes the structure of the internet as well as the HTML that makes up a web site in order to identify feeds that are not desirable for a public search engine to be indexing, and the filter and filtering method contains context based flexibilities, threshold based flexibilities and multi-environment flexibilities to make the filter and filtering method useful and applicable to a range of tasks for a feed search engine or similar system.
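For illustration only, and without limiting the claims, the per-url decision flow recited in claim 28 can be sketched as a single function. Every callable passed in (`is_spam_url`, `fetch`, `is_usable`, `transform`, `detect_type`, `parse`, `is_spam_data`, `categorize`) is a hypothetical hook invented for this sketch, as are the status strings and the shape of the returned result.

```python
def process_url_job(url, is_spam_url, fetch, is_usable, transform,
                    detect_type, parse, is_spam_data, categorize):
    """Walk one url job through the decision flow; all hooks are hypothetical."""
    result = {"url": url, "status": None}

    # Known-spam urls are skipped before any request is made.
    if is_spam_url(url):
        result["status"] = "skipped_spam_url"
        return result

    data = fetch(url)
    if not is_usable(data):
        # Attempt to translate or transform the data into a usable form.
        data = transform(data)
        if data is None or not is_usable(data):
            result["status"] = "skipped_unusable"
            return result

    # Usable data: determine the type, parse, filter for spam, categorize.
    kind = detect_type(data)
    entries = parse(data, kind)
    kept = [entry for entry in entries if not is_spam_data(entry)]
    result.update(
        status="crawled",
        type=kind,
        entries=[(entry, categorize(entry)) for entry in kept],
    )
    return result
```

Keeping each decision behind its own hook mirrors the way the claim separates the spam check, the usability check, and the parsing and categorizing steps, so that any one of them can be replaced without touching the others.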
PCT/US2007/019558 2006-09-07 2007-09-07 Feed crawling system and method and spam feed filter WO2008030568A2 (en)

Applications Claiming Priority (8)

Application Number Priority Date Filing Date Title
US82490306P 2006-09-07 2006-09-07
US60/824,903 2006-09-07
US82511406P 2006-09-08 2006-09-08
US60/825,114 2006-09-08
US85059207A 2007-09-05 2007-09-05
US85057707A 2007-09-05 2007-09-05
US11/850,577 2007-09-05
US11/850,592 2007-09-05

Publications (2)

Publication Number Publication Date
WO2008030568A2 true WO2008030568A2 (en) 2008-03-13
WO2008030568A3 WO2008030568A3 (en) 2008-10-16

Family

ID=39157869

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2007/019558 WO2008030568A2 (en) 2006-09-07 2007-09-07 Feed crawling system and method and spam feed filter

Country Status (1)

Country Link
WO (1) WO2008030568A2 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108491438A (en) * 2018-02-12 2018-09-04 陆夏根 A kind of technology policy retrieval analysis method

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108710672B (en) * 2018-05-17 2020-04-14 南京大学 Theme crawler method based on incremental Bayesian algorithm

Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6182085B1 (en) * 1998-05-28 2001-01-30 International Business Machines Corporation Collaborative team crawling:Large scale information gathering over the internet
US6266664B1 (en) * 1997-10-01 2001-07-24 Rulespace, Inc. Method for scanning, analyzing and rating digital information content
US6377984B1 (en) * 1999-11-02 2002-04-23 Alta Vista Company Web crawler system using parallel queues for queing data sets having common address and concurrently downloading data associated with data set in each queue
US20020188841A1 (en) * 1995-07-27 2002-12-12 Jones Kevin C. Digital asset management and linking media signals with related data using watermarks
US20020194161A1 (en) * 2001-04-12 2002-12-19 Mcnamee J. Paul Directed web crawler with machine learning
US6631369B1 (en) * 1999-06-30 2003-10-07 Microsoft Corporation Method and system for incremental web crawling
US6738767B1 (en) * 2000-03-20 2004-05-18 International Business Machines Corporation System and method for discovering schematic structure in hypertext documents
US20050086206A1 (en) * 2003-10-15 2005-04-21 International Business Machines Corporation System, Method, and service for collaborative focused crawling of documents on a network
US20050102259A1 (en) * 2003-11-12 2005-05-12 Yahoo! Inc. Systems and methods for search query processing using trend analysis
US20050192936A1 (en) * 2004-02-12 2005-09-01 Meek Christopher A. Decision-theoretic web-crawling and predicting web-page change
US20050262062A1 (en) * 2004-05-08 2005-11-24 Xiongwu Xia Methods and apparatus providing local search engine
US20060136420A1 (en) * 2004-12-20 2006-06-22 Yahoo!, Inc. System and method for providing improved access to a search tool in electronic mail-enabled applications

Also Published As

Publication number Publication date
WO2008030568A3 (en) 2008-10-16

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 07811709

Country of ref document: EP

Kind code of ref document: A2

NENP Non-entry into the national phase in:

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 07811709

Country of ref document: EP

Kind code of ref document: A2