WO2009129425A3 - Forum web page clustering based on repetitive regions - Google Patents

Forum web page clustering based on repetitive regions

Info

Publication number
WO2009129425A3
Authority
WO
WIPO (PCT)
Prior art keywords
repetitive regions
web page
page
regions
feature set
Prior art date
Application number
PCT/US2009/040881
Other languages
French (fr)
Other versions
WO2009129425A2 (en)
Inventor
Wei Lai
Original Assignee
Microsoft Corporation
Priority date
Filing date
Publication date
Application filed by Microsoft Corporation
Priority to EP09731930A (EP2291812A4)
Publication of WO2009129425A2
Publication of WO2009129425A3

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06Q: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00: Administration; Management
    • G06Q10/10: Office automation; Time management

Landscapes

  • Engineering & Computer Science (AREA)
  • Business, Economics & Management (AREA)
  • Strategic Management (AREA)
  • Entrepreneurship & Innovation (AREA)
  • Human Resources & Organizations (AREA)
  • Operations Research (AREA)
  • Economics (AREA)
  • Marketing (AREA)
  • Data Mining & Analysis (AREA)
  • Quality & Reliability (AREA)
  • Tourism & Hospitality (AREA)
  • Physics & Mathematics (AREA)
  • General Business, Economics & Management (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Image Analysis (AREA)

Abstract

Described is a technology by which forum web pages are processed into clusters for classification purposes, including by determining repetitive regions between pages and associating pages that have similar repetitive regions into a common cluster. Patterns corresponding to the regions are determined, and a feature set based at least in part on those patterns (e.g., pattern frequency) is extracted from the page. The feature set of a page is compared against the feature set of another page to determine similarity therewith, e.g., via a feature space distance computation that is evaluated against a threshold distance.
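
The comparison step in the abstract can be illustrated with a minimal Python sketch. It is not the patented method: it assumes each page has already been reduced to a list of tag-path strings, treats any path that occurs at least `min_repeats` times as a stand-in for a repetitive-region pattern, uses relative pattern frequency as the feature set, and uses Euclidean distance with a greedy single-pass grouping. The function names, the `min_repeats` and `threshold` parameters, and the sample pages are all hypothetical.

```python
from collections import Counter
from math import sqrt


def pattern_features(tag_paths, min_repeats=2):
    """Relative frequency of each repeated tag path on one page.

    Assumption: a tag path seen at least `min_repeats` times stands in
    for a 'repetitive region' pattern.
    """
    counts = Counter(tag_paths)
    repeated = {p: c for p, c in counts.items() if c >= min_repeats}
    total = sum(repeated.values())
    return {p: c / total for p, c in repeated.items()} if total else {}


def distance(f1, f2):
    """Euclidean distance between two sparse feature sets."""
    keys = set(f1) | set(f2)
    return sqrt(sum((f1.get(k, 0.0) - f2.get(k, 0.0)) ** 2 for k in keys))


def cluster_pages(pages, threshold=0.5):
    """Greedy single-pass clustering (an assumed strategy).

    A page joins the first cluster whose first member lies within
    `threshold`; otherwise it starts a new cluster.
    """
    clusters = []  # list of (representative features, member names)
    for name, tag_paths in pages.items():
        feats = pattern_features(tag_paths)
        for rep, members in clusters:
            if distance(rep, feats) <= threshold:
                members.append(name)
                break
        else:
            clusters.append((feats, [name]))
    return [members for _, members in clusters]


# Hypothetical pages: two thread pages and one board (topic list) page.
pages = {
    "thread_1": ["div/post"] * 10 + ["div/nav"],
    "thread_2": ["div/post"] * 8 + ["div/nav", "div/ad"],
    "board_1": ["tr/topic"] * 20 + ["div/nav"],
}
print(cluster_pages(pages))
```

With these sample inputs the two thread pages share the same dominant repeated pattern and fall into one cluster, while the board page starts its own. A real system would need a proper region detector and a tuned threshold.
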
PCT/US2009/040881 2008-04-16 2009-04-16 Forum web page clustering based on repetitive regions WO2009129425A2 (en)

Priority Applications (1)

Application Number Publication Number Priority Date Filing Date Title
EP09731930A EP2291812A4 (en) 2008-04-16 2009-04-16 Forum web page clustering based on repetitive regions

Applications Claiming Priority (2)

Application Number Publication Number Priority Date Filing Date Title
US12/103,712 US8051083B2 (en) 2008-04-16 2008-04-16 Forum web page clustering based on repetitive regions
US12/103,712 2008-04-16

Publications (2)

Publication Number Publication Date
WO2009129425A2 WO2009129425A2 (en) 2009-10-22
WO2009129425A3 WO2009129425A3 (en) 2010-02-18

Family

ID=41199745

Family Applications (1)

Application Number Publication Number Priority Date Filing Date Title
PCT/US2009/040881 WO2009129425A2 (en) 2008-04-16 2009-04-16 Forum web page clustering based on repetitive regions

Country Status (3)

Country Link
US (1) US8051083B2 (en)
EP (1) EP2291812A4 (en)
WO (1) WO2009129425A2 (en)

Families Citing this family (24)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8046361B2 (en) * 2008-04-18 2011-10-25 Yahoo! Inc. System and method for classifying tags of content using a hyperlinked corpus of classified web pages
US8099408B2 (en) * 2008-06-27 2012-01-17 Microsoft Corporation Web forum crawling using skeletal links
US8370119B2 (en) * 2009-02-19 2013-02-05 Microsoft Corporation Website design pattern modeling
US8429110B2 (en) * 2010-06-10 2013-04-23 Microsoft Corporation Pattern tree-based rule learning
US8458584B1 (en) * 2010-06-28 2013-06-04 Google Inc. Extraction and analysis of user-generated content
US20120101721A1 (en) * 2010-10-21 2012-04-26 Telenav, Inc. Navigation system with xpath repetition based field alignment mechanism and method of operation thereof
US9171080B2 (en) 2010-11-12 2015-10-27 Microsoft Technology Licensing Llc Domain constraint path based data record extraction
US8983980B2 (en) 2010-11-12 2015-03-17 Microsoft Technology Licensing, Llc Domain constraint based data record extraction
US8825813B2 (en) 2010-12-28 2014-09-02 Microsoft Corporation Distributed network coordinate system based on network performance
US8527516B1 (en) * 2011-02-25 2013-09-03 Google Inc. Identifying similar digital text volumes
US8762365B1 (en) * 2011-08-05 2014-06-24 Amazon Technologies, Inc. Classifying network sites using search queries
USD674404S1 (en) 2011-10-26 2013-01-15 Mcafee, Inc. Computer having graphical user interface
USD673967S1 (en) 2011-10-26 2013-01-08 Mcafee, Inc. Computer having graphical user interface
USD674403S1 (en) 2011-10-26 2013-01-15 Mcafee, Inc. Computer having graphical user interface
USD677687S1 (en) 2011-10-27 2013-03-12 Mcafee, Inc. Computer display screen with graphical user interface
US20130144847A1 (en) * 2011-12-05 2013-06-06 Google Inc. De-Duplication of Featured Content
US20140101533A1 (en) * 2012-10-02 2014-04-10 Percussion Software, Inc. Lossless application of new information architecture to existing websites, web pages, and online content
CN103853770B (en) * 2012-12-03 2018-08-14 北大方正集团有限公司 The method and system of model content in a kind of extraction forum Web pages
US9373031B2 (en) * 2013-03-14 2016-06-21 Digitech Systems Private Reserve, LLC System and method for document alignment, correction, and classification
JP5882272B2 (en) * 2013-08-30 2016-03-09 京セラドキュメントソリューションズ株式会社 Document evaluation program and document evaluation apparatus
US20150100563A1 (en) * 2013-10-09 2015-04-09 Go Daddy Operating Company, LLC Method for retaining search engine optimization in a transferred website
US10095677B1 (en) * 2014-06-26 2018-10-09 Amazon Technologies, Inc. Detection of layouts in electronic documents
US10691664B1 (en) * 2017-07-18 2020-06-23 FullStory, Inc. User interface structural clustering and analysis
US11455485B2 (en) * 2020-06-29 2022-09-27 Adobe Inc. Content prediction based on pixel-based vectors

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20050246296A1 (en) * 2004-04-29 2005-11-03 Microsoft Corporation Method and system for calculating importance of a block within a display page
JP2007080061A (en) * 2005-09-15 2007-03-29 Univ Of Tsukuba Retrieval method of web page and clustering method of web page
US20070208703A1 (en) * 2006-03-03 2007-09-06 Microsoft Corporation Web forum crawler
US20080010291A1 (en) * 2006-07-05 2008-01-10 Krishna Leela Poola Techniques for clustering structurally similar web pages

Family Cites Families (18)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6119124A (en) * 1998-03-26 2000-09-12 Digital Equipment Corporation Method for clustering closely resembling data objects
US6751621B1 (en) * 2000-01-27 2004-06-15 Manning & Napier Information Services, Llc. Construction of trainable semantic vectors and clustering, classification, and searching using trainable semantic vectors
US7185001B1 (en) * 2000-10-04 2007-02-27 Torch Concepts Systems and methods for document searching and organizing
JP3664475B2 (en) * 2001-02-09 2005-06-29 インターナショナル・ビジネス・マシーンズ・コーポレーション Information processing method, information processing system, program, and recording medium
US6804684B2 (en) * 2001-05-07 2004-10-12 Eastman Kodak Company Method for associating semantic information with multiple images in an image database environment
US7143365B2 (en) * 2002-06-18 2006-11-28 Webtrends, Inc. Method and apparatus for using a browser to configure a software program
US7613687B2 (en) * 2003-05-30 2009-11-03 Truelocal Inc. Systems and methods for enhancing web-based searching
US8346770B2 (en) * 2003-09-22 2013-01-01 Google Inc. Systems and methods for clustering search results
US7293007B2 (en) * 2004-04-29 2007-11-06 Microsoft Corporation Method and system for identifying image relatedness using link and page layout analysis
CN1702651A (en) * 2004-05-24 2005-11-30 富士通株式会社 Recognition method and apparatus for information files of specific types
US7565346B2 (en) * 2004-05-31 2009-07-21 International Business Machines Corporation System and method for sequence-based subspace pattern clustering
US7428530B2 (en) * 2004-07-01 2008-09-23 Microsoft Corporation Dispersing search engine results by using page category information
US20060143158A1 (en) * 2004-12-14 2006-06-29 Ruhl Jan M Method, system and graphical user interface for providing reviews for a product
US8117203B2 (en) * 2005-07-15 2012-02-14 Fetch Technologies, Inc. Method and system for automatically extracting data from web sites
US7725451B2 (en) * 2006-01-23 2010-05-25 Microsoft Corporation Generating clusters of images for search results
US7571162B2 (en) * 2006-03-01 2009-08-04 Microsoft Corporation Comparative web search
US7676465B2 (en) * 2006-07-05 2010-03-09 Yahoo! Inc. Techniques for clustering structurally similar web pages based on page features
US7660804B2 (en) * 2006-08-16 2010-02-09 Microsoft Corporation Joint optimization of wrapper generation and template detection

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
See also references of EP2291812A4 *

Also Published As

Publication number Publication date
WO2009129425A2 (en) 2009-10-22
EP2291812A2 (en) 2011-03-09
US20090265363A1 (en) 2009-10-22
US8051083B2 (en) 2011-11-01
EP2291812A4 (en) 2013-02-06

Similar Documents

Publication Publication Date Title
WO2009129425A3 (en) Forum web page clustering based on repetitive regions
Iam-On et al. Linkclue: A matlab package for link-based cluster ensembles
ATE450012T1 (en) COMPUTER-ASSISTED DOCUMENT RETRIEVAL
WO2010144671A3 (en) Trend analysis in content identification based on fingerprinting
WO2007084836A3 (en) Match-based employment system and method
WO2012006097A3 (en) Low confidence face recognition
WO2007005975A3 (en) Risk modeling system
GB2457515A (en) Similarity detection and clustering of images
WO2006099971A3 (en) Multi-ply security paper
WO2010008800A3 (en) Query identification and association
TW200739371A (en) Information processing apparatus and method, and a computer readable storage medium encoded with a computer program
WO2012177794A3 (en) Identifying information related to a particular entity from electronic sources, using dimensional reduction and quantum clustering
WO2008085458A3 (en) Irregular input identification
ATE479384T1 (en) BODY PARAMETERS DETECTION
WO2010063463A3 (en) Face recognition using face detection classifier data
WO2010141799A3 (en) Feature engineering and user behavior analysis
EP2671189A1 (en) Systems and methods for biometric identification
IL192727A0 (en) Method and apparatuses for extending dynamic handwriting recognition to recognize static handwritten and machine generated text
WO2007059232A3 (en) Methods and apparatus for probe-based clustering
WO2009120263A3 (en) System and method for illumination invariant image segmentation
WO2008152277A3 (en) Method of handling transmittals including a graphic classification of the signatures associated with the transmittals
EP1501047A3 (en) Fingerprinting segments of data content for version identification
WO2009108871A3 (en) Feedback systems and methods for recognizing patterns
WO2009060888A1 (en) Author's influence determination system, author's influence determination method, and program
WO2009065146A3 (en) System and method for conducting a patent search

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 09731930

Country of ref document: EP

Kind code of ref document: A2

NENP Non-entry into the national phase

Ref country code: DE

WWE Wipo information: entry into national phase

Ref document number: 2009731930

Country of ref document: EP