WO2009129425A3 - Forum web page clustering based on repetitive regions - Google Patents
Forum web page clustering based on repetitive regions
- Publication number
- WO2009129425A3 (application PCT/US2009/040881)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- repetitive regions
- web page
- page
- regions
- feature set
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q10/00—Administration; Management
- G06Q10/10—Office automation; Time management
Landscapes
- Engineering & Computer Science (AREA)
- Business, Economics & Management (AREA)
- Strategic Management (AREA)
- Entrepreneurship & Innovation (AREA)
- Human Resources & Organizations (AREA)
- Operations Research (AREA)
- Economics (AREA)
- Marketing (AREA)
- Data Mining & Analysis (AREA)
- Quality & Reliability (AREA)
- Tourism & Hospitality (AREA)
- Physics & Mathematics (AREA)
- General Business, Economics & Management (AREA)
- General Physics & Mathematics (AREA)
- Theoretical Computer Science (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
- Image Analysis (AREA)
Abstract
Described is a technology by which forum web pages are processed into clusters for classification purposes, including by determining repetitive regions between pages and associating pages that have similar repetitive regions into a common cluster. Patterns corresponding to the regions are determined, and a feature set based at least in part on those patterns (e.g., pattern frequency) is extracted from the page. The feature set of a page is compared against the feature set of another page to determine similarity therewith, e.g., via a feature space distance computation that is evaluated against a threshold distance.
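The comparison step described in the abstract can be illustrated with a minimal sketch. It assumes the repetitive-region patterns of each page have already been detected and are supplied as plain strings; the pattern labels, the use of relative pattern frequency as the only feature, the Euclidean distance, the greedy single-pass grouping, and the threshold value are all illustrative assumptions, not details taken from the publication.

```python
from collections import Counter
from math import sqrt


def feature_set(patterns):
    """Map a page's repetitive-region patterns to relative frequencies."""
    counts = Counter(patterns)
    total = sum(counts.values())
    return {p: c / total for p, c in counts.items()} if total else {}


def distance(a, b):
    """Euclidean distance between two sparse feature sets (assumed metric)."""
    keys = set(a) | set(b)
    return sqrt(sum((a.get(k, 0.0) - b.get(k, 0.0)) ** 2 for k in keys))


def cluster_pages(pages, threshold):
    """Greedy single pass: a page joins the first cluster whose
    representative feature set lies within the threshold distance."""
    clusters = []  # list of (representative feature set, [page ids])
    for page_id, patterns in pages.items():
        feats = feature_set(patterns)
        for rep, members in clusters:
            if distance(rep, feats) <= threshold:
                members.append(page_id)
                break
        else:
            # No existing cluster is close enough, so start a new one.
            clusters.append((feats, [page_id]))
    return [members for _, members in clusters]


# Hypothetical pattern labels standing in for detected repetitive regions.
pages = {
    "thread_1": ["post", "post", "post", "nav"],
    "thread_2": ["post", "post", "nav"],
    "board_1": ["thread_row", "thread_row", "thread_row", "nav"],
}
clusters = cluster_pages(pages, threshold=0.3)
```

In this toy input the two thread pages share a dominant "post" pattern and fall into one cluster, while the board page, dominated by a different pattern, starts its own. A real system would likely use richer features and a more robust clustering procedure.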
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
EP09731930A EP2291812A4 (en) | 2008-04-16 | 2009-04-16 | Forum web page clustering based on repetitive regions |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US12/103,712 US8051083B2 (en) | 2008-04-16 | 2008-04-16 | Forum web page clustering based on repetitive regions |
US12/103,712 | 2008-04-16 |
Publications (2)
Publication Number | Publication Date |
---|---|
WO2009129425A2 (en) | 2009-10-22 |
WO2009129425A3 (en) | 2010-02-18 |
Family
ID=41199745
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/US2009/040881 WO2009129425A2 (en) | 2008-04-16 | 2009-04-16 | Forum web page clustering based on repetitive regions |
Country Status (3)
Country | Link |
---|---|
US (1) | US8051083B2 (en) |
EP (1) | EP2291812A4 (en) |
WO (1) | WO2009129425A2 (en) |
Families Citing this family (24)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US8046361B2 (en) * | 2008-04-18 | 2011-10-25 | Yahoo! Inc. | System and method for classifying tags of content using a hyperlinked corpus of classified web pages |
US8099408B2 (en) * | 2008-06-27 | 2012-01-17 | Microsoft Corporation | Web forum crawling using skeletal links |
US8370119B2 (en) * | 2009-02-19 | 2013-02-05 | Microsoft Corporation | Website design pattern modeling |
US8429110B2 (en) * | 2010-06-10 | 2013-04-23 | Microsoft Corporation | Pattern tree-based rule learning |
US8458584B1 (en) * | 2010-06-28 | 2013-06-04 | Google Inc. | Extraction and analysis of user-generated content |
US20120101721A1 (en) * | 2010-10-21 | 2012-04-26 | Telenav, Inc. | Navigation system with xpath repetition based field alignment mechanism and method of operation thereof |
US9171080B2 (en) | 2010-11-12 | 2015-10-27 | Microsoft Technology Licensing, Llc | Domain constraint path based data record extraction |
US8983980B2 (en) | 2010-11-12 | 2015-03-17 | Microsoft Technology Licensing, Llc | Domain constraint based data record extraction |
US8825813B2 (en) | 2010-12-28 | 2014-09-02 | Microsoft Corporation | Distributed network coordinate system based on network performance |
US8527516B1 (en) * | 2011-02-25 | 2013-09-03 | Google Inc. | Identifying similar digital text volumes |
US8762365B1 (en) * | 2011-08-05 | 2014-06-24 | Amazon Technologies, Inc. | Classifying network sites using search queries |
USD674404S1 (en) | 2011-10-26 | 2013-01-15 | Mcafee, Inc. | Computer having graphical user interface |
USD673967S1 (en) | 2011-10-26 | 2013-01-08 | Mcafee, Inc. | Computer having graphical user interface |
USD674403S1 (en) | 2011-10-26 | 2013-01-15 | Mcafee, Inc. | Computer having graphical user interface |
USD677687S1 (en) | 2011-10-27 | 2013-03-12 | Mcafee, Inc. | Computer display screen with graphical user interface |
US20130144847A1 (en) * | 2011-12-05 | 2013-06-06 | Google Inc. | De-Duplication of Featured Content |
US20140101533A1 (en) * | 2012-10-02 | 2014-04-10 | Percussion Software, Inc. | Lossless application of new information architecture to existing websites, web pages, and online content |
CN103853770B (en) * | 2012-12-03 | 2018-08-14 | 北大方正集团有限公司 | Method and system for extracting model content from forum web pages |
US9373031B2 (en) * | 2013-03-14 | 2016-06-21 | Digitech Systems Private Reserve, LLC | System and method for document alignment, correction, and classification |
JP5882272B2 (en) * | 2013-08-30 | 2016-03-09 | 京セラドキュメントソリューションズ株式会社 | Document evaluation program and document evaluation apparatus |
US20150100563A1 (en) * | 2013-10-09 | 2015-04-09 | Go Daddy Operating Company, LLC | Method for retaining search engine optimization in a transferred website |
US10095677B1 (en) * | 2014-06-26 | 2018-10-09 | Amazon Technologies, Inc. | Detection of layouts in electronic documents |
US10691664B1 (en) * | 2017-07-18 | 2020-06-23 | FullStory, Inc. | User interface structural clustering and analysis |
US11455485B2 (en) * | 2020-06-29 | 2022-09-27 | Adobe Inc. | Content prediction based on pixel-based vectors |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20050246296A1 (en) * | 2004-04-29 | 2005-11-03 | Microsoft Corporation | Method and system for calculating importance of a block within a display page |
JP2007080061A (en) * | 2005-09-15 | 2007-03-29 | Univ Of Tsukuba | Retrieval method of web page and clustering method of web page |
US20070208703A1 (en) * | 2006-03-03 | 2007-09-06 | Microsoft Corporation | Web forum crawler |
US20080010291A1 (en) * | 2006-07-05 | 2008-01-10 | Krishna Leela Poola | Techniques for clustering structurally similar web pages |
Family Cites Families (18)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US6119124A (en) * | 1998-03-26 | 2000-09-12 | Digital Equipment Corporation | Method for clustering closely resembling data objects |
US6751621B1 (en) * | 2000-01-27 | 2004-06-15 | Manning & Napier Information Services, Llc. | Construction of trainable semantic vectors and clustering, classification, and searching using trainable semantic vectors |
US7185001B1 (en) * | 2000-10-04 | 2007-02-27 | Torch Concepts | Systems and methods for document searching and organizing |
JP3664475B2 (en) * | 2001-02-09 | 2005-06-29 | インターナショナル・ビジネス・マシーンズ・コーポレーション | Information processing method, information processing system, program, and recording medium |
US6804684B2 (en) * | 2001-05-07 | 2004-10-12 | Eastman Kodak Company | Method for associating semantic information with multiple images in an image database environment |
US7143365B2 (en) * | 2002-06-18 | 2006-11-28 | Webtrends, Inc. | Method and apparatus for using a browser to configure a software program |
US7613687B2 (en) * | 2003-05-30 | 2009-11-03 | Truelocal Inc. | Systems and methods for enhancing web-based searching |
US8346770B2 (en) * | 2003-09-22 | 2013-01-01 | Google Inc. | Systems and methods for clustering search results |
US7293007B2 (en) * | 2004-04-29 | 2007-11-06 | Microsoft Corporation | Method and system for identifying image relatedness using link and page layout analysis |
CN1702651A (en) * | 2004-05-24 | 2005-11-30 | 富士通株式会社 | Recognition method and apparatus for information files of specific types |
US7565346B2 (en) * | 2004-05-31 | 2009-07-21 | International Business Machines Corporation | System and method for sequence-based subspace pattern clustering |
US7428530B2 (en) * | 2004-07-01 | 2008-09-23 | Microsoft Corporation | Dispersing search engine results by using page category information |
US20060143158A1 (en) * | 2004-12-14 | 2006-06-29 | Ruhl Jan M | Method, system and graphical user interface for providing reviews for a product |
US8117203B2 (en) * | 2005-07-15 | 2012-02-14 | Fetch Technologies, Inc. | Method and system for automatically extracting data from web sites |
US7725451B2 (en) * | 2006-01-23 | 2010-05-25 | Microsoft Corporation | Generating clusters of images for search results |
US7571162B2 (en) * | 2006-03-01 | 2009-08-04 | Microsoft Corporation | Comparative web search |
US7676465B2 (en) * | 2006-07-05 | 2010-03-09 | Yahoo! Inc. | Techniques for clustering structurally similar web pages based on page features |
US7660804B2 (en) * | 2006-08-16 | 2010-02-09 | Microsoft Corporation | Joint optimization of wrapper generation and template detection |
2008
- 2008-04-16: US application US12/103,712, published as US8051083B2 (en); not active (Expired - Fee Related)
2009
- 2009-04-16: EP application EP09731930A, published as EP2291812A4 (en); not active (Ceased)
- 2009-04-16: WO application PCT/US2009/040881, published as WO2009129425A2 (en); active (Application Filing)
Non-Patent Citations (1)
Title |
---|
See also references of EP2291812A4 * |
Also Published As
Publication number | Publication date |
---|---|
WO2009129425A2 (en) | 2009-10-22 |
EP2291812A2 (en) | 2011-03-09 |
US20090265363A1 (en) | 2009-10-22 |
US8051083B2 (en) | 2011-11-01 |
EP2291812A4 (en) | 2013-02-06 |
Similar Documents
Publication | Title | Publication Date |
---|---|---|
WO2009129425A3 (en) | Forum web page clustering based on repetitive regions | |
Iam-On et al. | Linkclue: A matlab package for link-based cluster ensembles | |
ATE450012T1 (en) | COMPUTER-ASSISTED DOCUMENT RETRIEVAL | |
WO2010144671A3 (en) | Trend analysis in content identification based on fingerprinting | |
WO2007084836A3 (en) | Match-based employment system and method | |
WO2012006097A3 (en) | Low confidence face recognition | |
WO2007005975A3 (en) | Risk modeling system | |
GB2457515A (en) | Similarity detection and clustering of images | |
WO2006099971A3 (en) | Multi-ply security paper | |
WO2010008800A3 (en) | Query identification and association | |
TW200739371A (en) | Information processing apparatus and method, and a computer readable storage medium encoded with a computer program | |
WO2012177794A3 (en) | Identifying information related to a particular entity from electronic sources, using dimensional reduction and quantum clustering | |
WO2008085458A3 (en) | Irregular input identification | |
ATE479384T1 (en) | BODY PARAMETERS DETECTION | |
WO2010063463A3 (en) | Face recognition using face detection classifier data | |
WO2010141799A3 (en) | Feature engineering and user behavior analysis | |
EP2671189A1 (en) | Systems and methods for biometric identification | |
IL192727A0 (en) | Method and apparatuses for extending dynamic handwriting recognition to recognize static handwritten and machine generated text | |
WO2007059232A3 (en) | Methods and apparatus for probe-based clustering | |
WO2009120263A3 (en) | System and method for illumination invariant image segmentation | |
WO2008152277A3 (en) | Method of handling transmittals including a graphic classification of the signatures associated with the transmittals | |
EP1501047A3 (en) | Fingerprinting segments of data content for version identification | |
WO2009108871A3 (en) | Feedback systems and methods for recognizing patterns | |
WO2009060888A1 (en) | Author's influence determination system, author's influence determination method, and program | |
WO2009065146A3 (en) | System and method for conducting a patent search |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
 | 121 | EP: the EPO has been informed by WIPO that EP was designated in this application | Ref document number: 09731930; Country of ref document: EP; Kind code of ref document: A2 |
 | NENP | Non-entry into the national phase | Ref country code: DE |
 | WWE | WIPO information: entry into national phase | Ref document number: 2009731930; Country of ref document: EP |