2015 | OriginalPaper | Chapter
Enhanced Web Page Cleaning for Constructing Social Media Text Corpora
Authors: Melanie Neunerdt, Eva Reimer, Michael Reyer, Rudolf Mathar
Published in: Information Science and Applications
Publisher: Springer Berlin Heidelberg
Web page cleaning is one of the most essential tasks in Web corpus construction. The intention is to separate the main content from navigational elements, templates, and advertisements, often referred to as boilerplate.
In this paper, we enhance Web page cleaning for pages containing comments and introduce a new training corpus for that purpose. Besides extending an existing boilerplate detection algorithm with a comment classifier, we train and test different classifiers on extended feature sets, solving a two-class problem (content vs. boilerplate) on our corpus and an existing benchmark corpus. Results show that the proposed approach outperforms existing methods, particularly on comment pages from different domains. Finally, we point out that our trained classifiers are domain independent and can be transferred to other languages with only small adjustments.
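To make the two-class setting (content vs. boilerplate) concrete, the following is a minimal sketch of a rule-based block classifier. It is not the method of the chapter: the `TextBlock` type, the two features (word count and link density), and the thresholds are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class TextBlock:
    """A block of page text (hypothetical representation)."""
    text: str
    linked_words: int  # number of words inside hyperlinks (assumed to be known)


def features(block: TextBlock) -> dict:
    """Compute two simple, illustrative features for a text block."""
    num_words = len(block.text.split())
    link_density = block.linked_words / num_words if num_words else 0.0
    return {"num_words": num_words, "link_density": link_density}


def classify(block: TextBlock, min_words: int = 10,
             max_link_density: float = 0.33) -> str:
    """Label a block as 'content' or 'boilerplate'.

    The thresholds are arbitrary assumptions, not values from the chapter.
    """
    f = features(block)
    if f["link_density"] > max_link_density:
        return "boilerplate"  # mostly links, e.g. a navigation bar
    if f["num_words"] < min_words:
        return "boilerplate"  # too short to be running text
    return "content"


nav = TextBlock("Home News Sports Contact", linked_words=4)
comment = TextBlock(
    "This is a longer user comment that discusses the article "
    "in some detail and has no links",
    linked_words=0,
)
print(classify(nav))      # navigation-like block
print(classify(comment))  # comment-like block
```

A trained classifier would replace the hand-set thresholds with parameters learned from a labeled corpus; the rule above only shows the shape of the decision.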