2015 | OriginalPaper | Chapter

Enhanced Web Page Cleaning for Constructing Social Media Text Corpora

Authors: Melanie Neunerdt, Eva Reimer, Michael Reyer, Rudolf Mathar

Published in: Information Science and Applications

Publisher: Springer Berlin Heidelberg

Web page cleaning is one of the most essential tasks in Web corpus construction. The intention is to separate the main content from navigational elements, templates, and advertisements, often referred to as boilerplate. In this paper, we enhance Web page cleaning for pages containing comments and introduce a new training corpus for that purpose. Besides extending an existing boilerplate detection algorithm by means of a comment classifier, we train and test different classifiers on extended feature sets, solving a two-class problem (content vs. boilerplate) on our corpus and an existing benchmark corpus. Results show that the proposed approach outperforms existing methods, particularly on comment pages from different domains. Finally, we point out that our trained classifiers are domain independent and can be transferred to other languages with only small adjustments.
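
To make the two-class setting (content vs. boilerplate) concrete, the following is a minimal illustrative sketch in Python. It is not the authors' algorithm or feature set: the `Block` type, the three shallow features (word count, link density, average word length), the nearest-centroid classifier, and the toy training blocks are all assumptions chosen only to show how text blocks of a page might be mapped to feature vectors and labelled.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

Vector = Tuple[float, ...]


@dataclass
class Block:
    """A text block of a Web page (hypothetical representation)."""
    text: str        # visible text of the block
    link_chars: int  # characters inside anchor tags, assumed supplied by an HTML parser


def features(block: Block) -> Vector:
    """Map a block to three shallow features, each scaled to roughly [0, 1]."""
    words = block.text.split()
    n_chars = max(len(block.text), 1)
    word_count = min(len(words), 50) / 50.0
    link_density = min(block.link_chars / n_chars, 1.0)
    avg_word_len = sum(len(w) for w in words) / max(len(words), 1) / 10.0
    return (word_count, link_density, avg_word_len)


def centroid(vectors: List[Vector]) -> Vector:
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return tuple(sum(v[i] for v in vectors) / n for i in range(len(vectors[0])))


def train(labelled: Dict[str, List[Block]]) -> Dict[str, Vector]:
    """Compute one centroid per class label (nearest-centroid classifier)."""
    return {
        label: centroid([features(b) for b in blocks])
        for label, blocks in labelled.items()
    }


def classify(model: Dict[str, Vector], block: Block) -> str:
    """Return the label whose centroid is closest in squared Euclidean distance."""
    x = features(block)

    def dist(c: Vector) -> float:
        return sum((a - b) ** 2 for a, b in zip(x, c))

    return min(model, key=lambda label: dist(model[label]))


# Toy training data, invented for illustration only.
MODEL = train({
    "content": [
        Block("Web page cleaning separates the main content of a page "
              "from templates and advertisements.", 0),
        Block("I agree with the previous comment, the second classifier "
              "worked much better for me.", 0),
    ],
    "boilerplate": [
        Block("Home | About | Contact", 16),
        Block("Log in Register Subscribe", 22),
    ],
})

if __name__ == "__main__":
    comment = Block("Thanks for sharing this, I had the same problem "
                    "with forum pages last week.", 0)
    nav = Block("Next page Previous page", 20)
    print(classify(MODEL, comment), classify(MODEL, nav))
```

In this sketch, link density does most of the work: navigation blocks consist almost entirely of anchor text, while comments and article text usually contain few links. A real system would need many more features and a far larger labelled corpus than this toy example suggests.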

Title: Enhanced Web Page Cleaning for Constructing Social Media Text Corpora
Authors: Melanie Neunerdt, Eva Reimer, Michael Reyer, Rudolf Mathar
Publisher: Springer Berlin Heidelberg
Book: Information Science and Applications
Print ISBN: 978-3-662-46577-6
Electronic ISBN: 978-3-662-46578-3
Copyright Year: 2015
DOI: https://doi.org/10.1007/978-3-662-46578-3_78
