Information Systems Frontiers

Published: 1 November 2014

An FAR-SW based approach for webpage information extraction

Authors: Zhan Bu, Chengcui Zhang, Zhengyou Xia, Jiandong Wang

Published in: Information Systems Frontiers | Issue 5/2014

Abstract

Automatically identifying and extracting the target information of a webpage, especially its main text, is a critical task in many web content analysis applications, such as information retrieval and automated screen reading. However, compared with typical plain text, information on the web is structurally complex and follows no single fixed template or layout. In addition, the number of presentation elements on web pages, such as dynamic navigational menus, flashing logos, and a multitude of ad blocks, has increased rapidly in the past decade. In this paper, we propose a statistics-based approach that integrates fuzzy association rules (FAR) with a sliding window (SW) to efficiently extract the main text content from web pages. Our approach involves two separate stages. In Stage 1, the original HTML source is pre-processed and features are extracted for every line of text; supervised learning is then performed to detect fuzzy association rules in the training web pages. In Stage 2, the HTML source is pre-processed and text line features are extracted in the same way as in Stage 1, after which the extracted fuzzy association rules are used to test whether each text line belongs to the main text. Next, a sliding window is applied to segment the web page into several potential topical blocks. Finally, a simple selection algorithm selects the important blocks, which are then united as the detected topical region (the main text). Experimental results on real-world data show that our approach is more efficient and more accurate than existing Document Object Model (DOM)-based and vision-based approaches.
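
The Stage 2 steps described above (per-line test, sliding-window segmentation, block selection) can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not give the line features, the rule format, the window width, or the selection criterion, so every function name, parameter, and threshold below is a hypothetical stand-in. The boolean `flags` list is assumed to be the result of testing each text line against the learned fuzzy association rules.

```python
from typing import List, Tuple


def window_densities(flags: List[bool], width: int) -> List[float]:
    """Share of lines flagged as main text in each window of `width` lines."""
    if width <= 0:
        raise ValueError("width must be positive")
    if len(flags) < width:
        return []
    count = sum(flags[:width])
    densities = [count / width]
    # Slide the window one line at a time, updating the count incrementally.
    for start in range(1, len(flags) - width + 1):
        count += flags[start + width - 1] - flags[start - 1]
        densities.append(count / width)
    return densities


def segment_blocks(flags: List[bool], width: int,
                   min_density: float) -> List[Tuple[int, int]]:
    """Merge overlapping or adjacent dense windows into blocks [start, end)."""
    blocks: List[Tuple[int, int]] = []
    for start, density in enumerate(window_densities(flags, width)):
        if density < min_density:
            continue
        end = start + width
        if blocks and start <= blocks[-1][1]:
            blocks[-1] = (blocks[-1][0], end)  # extend the current block
        else:
            blocks.append((start, end))        # open a new block
    return blocks


def select_main_text(lines: List[str], flags: List[bool], width: int = 3,
                     min_density: float = 0.6,
                     min_share: float = 0.2) -> List[str]:
    """Keep blocks holding at least `min_share` of all flagged lines in blocks
    (an assumed selection rule), then return their flagged lines in order."""
    blocks = segment_blocks(flags, width, min_density)
    sizes = [sum(flags[s:e]) for s, e in blocks]
    total = sum(sizes)
    kept = [b for b, size in zip(blocks, sizes)
            if total and size / total >= min_share]
    return [lines[i] for s, e in kept for i in range(s, e) if flags[i]]


# Hypothetical page: navigation, three paragraphs, ads, one stray flagged line.
lines = ["Home | About", "Para 1", "Para 2", "Para 3", "Ad", "Ad",
         "Footer", "Related link", "Copyright", "Contact"]
flags = [False, True, True, True, False, False, False, True, False, False]
print(select_main_text(lines, flags))  # prints ['Para 1', 'Para 2', 'Para 3']
```

In this sketch the isolated flagged line ("Related link") never falls inside a sufficiently dense window, so it is discarded, which is the kind of noise suppression a window-based segmentation is meant to provide.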


Title: An FAR-SW based approach for webpage information extraction
Authors: Zhan Bu, Chengcui Zhang, Zhengyou Xia, Jiandong Wang
Publication date: 1 November 2014
Publisher: Springer US
Published in: Information Systems Frontiers, Issue 5/2014
Print ISSN: 1387-3326
Electronic ISSN: 1572-9419
DOI: https://doi.org/10.1007/s10796-013-9412-2
