Published in: Information Systems Frontiers 5/2014

1 November 2014

An FAR-SW based approach for webpage information extraction

Authors: Zhan Bu, Chengcui Zhang, Zhengyou Xia, Jiandong Wang

Abstract

Automatically identifying and extracting the target information of a webpage, especially the main text, is a critical task in many web content analysis applications, such as information retrieval and automated screen reading. However, compared with typical plain text, information on the web is structured in extremely complex ways and follows no single fixed template or layout. In addition, the number of presentation elements on web pages, such as dynamic navigational menus, flashing logos, and a multitude of ad blocks, has increased rapidly in the past decade. In this paper, we propose a statistics-based approach that integrates the concept of fuzzy association rules (FAR) with that of a sliding window (SW) to efficiently extract the main text content from web pages. Our approach involves two separate stages. In Stage 1, the original HTML source is pre-processed and features are extracted for every line of text; supervised learning is then performed to detect fuzzy association rules in training web pages. In Stage 2, HTML source pre-processing and text line feature extraction are carried out in the same way as in Stage 1, after which the extracted fuzzy association rules are used to test whether each text line belongs to the main text. Next, a sliding window is applied to segment the web page into several potential topical blocks. Finally, a simple selection algorithm selects the important blocks, which are then united as the detected topical region (the main text). Experimental results on real-world data show that the efficiency and accuracy of our approach are better than those of existing Document Object Model (DOM)-based and vision-based approaches.
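The Stage 2 pipeline described in the abstract (per-line scoring, sliding-window segmentation, block selection) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not name the line features, the learned rules, the window width, or the selection criterion, so everything below is an assumption made for illustration. In particular, the two features (text length and link density), the single hand-written fuzzy rule in `rule_degree`, the thresholds, and the `keep_ratio` selection heuristic are hypothetical stand-ins, and the supervised rule learning of Stage 1 is not shown at all.

```python
import re
from typing import List, Tuple

TAG = re.compile(r"<[^>]+>")
LINK = re.compile(r"(?is)<a\b[^>]*>(.*?)</a>")


def line_features(raw: str) -> Tuple[int, float]:
    """Return (text length, share of text inside <a> tags) for one HTML line."""
    text = TAG.sub("", raw).strip()
    link_text = "".join(TAG.sub("", m) for m in LINK.findall(raw)).strip()
    density = len(link_text) / len(text) if text else 1.0
    return len(text), min(density, 1.0)


def ramp(x: float, lo: float, hi: float) -> float:
    """Piecewise-linear fuzzy membership: 0 at or below lo, 1 at or above hi."""
    return max(0.0, min(1.0, (x - lo) / (hi - lo)))


def rule_degree(raw: str) -> float:
    """Assumed stand-in rule: long text AND low link density => main text.

    The paper learns its rules from training pages; this one is hand-written.
    The fuzzy AND is taken as the minimum of the two membership degrees.
    """
    length, density = line_features(raw)
    return min(ramp(length, 20, 80), 1.0 - ramp(density, 0.2, 0.6))


def windowed(scores: List[float], width: int = 3) -> List[float]:
    """Centered sliding-window mean of the per-line scores."""
    half = width // 2
    out = []
    for i in range(len(scores)):
        window = scores[max(0, i - half): i + half + 1]
        out.append(sum(window) / len(window))
    return out


def blocks(smoothed: List[float], threshold: float = 0.5) -> List[Tuple[int, int]]:
    """Group consecutive lines at or above threshold into (start, end) blocks.

    The end index is exclusive.
    """
    found, start = [], None
    for i, s in enumerate(smoothed):
        if s >= threshold and start is None:
            start = i
        elif s < threshold and start is not None:
            found.append((start, i))
            start = None
    if start is not None:
        found.append((start, len(smoothed)))
    return found


def extract_main_text(html_lines: List[str], keep_ratio: float = 0.5) -> List[str]:
    """Score lines, segment them into blocks, and keep the heaviest blocks."""
    scores = [rule_degree(line) for line in html_lines]
    candidates = blocks(windowed(scores))
    if not candidates:
        return []
    weights = [sum(scores[s:e]) for s, e in candidates]
    best = max(weights)
    result = []
    for (s, e), w in zip(candidates, weights):
        if w >= keep_ratio * best:  # assumed selection heuristic
            for raw in html_lines[s:e]:
                text = TAG.sub("", raw).strip()
                if text:
                    result.append(text)
    return result
```

Calling `extract_main_text` on a list of raw HTML lines returns the text of the lines in the selected blocks. Smoothing with a window before thresholding is what lets a short line inside a long article body survive, while an isolated long line inside a navigation area is less likely to form a block of its own.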

Metadata
Title
An FAR-SW based approach for webpage information extraction
Authors
Zhan Bu
Chengcui Zhang
Zhengyou Xia
Jiandong Wang
Publication date
1 November 2014
Publisher
Springer US
Published in
Information Systems Frontiers, Issue 5/2014
Print ISSN: 1387-3326
Electronic ISSN: 1572-9419
DOI
https://doi.org/10.1007/s10796-013-9412-2