2018 | OriginalPaper | Chapter

Automatic Web News Extraction Based on DS Theory Considering Content Topics

Authors: Kaihang Zhang, Chuang Zhang, Xiaojun Chen, Jianlong Tan

Published in: Computational Science – ICCS 2018

Publisher: Springer International Publishing

Abstract

In addition to the news content, most news web pages contain various kinds of noise, such as advertisements, recommendations, and navigation panels. This noise can hamper studies and applications that depend on pre-processing to extract the news content accurately. Existing methods of news content extraction mostly rely on non-content features, such as tag path, text layout, and DOM structure. Because they do not consider the topics of the news content, these methods have difficulty recognizing noise whose external characteristics resemble those of the news content. In this paper, we propose a method that combines non-content features with a topic feature, based on Dempster-Shafer (DS) theory, to increase recognition accuracy. We use maximal compatibility blocks to generate topics from text nodes and then obtain the feature values of the topics. Each feature is converted into evidence for DS theory, which supports the fusion of uncertain information. Experimental results on English and Chinese web pages show that incorporating the topic feature through DS theory noticeably improves extraction performance.
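
The fusion step mentioned in the abstract can be illustrated with Dempster's rule of combination, the standard way to merge two bodies of evidence in DS theory. The following is a minimal sketch, not the authors' implementation: the two-element frame {content, noise}, the names `layout_evidence` and `topic_evidence`, and all mass values are assumptions chosen for illustration, since the abstract does not say how features are converted into evidence.

```python
from itertools import product

# Assumed frame of discernment: a text node is either content or noise.
CONTENT = frozenset({"content"})
NOISE = frozenset({"noise"})
THETA = CONTENT | NOISE  # the whole frame, i.e. "undecided"


def combine(m1, m2):
    """Combine two mass functions with Dempster's rule.

    Each mass function maps a frozenset (a subset of the frame) to a
    mass, and the masses of one function sum to 1.
    """
    combined = {}
    conflict = 0.0
    for (a, mass_a), (b, mass_b) in product(m1.items(), m2.items()):
        intersection = a & b
        if intersection:
            # Agreeing evidence supports the intersection of the two sets.
            combined[intersection] = combined.get(intersection, 0.0) + mass_a * mass_b
        else:
            # Disjoint sets contradict each other; track the conflicting mass.
            conflict += mass_a * mass_b
    if conflict >= 1.0:
        raise ValueError("total conflict: the evidence cannot be combined")
    # Renormalise so the remaining masses sum to 1.
    return {subset: mass / (1.0 - conflict) for subset, mass in combined.items()}


# Hypothetical evidence for a single text node (illustrative values only).
layout_evidence = {CONTENT: 0.5, NOISE: 0.3, THETA: 0.2}
topic_evidence = {CONTENT: 0.7, NOISE: 0.1, THETA: 0.2}

fused = combine(layout_evidence, topic_evidence)
for subset, mass in fused.items():
    print(sorted(subset), round(mass, 4))
```

With these assumed inputs, the fused mass on "content" is higher than either source assigns on its own, because the two sources mostly agree and the mass each leaves on the whole frame is redistributed. Mass assigned to the whole frame lets a feature express "no opinion" rather than forcing a content-or-noise decision.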

Metadata
Title: Automatic Web News Extraction Based on DS Theory Considering Content Topics
Authors: Kaihang Zhang, Chuang Zhang, Xiaojun Chen, Jianlong Tan
Copyright Year: 2018
DOI: https://doi.org/10.1007/978-3-319-93698-7_15
