Published in: Journal of Intelligent Information Systems 1/2017

18-03-2016

Multilingual news extraction via stopword language model scoring

Author: Yu-Chieh Wu

Abstract

Web news provides a quick and convenient means of building large document collections. Creating a web news corpus has typically required constructing a set of HTML parsing rules to identify the content text. In general, these parsing rules are written manually and must be tailored to each page layout. We address this issue and propose a news content recognition algorithm that is language and layout independent. Our method first scans a given HTML document and roughly localizes a set of candidate news areas. Next, it applies a purpose-built scoring function to rank the candidates and select the best content. To validate this approach, we evaluate the system's performance using 1092 items of multilingual web news data covering 17 global regions and 11 distinct languages. We compare our method with nine published content extraction systems on these data using standard settings. The results of this empirical study show that our method outperforms the second-best approach (Boilerpipe) by 6.04 % and 10.79 % in relative micro and macro F-measure, respectively. We also apply our system to monitor online RSS news distribution. It collected 0.4 million news articles from 200 RSS channels in 20 days. A quality test on a sample of this stream shows that our method achieved 93 % extraction accuracy for large news streams.
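To make the reported metrics concrete, the following is a minimal sketch of how micro and macro F-measures and a relative improvement can be computed for content extraction. It assumes token-level overlap between the extracted text and a gold-standard text, which is a common evaluation convention; the abstract does not state the exact protocol, and all function names here (`overlap_counts`, `f_measure`, `micro_macro_f`, `relative_gain`) are hypothetical.

```python
from collections import Counter


def overlap_counts(extracted, gold):
    """Return (matched, extracted_total, gold_total) token counts.

    Assumption: whitespace tokenization and multiset overlap; the paper's
    actual matching unit may differ.
    """
    ext = Counter(extracted.split())
    ref = Counter(gold.split())
    matched = sum((ext & ref).values())  # multiset intersection
    return matched, sum(ext.values()), sum(ref.values())


def f_measure(matched, extracted_total, gold_total):
    """Harmonic mean of precision and recall; 0.0 when nothing matches."""
    if matched == 0:
        return 0.0
    precision = matched / extracted_total
    recall = matched / gold_total
    return 2 * precision * recall / (precision + recall)


def micro_macro_f(pairs):
    """pairs: list of (extracted_text, gold_text), one per document."""
    counts = [overlap_counts(e, g) for e, g in pairs]
    # Macro: compute F per document, then average over documents.
    macro = sum(f_measure(*c) for c in counts) / len(counts)
    # Micro: pool the counts over all documents, then compute one F.
    micro = f_measure(*(sum(col) for col in zip(*counts)))
    return micro, macro


def relative_gain(ours, baseline):
    """Relative improvement over a baseline score, in percent."""
    return 100.0 * (ours - baseline) / baseline


if __name__ == "__main__":
    docs = [("a b c d", "a b c d"), ("x y", "x z w q")]
    micro, macro = micro_macro_f(docs)
    print(f"micro F = {micro:.4f}, macro F = {macro:.4f}")
    print(f"relative gain = {relative_gain(0.90, 0.75):.2f} %")
```

Macro averaging weights every document equally, whereas micro averaging weights every token equally, so long articles dominate the micro score. A relative gain is measured against the baseline's own score, not in absolute percentage points.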

Metadata
Title
Multilingual news extraction via stopword language model scoring
Author
Yu-Chieh Wu
Publication date
18-03-2016
Publisher
Springer US
Published in
Journal of Intelligent Information Systems / Issue 1/2017
Print ISSN: 0925-9902
Electronic ISSN: 1573-7675
DOI
https://doi.org/10.1007/s10844-016-0395-6
