Published in: Programming and Computer Software 5/2018

01.09.2018

Extraction of Data from Mass Media Web Sites

Authors: A. K. Yatskov, M. I. Varlamov, D. Yu. Turdakov

Abstract

To understand the current state and dynamics of the Internet information space, fast tools with broad coverage are needed for extracting data from mass media sites. However, far from all sites provide data syndication in the RSS format, and developing a specialized extraction tool for each Web site is costly. This paper proposes methods for automatically extracting news texts from arbitrary mass media sites. Classifying Web page types and then grouping their URLs improves the quality of news text extraction. A strategy is also proposed for traversing a site and detecting the pages that contain hyperlinks to news pages; it decreases the number of requests and reduces the load on the site.
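
The abstract does not say how URLs are grouped, so the following is only an illustrative sketch of one possible approach, not the authors' method. It assumes that pages of the same type on a site tend to share a URL path pattern. The helper names `url_template` and `group_urls`, the placeholder tokens, and the example URLs are all hypothetical.

```python
import re
from collections import defaultdict
from urllib.parse import urlparse


def url_template(url: str) -> str:
    """Reduce a URL to a coarse path pattern (assumed heuristic, not from the paper)."""
    segments = [s for s in urlparse(url).path.split("/") if s]
    generalized = []
    for seg in segments:
        if re.fullmatch(r"\d+", seg):
            # Purely numeric segments (dates, ids) become one placeholder.
            generalized.append("<num>")
        elif re.search(r"\d", seg) or len(seg) > 20:
            # Segments mixing text and digits, or very long ones, look like article slugs.
            generalized.append("<slug>")
        else:
            # Short alphabetic segments are kept: they usually name a site section.
            generalized.append(seg)
    return "/" + "/".join(generalized)


def group_urls(urls):
    """Group URLs that share the same path pattern."""
    groups = defaultdict(list)
    for url in urls:
        groups[url_template(url)].append(url)
    return dict(groups)


if __name__ == "__main__":
    sample = [
        "https://example.com/news/2018/09/01/some-story-12345",
        "https://example.com/news/2018/09/02/another-story-67890",
        "https://example.com/sections/politics",
        "https://example.com/about",
    ]
    for template, members in group_urls(sample).items():
        print(template, len(members))
```

Under this assumption, a group with many members and a slug-like final segment is a plausible candidate for news pages, while small groups with short fixed paths are more likely section or service pages. A page-type classifier of the kind the abstract mentions could then be applied per group rather than per page.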

Metadata
Title
Extraction of Data from Mass Media Web Sites
Authors
A. K. Yatskov
M. I. Varlamov
D. Yu. Turdakov
Publication date
01.09.2018
Publisher
Pleiades Publishing
Published in
Programming and Computer Software / Issue 5/2018
Print ISSN: 0361-7688
Electronic ISSN: 1608-3261
DOI
https://doi.org/10.1134/S0361768818050092
