2017 | OriginalPaper | Book chapter

Harvesting Forum Pages from Seed Sites

Written by: Luciano Barbosa

Published in: Web Engineering

Publisher: Springer International Publishing

Abstract

Web forums are rich sources of conversational content. Many applications, such as opinion mining and question answering, can benefit greatly from mining and exploring this content. A key step towards making it more easily available is to collect the conversational pages of forum sites, the so-called thread pages. In this paper, we propose a two-step crawling solution to the problem of collecting thread pages at large scale. First, since thread pages are located within forum sites, we propose an inter-site crawler that locates forum sites on the Web. To do so, the inter-site crawler focuses on the Web graph neighbourhood of forum sites and uses the content patterns of the links in this region to guide its visitation policy. Next, to collect thread pages within the discovered forum sites, we propose an intra-site crawler that finds thread pages by learning the context of the links that lead to them, and that detects thread pages from their content and structural features. Experimental results demonstrate that both the inter-site and the intra-site crawlers are effective and outperform their baselines.
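The two-step idea in the abstract can be illustrated with a minimal sketch. This is not the chapter's implementation: it runs on a tiny in-memory link graph instead of the live Web, and it replaces the learned link and page classifiers with hypothetical keyword and post-count rules. All site names, anchor texts, keywords, thresholds, and function names below are assumptions made up for illustration.

```python
import heapq
from typing import Callable, Dict, List, Tuple

Link = Tuple[str, str]  # (anchor text, target URL)

def best_first_crawl(seeds: List[str],
                     links: Dict[str, List[Link]],
                     score_link: Callable[[str, str], float],
                     is_target: Callable[[str], bool],
                     budget: int) -> List[str]:
    """Visit at most `budget` pages, always following the best-scored link next."""
    # heapq is a min-heap, so scores are stored negated.
    frontier = [(0.0, url) for url in seeds]
    heapq.heapify(frontier)
    seen = set(seeds)
    found: List[str] = []
    visited = 0
    while frontier and visited < budget:
        _, url = heapq.heappop(frontier)
        visited += 1
        if is_target(url):
            found.append(url)
        for anchor, target in links.get(url, []):
            if target not in seen:
                seen.add(target)
                heapq.heappush(frontier, (-score_link(anchor, target), target))
    return found

def keyword_scorer(keywords: Tuple[str, ...]) -> Callable[[str, str], float]:
    """Hypothetical stand-in for a learned link classifier: count keyword hits."""
    def score(anchor: str, target: str) -> float:
        text = f"{anchor} {target}".lower()
        return float(sum(k in text for k in keywords))
    return score

# Toy site-level graph used by the inter-site step (made-up data).
SITE_LINKS: Dict[str, List[Link]] = {
    "portal.example": [("community forum", "forum-a.example"),
                       ("weather", "news.example"),
                       ("discussion board", "forum-b.example")],
    "news.example": [("sports", "sports.example")],
}
FORUM_SITES = {"forum-a.example", "forum-b.example"}  # stand-in for a site classifier

# Toy page-level graph used by the intra-site step (made-up data).
PAGE_LINKS: Dict[str, List[Link]] = {
    "forum-a.example": [("thread: crawler tips", "forum-a.example/t/1"),
                        ("login", "forum-a.example/login"),
                        ("thread: seeds", "forum-a.example/t/2")],
    "forum-b.example": [("topic: hello", "forum-b.example/t/9"),
                        ("register", "forum-b.example/register")],
}
# Number of post-like blocks per page, standing in for content/structural features.
POST_COUNT = {"forum-a.example/t/1": 5, "forum-a.example/t/2": 3,
              "forum-b.example/t/9": 2}

def is_thread_page(url: str) -> bool:
    return POST_COUNT.get(url, 0) >= 2  # assumed threshold

# Step 1: inter-site crawl, locating forum sites from a seed.
forum_sites = best_first_crawl(["portal.example"], SITE_LINKS,
                               keyword_scorer(("forum", "board", "discussion")),
                               lambda url: url in FORUM_SITES, budget=3)

# Step 2: intra-site crawl, collecting thread pages inside each discovered site.
threads: List[str] = []
for site in forum_sites:
    threads += best_first_crawl([site], PAGE_LINKS,
                                keyword_scorer(("thread", "topic")),
                                is_thread_page, budget=3)

print(forum_sites)
print(threads)
```

With the small budgets used here, the prioritised frontier reaches both toy forum sites and all three toy thread pages without ever visiting the unrelated news, login, or registration pages, which is the kind of saving a focused visitation policy is meant to provide.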

Metadata
Title
Harvesting Forum Pages from Seed Sites
Written by
Luciano Barbosa
Copyright year
2017
DOI
https://doi.org/10.1007/978-3-319-60131-1_32