
2015 | OriginalPaper | Book chapter

A Method for Topic Detection in Great Volumes of Data

Authors: Flora Amato, Francesco Gargiulo, Alessandro Maisto, Antonino Mazzeo, Serena Pelosi, Carlo Sansone

Published in: Data Management Technologies and Applications

Publisher: Springer International Publishing


Abstract

Topic extraction has become increasingly important because of its effectiveness in many tasks, including information filtering, information retrieval, and the organization of document collections in digital libraries. Topic detection consists of finding the most significant topics within a document corpus. In this paper we explore the use of a feature reduction methodology to highlight those topics. Our approach applies a clustering algorithm (X-means) to the \(\textit{tf-idf}\) matrix computed from the corpus, which describes the frequency of the terms (columns) that occur in the documents (rows). To extract the topics, we build n binary problems, where n is the number of clusters produced by the unsupervised clustering step, and we apply supervised feature selection to each of them, taking the top-ranked features as the topic descriptors. We report results obtained on two corpora, both in Italian: the first consists of documents of the University of Naples Federico II, and the second is a collection of medical records.
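The pipeline outlined in the abstract can be illustrated with a minimal, dependency-free sketch. Several parts are assumptions made only for illustration and are not taken from the paper: the toy English corpus `DOCS`, the particular tf-idf weighting with L2-normalised rows, a plain k-means with a fixed `k` standing in for X-means (which would estimate the number of clusters itself), and a simple "mean weight inside the cluster minus mean weight outside" score standing in for the supervised feature selection step.

```python
import math
from collections import Counter

# Hypothetical toy corpus (the paper's corpora are Italian and not reproduced here).
DOCS = [
    "exam course student exam",
    "student course lecture",
    "patient therapy diagnosis patient",
    "diagnosis patient hospital",
]

def tfidf_matrix(docs):
    """Return (vocab, rows): one row per document, one column per term."""
    tokenized = [doc.lower().split() for doc in docs]  # assumes non-empty documents
    vocab = sorted({t for toks in tokenized for t in toks})
    df = Counter(t for toks in tokenized for t in set(toks))
    rows = []
    for toks in tokenized:
        counts = Counter(toks)
        row = [counts[t] / len(toks) * math.log(len(docs) / df[t]) for t in vocab]
        norm = math.sqrt(sum(x * x for x in row)) or 1.0  # L2-normalise each row
        rows.append([x / norm for x in row])
    return vocab, rows

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(rows, k, iters=20):
    """Plain k-means with fixed k: a stand-in for X-means, not X-means itself."""
    centroids = [rows[0]]  # farthest-first initialisation keeps the run deterministic
    while len(centroids) < k:
        centroids.append(max(rows, key=lambda r: min(sq_dist(r, c) for c in centroids)))
    labels = [0] * len(rows)
    for _ in range(iters):
        labels = [min(range(k), key=lambda j: sq_dist(r, centroids[j])) for r in rows]
        for j in range(k):
            members = [r for r, l in zip(rows, labels) if l == j]
            if members:
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels

def topic_descriptors(vocab, rows, labels, top_n=3):
    """For each cluster, solve a one-vs-rest binary problem and rank the terms."""
    topics = {}
    for c in sorted(set(labels)):
        inside = [r for r, l in zip(rows, labels) if l == c]
        outside = [r for r, l in zip(rows, labels) if l != c]
        scores = []
        for j, term in enumerate(vocab):
            mean_in = sum(r[j] for r in inside) / len(inside)
            mean_out = sum(r[j] for r in outside) / len(outside) if outside else 0.0
            scores.append((mean_in - mean_out, term))  # illustrative score only
        scores.sort(key=lambda s: (-s[0], s[1]))  # best first, ties alphabetical
        topics[c] = [term for _, term in scores[:top_n]]
    return topics

vocab, rows = tfidf_matrix(DOCS)
labels = kmeans(rows, k=2)
topics = topic_descriptors(vocab, rows, labels)
for cluster, terms in topics.items():
    print(cluster, terms)
```

On this toy corpus the two clusters separate the university-style documents from the medical ones, and the top-ranked terms of each binary problem act as that cluster's topic descriptors. In practice the scoring function would be replaced by a proper supervised feature selection criterion.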


Metadata
Copyright year
2015
DOI
https://doi.org/10.1007/978-3-319-25936-9_11