2017 | OriginalPaper | Book chapter

Integrating LDA with Clustering Technique for Relevance Feature Selection

Authors: Abdullah Semran Alharbi, Yuefeng Li, Yue Xu

Published in: AI 2017: Advances in Artificial Intelligence

Publisher: Springer International Publishing

Abstract

Selecting features from documents that describe user information needs is challenging due to the nature of text, where redundancy, synonymy, polysemy, noise and high dimensionality are common problems. The assumption that clustered documents describe only one topic can be too simplistic, given that most long documents discuss multiple topics. LDA-based models show significant improvement over cluster-based models in information retrieval (IR). However, the integration of both techniques for feature selection (FS) is still limited. In this paper, we propose an innovative and effective cluster- and LDA-based model for relevance FS. The model also integrates a new extended random set theory to generalise the LDA local weights for document terms. It can assign a more discriminative weight to terms based on their appearance in LDA topics and the clustered documents. The experimental results, based on the RCV1 dataset and TREC topics for information filtering (IF), show that our model significantly outperforms eight state-of-the-art baseline models on five standard performance measures.
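The abstract does not give the weighting formula, so the following is only a hypothetical sketch of the general idea of combining LDA local weights with document clusters. It assumes that the document-topic distributions, the topic-term distributions and the cluster assignments have already been computed elsewhere. The function name `cluster_topic_term_weights`, the averaging scheme and the toy numbers are illustrative assumptions, not the CBTM-ERS model itself.

```python
from collections import defaultdict


def cluster_topic_term_weights(docs, doc_topic, topic_term, doc_cluster):
    """Hypothetical term weighting (NOT the paper's formula).

    docs        : list of token lists, one per document
    doc_topic   : list of {topic: P(topic | doc)}, one dict per document
    topic_term  : {topic: {term: P(term | topic)}}
    doc_cluster : list of cluster ids, one per document
    """
    # Group document indices by cluster id.
    members = defaultdict(list)
    for d, c in enumerate(doc_cluster):
        members[c].append(d)

    weights = defaultdict(float)
    for doc_ids in members.values():
        for d in doc_ids:
            for term in set(docs[d]):
                # Local weight: sum over topics of P(topic | doc) * P(term | topic).
                local = sum(p_z * topic_term[z].get(term, 0.0)
                            for z, p_z in doc_topic[d].items())
                # Average within the cluster, then accumulate across clusters.
                weights[term] += local / len(doc_ids)
    return dict(weights)


# Toy inputs (made up for illustration only).
docs = [["stock", "market"], ["market", "bank"], ["football", "match"]]
doc_topic = [{0: 0.9, 1: 0.1}, {0: 0.8, 1: 0.2}, {0: 0.1, 1: 0.9}]
topic_term = {
    0: {"stock": 0.4, "market": 0.4, "bank": 0.2},
    1: {"football": 0.5, "match": 0.5},
}
doc_cluster = [0, 0, 1]

weights = cluster_topic_term_weights(docs, doc_topic, topic_term, doc_cluster)
ranked = sorted(weights, key=weights.get, reverse=True)
```

In this toy setup, a term that appears in several documents of the same cluster and has high probability in their dominant topic ("market") ends up with a larger weight than a term that appears in only one of those documents ("stock" or "bank").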

In this paper, terms, words, keywords or unigrams are used interchangeably.

We will refer to the proposed model from now on as CBTM-ERS, a Cluster-Based Topic Model using Extended Random Set.

http://trec.nist.gov/.

https://www.lemurproject.org/.

Title: Integrating LDA with Clustering Technique for Relevance Feature Selection
Authors: Abdullah Semran Alharbi, Yuefeng Li, Yue Xu
Publisher: Springer International Publishing
Book: AI 2017: Advances in Artificial Intelligence
Print ISBN: 978-3-319-63003-8
Electronic ISBN: 978-3-319-63004-5
Copyright year: 2017
DOI: https://doi.org/10.1007/978-3-319-63004-5_22
