
2017 | Original Paper | Book Chapter

Integrating LDA with Clustering Technique for Relevance Feature Selection

Written by: Abdullah Semran Alharbi, Yuefeng Li, Yue Xu

Published in: AI 2017: Advances in Artificial Intelligence

Publisher: Springer International Publishing


Abstract

Selecting features from documents that describe user information needs is challenging due to the nature of text, where redundancy, synonymy, polysemy, noise and high dimensionality are common problems. The assumption that clustered documents describe only one topic can be too simple, given that most long documents discuss multiple topics. LDA-based models show significant improvement over cluster-based models in information retrieval (IR). However, the integration of both techniques for feature selection (FS) is still limited. In this paper, we propose an innovative and effective cluster- and LDA-based model for relevance FS. The model also integrates a new extended random set theory to generalise the LDA local weights for document terms. It can assign a more discriminative weight to terms based on their appearance in LDA topics and the clustered documents. The experimental results, based on the RCV1 dataset and TREC topics for information filtering (IF), show that our model significantly outperforms eight state-of-the-art baseline models on five standard performance measures.
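The abstract does not state the weighting formula, so the following is only a minimal sketch of the general idea of scoring a term by both its LDA topic probabilities and its presence in clustered documents. It is not the paper's model or its extended random set formulation. The function name `term_weights`, the toy data, and the scoring rule (cluster-level document frequency multiplied by a cluster-averaged topic score) are all assumptions made for illustration, and the topic distributions are hand-written rather than fitted by an LDA implementation.

```python
from collections import defaultdict


def term_weights(docs, doc_topics, topic_terms, clusters):
    """Hypothetical scoring rule (an assumption, not the paper's formula).

    docs        -- list of token lists, one per document
    doc_topics  -- per-document topic proportions (one list per document)
    topic_terms -- per-topic dict mapping term -> probability
    clusters    -- cluster id of each document
    """
    n_docs = len(docs)
    n_topics = len(topic_terms)

    # Group document indices by cluster id.
    members = defaultdict(list)
    for i, c in enumerate(clusters):
        members[c].append(i)

    weights = defaultdict(float)
    for idxs in members.values():
        # Average topic proportions over the documents in this cluster.
        mean_theta = [
            sum(doc_topics[i][z] for i in idxs) / len(idxs)
            for z in range(n_topics)
        ]
        # Count how many documents of the cluster contain each term.
        df = defaultdict(int)
        for i in idxs:
            for term in set(docs[i]):
                df[term] += 1
        # Combine the cluster-level frequency with the topic-level score.
        for term, freq in df.items():
            topic_score = sum(
                mean_theta[z] * topic_terms[z].get(term, 0.0)
                for z in range(n_topics)
            )
            weights[term] += (freq / n_docs) * topic_score
    return dict(weights)


# Toy input: three documents, two topics, two clusters (all values invented).
docs = [["oil", "price"], ["oil", "market"], ["football", "match"]]
doc_topics = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9]]
topic_terms = [
    {"oil": 0.5, "price": 0.3, "market": 0.2},
    {"football": 0.6, "match": 0.4},
]
clusters = [0, 0, 1]

weights = term_weights(docs, doc_topics, topic_terms, clusters)
for term, w in sorted(weights.items(), key=lambda kv: -kv[1]):
    print(f"{term}: {w:.4f}")
```

In this toy setup a term that is probable under a cluster's dominant topic and also occurs in several of that cluster's documents ranks highest, which is one simple way to read the "more discriminative weight" described above.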


Footnotes
1
In this paper, terms, words, keywords or unigrams are used interchangeably.

2
We will refer to the proposed model from now on as CBTM-ERS, a Cluster-Based Topic Model using Extended Random Set.

Metadata
Title
Integrating LDA with Clustering Technique for Relevance Feature Selection
Written by
Abdullah Semran Alharbi
Yuefeng Li
Yue Xu
Copyright Year
2017
DOI
https://doi.org/10.1007/978-3-319-63004-5_22
