
2017 | Original Paper | Book chapter

An Evolutionary-Based Term Reduction Approach to Bilingual Clustering of Malay-English Corpora

Authors: Rayner Alfred, Leow Ching Leong, Joe Henry Obit

Published in: Advances in Information and Communication Technology

Publisher: Springer International Publishing


Abstract

Document clustering groups unstructured text documents into a predefined set of clusters in order to give users more information. Many studies have addressed the clustering of monolingual documents, and advances in current technology make bilingual clustering feasible as well. However, clustering bilingual documents still faces the same problem as monolingual document clustering: the “curse of dimensionality”. This motivates the study of term reduction techniques for clustering bilingual documents. The objective of this study is to examine the effects of reducing the number of terms considered when clustering a parallel bilingual corpus of English and Malay documents. A genetic algorithm (GA) is used to reduce the number of features selected, with single-point crossover at a crossover rate of 0.8. The study also assesses the effects of applying different mutation rates (e.g., 0.1 and 0.01) on the number of features selected for clustering bilingual documents. The results show that the GA improves the clustering mapping compared with the initial clustering mapping. They also show that a GA with a mutation rate of 0.01 produces better parallel clustering mapping results than a GA with a mutation rate of 0.1.
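
The GA-based term reduction described in the abstract can be illustrated with a minimal Python sketch. Only the single-point crossover, the crossover rate of 0.8, and the two mutation rates come from the abstract. Everything else is an assumption made for illustration: the bitmask encoding of selected terms, tournament selection, elitism, the population size, the number of generations, and the placeholder `toy_fitness` function (the abstract does not state the fitness measure used by the authors).

```python
import random

CROSSOVER_RATE = 0.8           # stated in the abstract
MUTATION_RATES = (0.1, 0.01)   # the two rates compared in the abstract


def single_point_crossover(parent_a, parent_b, rng, rate=CROSSOVER_RATE):
    """With probability `rate`, swap the tails of two bitmasks at one cut point."""
    if len(parent_a) < 2 or rng.random() >= rate:
        return parent_a[:], parent_b[:]
    cut = rng.randrange(1, len(parent_a))
    return parent_a[:cut] + parent_b[cut:], parent_b[:cut] + parent_a[cut:]


def mutate(mask, rng, rate):
    """Flip each bit independently with probability `rate`."""
    return [bit ^ 1 if rng.random() < rate else bit for bit in mask]


def tournament(population, fitness, rng):
    """Assumed selection scheme: the fitter of two randomly drawn individuals."""
    a, b = rng.sample(population, 2)
    return a if fitness(a) >= fitness(b) else b


def evolve(fitness, n_terms, mutation_rate, pop_size=20, generations=30, seed=0):
    """Evolve a bitmask over `n_terms` terms; bit i == 1 means term i is kept."""
    rng = random.Random(seed)
    population = [[rng.randint(0, 1) for _ in range(n_terms)]
                  for _ in range(pop_size)]
    best = max(population, key=fitness)
    for _ in range(generations):
        children = [best[:]]  # elitism (assumption): carry the best mask over
        while len(children) < pop_size:
            c1, c2 = single_point_crossover(
                tournament(population, fitness, rng),
                tournament(population, fitness, rng),
                rng,
            )
            children.append(mutate(c1, rng, mutation_rate))
            if len(children) < pop_size:
                children.append(mutate(c2, rng, mutation_rate))
        population = children
        best = max(population, key=fitness)
    return best


def toy_fitness(mask):
    """Hypothetical stand-in for a clustering-quality score:
    rewards keeping the first five terms and penalises every other kept term."""
    return sum(mask[:5]) - 0.1 * sum(mask[5:])


if __name__ == "__main__":
    for rate in MUTATION_RATES:
        best = evolve(toy_fitness, n_terms=30, mutation_rate=rate)
        print(rate, sum(best), round(toy_fitness(best), 2))
```

In an actual experiment, `toy_fitness` would be replaced by a score computed from the clustering of the English and Malay documents using only the terms kept by the mask. The sketch does not reproduce the chapter's results; it only shows where the crossover and mutation rates enter the search.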

Metadata
Title
An Evolutionary-Based Term Reduction Approach to Bilingual Clustering of Malay-English Corpora
Authors
Rayner Alfred
Leow Ching Leong
Joe Henry Obit
Copyright year
2017
DOI
https://doi.org/10.1007/978-3-319-49073-1_16
