2015 | OriginalPaper | Book chapter

A Semantic Text Retrieval for Indonesian Using Tolerance Rough Sets Models

Authors: Gloria Virginia, Hung Son Nguyen

Published in: Transactions on Rough Sets XIX

Publisher: Springer Berlin Heidelberg

Abstract

Research on the Tolerance Rough Sets Model (TRSM) conducted so far has followed the rational approach of the AI perspective. This article presents studies that take the contrary path, i.e. a cognitive approach, with the objective of a modular framework for a semantic text retrieval system based on TRSM, specifically for Indonesian. In addition to the proposed framework, this article proposes three methods based on TRSM: an automatic tolerance value generator, thesaurus optimization, and lexicon-based document representation. All methods were developed using our own corpus, namely ICL-corpus, and evaluated on an available Indonesian corpus, called Kompas-corpus. The goal of a semantic information retrieval system is to retrieve information and not merely terms with similar meaning. This article is a small step toward that objective.

Key statistical highlights: ITU data release June 2012. URL: http://www.itu.int. Accessed on 25 October 2012.

Utterances may include sounds, marks, gestures, grunts, and groans (anything that can signal an intention).

The reason is that, in the context of speech acts, we are not concerned with whether the speaker's belief is true, but with the speaker's intention, i.e. what he/she wants to represent by his/her utterance. Thus, a speaker may represent a false belief as a true belief to the audience, e.g. a speaker utters ‘it is raining’ when in fact it is a sunny day.

In other words, ‘the mind to fit the world’. This is because a belief, like a statement, can be true or false; if the statement is false, it is the fault of the statement, not the world. The world-to-mind direction of fit applies to psychological modes such as desire or promise; if the promise is broken, it is the fault of the promiser.

BPS-Statistics Indonesia. URL: http://www.bps.go.id/. Accessed on 25 October 2012.

July 2012 estimation of The World Factbook. URL: https://www.cia.gov. Accessed on 25 October 2012.

Portal Nasional Indonesia (National Portal of Indonesia). URL: http://www.indonesia.go.id. Accessed on 25 October 2012.

Key statistical highlights: International Telecommunication Union (ITU) data release June 2012. URL: http://www.itu.int. Accessed on 25 October 2012.

URL: http://www.internetworldstats.com. Accessed on 25 October 2012.

The graph was taken from the International Telecommunication Union (ITU). URL: http://www.itu.int/ITU-D/ict/statistics/explorer/index.html. Accessed on 25 October 2012.

Appendix A provides an explanation about the TF*IDF weighting scheme.
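Appendix A is not reproduced on this page, so the exact TF*IDF variant used in the article is not known here. As a minimal sketch, assuming the common form weight = tf × log(N / df) with raw term frequency, the weighting could look as follows. The function name `tfidf_weights` and the toy documents are illustrative assumptions, not taken from the article.

```python
import math
from collections import Counter


def tfidf_weights(docs):
    """Return one {term: weight} dict per tokenized document.

    Assumed variant: weight = tf * log(N / df), where tf is the raw term
    frequency, N the number of documents and df the number of documents
    that contain the term.
    """
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weights


# Toy documents (made up for illustration).
docs = [
    ["paduan", "suara", "konser"],
    ["paduan", "suara", "partitur"],
    ["konser", "musik", "musik"],
]
weights = tfidf_weights(docs)
```

Under this variant, a term that occurs in every document receives weight 0, while a term concentrated in few documents is weighted up.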

Cognitive modeling is an approach employed in cognitive science (CS). Cognitive science is an interdisciplinary study of mental representations and computations and of the physical systems that support those processes [18, p. xv].

An explanation of all corpora used in this article is available in Appendix C.

TREC is a forum for the IR community that provides the infrastructure necessary to evaluate an IR system on a broad range of problems. URL: http://trec.nist.gov/.

Appendix B explains the Cosine similarity measure as a document ranking algorithm.
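Since Appendix B is not part of this page, the following is only a sketch of the standard cosine measure applied to sparse term-weight vectors, and of ranking documents by it. The names `cosine` and `rank` and the toy vectors are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two sparse vectors given as {term: weight} dicts."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # an empty vector is treated as similar to nothing
    return dot / (norm_u * norm_v)


def rank(query, documents):
    """Return document ids sorted by decreasing cosine similarity to the query."""
    scores = {doc_id: cosine(query, vec) for doc_id, vec in documents.items()}
    return sorted(scores, key=scores.get, reverse=True)


# Toy vectors (made up for illustration).
documents = {
    "d1": {"paduan": 1.0, "suara": 1.0},
    "d2": {"konser": 2.0, "musik": 1.0},
    "d3": {"paduan": 1.0, "konser": 1.0},
}
query = {"paduan": 1.0, "suara": 1.0}
ranking = rank(query, documents)
```

Here `d1` shares both query terms, `d3` shares one, and `d2` shares none, so the ranking places them in that order.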

Consistent with VSM, GVSM interprets index term vectors as linearly independent; however, they are not orthogonal.

ICL-corpus consists of 1,000 documents taken from an Indonesian choral mailing list, while WORDS-corpus consists of 1,000 documents created from ICL-corpus in an annotation process conducted by human experts. Further explanation of these corpora is available in Appendix C.1.

We collaborated with three choral experts during the annotation process. Their backgrounds are described in Appendix C.3.

We used the CS stemmer and Vega’s stopword list in all of the studies presented in this article.

Please see Appendix C.1 for an explanation of the annotation process.

Please see Appendix C.1.

The smaller the tolerance classes, the smaller the upper sets, and vice versa.
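The relation between tolerance classes and upper sets can be illustrated with a small sketch. It assumes the formulation commonly used in the TRSM literature (e.g. [14]): the tolerance class of a term contains the term itself plus every term that co-occurs with it in at least θ documents, and the upper approximation of a document contains every term whose tolerance class overlaps the document. The article's own definitions are not shown on this page, so the function names and toy corpus below are illustrative assumptions.

```python
from collections import Counter
from itertools import combinations


def tolerance_classes(docs, theta):
    """Map each term to its tolerance class under co-occurrence threshold theta."""
    cooc = Counter()
    for doc in docs:
        # Count each unordered term pair once per document.
        for a, b in combinations(sorted(set(doc)), 2):
            cooc[(a, b)] += 1
    vocab = {t for doc in docs for t in doc}
    classes = {t: {t} for t in vocab}  # every term tolerates itself
    for (a, b), n in cooc.items():
        if n >= theta:
            classes[a].add(b)
            classes[b].add(a)
    return classes


def upper_approximation(doc, classes):
    """Terms whose tolerance class shares at least one term with the document."""
    doc_terms = set(doc)
    return {t for t, cls in classes.items() if cls & doc_terms}


# Toy corpus (made up for illustration).
docs = [
    ["paduan", "suara", "konser"],
    ["paduan", "suara", "partitur"],
    ["paduan", "suara", "musik"],
    ["konser", "musik"],
]
upper_theta_1 = upper_approximation(docs[3], tolerance_classes(docs, 1))
upper_theta_2 = upper_approximation(docs[3], tolerance_classes(docs, 2))
```

In this toy corpus, raising θ from 1 to 2 shrinks the tolerance class of `paduan` from five terms to two, and the upper approximation of the last document shrinks from four terms to the two terms it already contains.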

These values are for the process that includes the stemming task.

Most of the foreign terms were English.

It comes from the English term workshop and the Indonesian suffix -nya.

An inverted index was used for document representations in all experiments in this article.
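As a rough illustration of the data structure, an inverted index maps each term to the documents that contain it. The sketch below is a plain in-memory version with hypothetical names (`build_inverted_index`, `search_all`) and made-up documents; it is not the Lucene API and does not reflect how the article's index was built.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to a sorted list of ids of documents that contain it."""
    postings = defaultdict(set)
    for doc_id, tokens in docs.items():
        for term in tokens:
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}


def search_all(index, terms):
    """Ids of documents that contain every query term (boolean AND)."""
    sets = [set(index.get(t, ())) for t in terms]
    return sorted(set.intersection(*sets)) if sets else []


# Toy documents (made up for illustration).
docs = {
    1: ["paduan", "suara", "konser"],
    2: ["paduan", "suara", "partitur"],
    3: ["konser", "musik"],
}
index = build_inverted_index(docs)
hits = search_all(index, ["paduan", "konser"])
```

Looking up a term costs one dictionary access, which is why this layout suits retrieval better than scanning every document.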

It is an open source project implemented in Java and licensed under the liberal Apache Software License [40]. We used Lucene 3.1.0 in our study. URL for download: http://lucene.apache.org/core/downloads.html.

JAMA has been developed by the MathWorks and NIST. It provides user-level classes for constructing and manipulating real, dense matrices. We used JAMA 1.0.2 in this study. URL: http://math.nist.gov/javanumerics/jama/.

We used trec_eval.9.0, which is publicly available at http://trec.nist.gov/trec_eval/.

WORDS-corpus is generated from ICL-corpus; hence they belong to a single domain.

Base method means that we employed only the TF*IDF weighting scheme, without the TRSM implementation.

Please see Appendix C.2.

An explanation of Cosine as a document ranking measure is available in Appendix B.

In fact, we found the same results for ICL_1000 and ICL_1000 + WORDS_1000 in all calculations we made, such as R-Precision, Precision@10, Precision@20, and Precision@30.
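For readers unfamiliar with these measures, the sketch below follows their standard definitions: Precision@k is the fraction of the top k ranked documents that are relevant, and R-Precision is the precision at rank R, where R is the number of relevant documents. The function names and the toy ranking are hypothetical; the article's figures were computed with trec_eval, not with this code.

```python
def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k ranked documents that are relevant."""
    top = ranked[:k]
    return sum(1 for d in top if d in relevant) / k


def r_precision(ranked, relevant):
    """Precision at rank R, where R is the number of relevant documents."""
    r = len(relevant)
    return precision_at_k(ranked, relevant, r) if r else 0.0


# Toy ranking and relevance judgments (made up for illustration).
ranked = ["d3", "d1", "d7", "d2", "d5"]
relevant = {"d1", "d2", "d9"}
p_at_2 = precision_at_k(ranked, relevant, 2)
r_prec = r_precision(ranked, relevant)
```

With these toy judgments, one of the top two documents is relevant, and one of the top three (R = 3) is relevant.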

It is an Indonesian lexicon created by the University of Indonesia and described in a 1996 study by Nazief and Adriani [43]; it consists of 29,337 Indonesian root words. The lexicon has been used in other studies [10, 38].

KBBI is a dictionary copyrighted by Pusat Bahasa (in English: Language Center), Indonesian Ministry of Education, which consists of 27,828 root words.

The index terms of the thesaurus are single terms; hence we chose the term partitur as the representative of the karya musik concept.

Figure 35 serves as a basis for the choice of \(\theta \) values in which the TRSM-representation, LEX-representation, TRSM-representation, and TFIDF-representation outperform the other representations at \(\theta \) = 2, \(\theta \) = 8, \(\theta \) = 41, and \(\theta \) = 88 in respective order. However, particularly at \(\theta \) = 88, the TFIDF-representation only performs better than the LEX-representation.

The base model means that we employed the TF*IDF weighting scheme without the TRSM implementation or the mapping process.

Kompas-corpus is a TREC-like Indonesian testbed which is composed of 3,000 newswire articles and is accompanied by 20 topics. Please see Appendix C.4 for more explanation.

Big data is a term to describe the enormity of data, both structured and unstructured, in volume, velocity, and variety [45].

Please see Appendix C.4 for more explanation about Kompas-corpus.

Indonesian Wikipedia: http://id.wikipedia.org/wiki/Halaman_Utama.

DBpedia is a community project that was started and is administered by research groups from Universität Leipzig, Freie Universität Berlin, and OpenLink Software. The project is an effort to extract information from Wikipedia, make this information available on the Web under an open license, and interlink the DBpedia dataset with other open datasets on the Web. The Indonesian short abstracts of DBpedia were downloaded from http://downloads.dbpedia.org/3.7/id/.

Kompas. URL: http://www.kompas.com.

1. Büttcher, S., Clarke, C.L.A., Cormack, G.V.: Information Retrieval: Implementing and Evaluating Search Engines. MIT Press, Cambridge (2010)

2. Weiss, S.M., Indurkhya, N., Zhang, T., Damerau, F.J.: Text Mining - Predictive Methods for Analyzing Unstructured Information. Springer, New York (2005)

3. Eifring, H., Theil, R.: Linguistics for Students of Asian and African Languages (2005)

4. Grandy, R.E., Warner, R.: Paul Grice. http://plato.stanford.edu/entries/grice/, May 2006. Accessed 02 Oct 2012

5. Searle, J.R.: Intentionality: An Essay in the Philosophy of Mind. Cambridge University Press, Cambridge (1983)

6. Grice, H.P.: Studies in the Way of Words. Harvard University Press, Cambridge (1989)

7. Haugh, M., Jaszczolt, K.M.: Speaker intentions and intentionality. In: Allan, K., Jaszczolt, K.M. (eds.) The Cambridge Handbook of Pragmatics, pp. 87–112. Cambridge University Press, Cambridge (2012)

8. Akand, M.: Grice and Searle on meaning. Copula - J. Philos. Dept XXVIII, 51–58 (2011)

9. Adriani, M., Manurung, R.: A survey of bahasa Indonesia NLP research conducted at the University of Indonesia. In: Proceedings of the 2nd International MALINDO Workshop (2008)

10. Asian, J.: Effective techniques for Indonesian text retrieval. Ph.D. thesis, School of Computer Science and Information Technology, RMIT University (March 2007)

11. Asian, J., Williams, H.E., Tahaghoghi, S.M.M.: A testbed for Indonesian text retrieval. In: Bruza, P., Moffat, A., Turpin, A. (eds.) ADCS, pp. 55–58. University of Melbourne, Department of Computer Science (2004)

12. Sneddon, J.: The Indonesian Language: Its History and Role in Modern Society. UNSW Press, Sydney (2003)

13. Kawasaki, S., Nguyen, N.B., Ho, T.-B.: Hierarchical document clustering based on tolerance rough set model. In: Zighed, D.A., Komorowski, J., Żytkow, J.M. (eds.) PKDD 2000. LNCS (LNAI), vol. 1910, pp. 458–463. Springer, Heidelberg (2000)

14. Ho, T.B., Nguyen, N.B.: Nonhierarchical document clustering based on a tolerance rough set model. Int. J. Intell. Syst. 17(2), 199–212 (2002)

15. Nguyen, H.S., Ho, T.B.: Rough document clustering and the internet. In: Handbook of Granular Computing, pp. 987–1003. Wiley, Hoboken (2008)

16. Wu, Y., Ding, Y., Wang, X., Xu, J.: On-line hot topic recommendation using tolerance rough set based topic clustering. J. Comput. 5, 549–556 (2010)

17. Gaoxiang, Y., Heping, H., Zhengding, L., Ruixuan, L.: A novel web query automatic expansion based on rough set. Wuhan Univ. J. Nat. Sci. 11(5), 1167–1171 (2006)

18. Bly, B.M., Rumelhart, D.E. (eds.): Cognitive Science: Handbook of Perception and Cognition, 2nd edn. Academic Press, Millbrae (1999)

19. Russell, S., Norvig, P.: Artificial Intelligence: A Modern Approach, 3rd edn. Pearson Education Inc., Upper Saddle River (2010)

20. Voorhees, E.M., Harman, D.: Overview of the ninth text retrieval conference (TREC-9). In: Proceedings of the Ninth Text Retrieval Conference (TREC-9), National Institute of Standards and Technology (NIST), pp. 1–14 (2000)

21. Baeza-Yates, R., Ribeiro-Neto, B.: Modern Information Retrieval. ACM Press, New York (1999)

22. Chomsky, N.: Language and Mind, 3rd edn. Cambridge University Press, New York (2006)

23. Furnas, G.W., Deerwester, S., Dumais, S.T., Landauer, T.K., Harshman, R.A., Streeter, L.A., Lochbaum, K.E.: Information retrieval using a singular value decomposition model of latent semantic structure. In: Proceedings of the 11th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. SIGIR 1988, New York, NY, USA, pp. 465–480. ACM (1988)

24. Grossman, D.A., Frieder, O.: Information Retrieval: Algorithms and Heuristics, 2nd edn. Springer, Netherlands (2004)

25. Gabrilovich, E., Markovitch, S.: Computing semantic relatedness using Wikipedia-based explicit semantic analysis. In: Proceedings of the 20th International Joint Conference on Artificial Intelligence. IJCAI 2007, San Francisco, CA, USA, pp. 1606–1611. Morgan Kaufmann Publishers Inc (2007)

26. Gottron, T., Anderka, M., Stein, B.: Insights into explicit semantic analysis. In: Proceedings of the 20th ACM International Conference on Information and Knowledge Management. CIKM 2011, New York, NY, USA, pp. 1961–1964. ACM (2011)

27. Wong, S.K.M., Ziarko, W., Wong, P.C.N.: Generalized vector spaces model in information retrieval. In: Proceedings of the 8th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. SIGIR 1985, New York, NY, USA, pp. 18–25. ACM (1985)

28. Nguyen, S.H., Świeboda, W., Jaśkiewicz, G.: Extended document representation for search result clustering. In: Bembenik, R., Skonieczny, L., Rybiński, H., Niezgodka, M. (eds.) Intelligent Tools for Building a Scient. Info. Plat. SCI, vol. 390, pp. 77–95. Springer, Heidelberg (2012)

29. Nguyen, S.H., Jaśkiewicz, G., Świeboda, W., Nguyen, H.S.: Enhancing search result clustering with semantic indexing. In: Proceedings of the Third Symposium on Information and Communication Technology. SoICT 2012, New York, NY, USA, pp. 71–80. ACM (2012)

30. Szczuka, M., Janusz, A., Herba, K.: Semantic clustering of scientific articles with use of DBpedia knowledge base. In: Bembenik, R., Skonieczny, L., Rybiński, H., Niezgodka, M. (eds.) Intelligent Tools for Building a Scient. Info. Plat. SCI, vol. 390, pp. 61–76. Springer, Heidelberg (2012)

31. Pawlak, Z.: Rough sets. Int. J. Comput. Inf. Sci. 11(5), 341–356 (1982)

32. Komorowski, J., Pawlak, Z., Polkowski, L., Skowron, A.: Rough Sets: A Tutorial, pp. 3–98. Springer, Singapore (1998)

33. Pawlak, Z.: Some issues on rough sets. In: Peters, J.F., Skowron, A., Grzymała-Busse, J.W., Kostek, B., Swiniarski, R.W., Szczuka, M.S. (eds.) Transactions on Rough Sets I. LNCS, vol. 3100, pp. 1–58. Springer, Heidelberg (2004)

34. Skowron, A., Stepaniuk, J.: Tolerance approximation spaces. Fundam. Inf. 27, 245–253 (1996)

35. Lassila, O., McGuinness, D.: The role of frame-based representation on the semantic web. Technical report, Knowledge System Laboratory, Stanford University (2001)

36. Virginia, G., Nguyen, H.S.: Lexicon-based document representation. Fundamenta Informaticae 124, 27–45 (2013, to appear)

37. Vega, V.B.: Information retrieval for the Indonesian language. Master’s thesis, National University of Singapore, Unpublished (2001)

38. Adriani, M., Asian, J., Nazief, B., Tahaghoghi, S.M.M., Williams, H.E.: Stemming Indonesian: a confix-stripping approach. ACM Trans. Asian Lang. Inf. Process. 6, 1–33 (2007)

39. Manning, C.D., Raghavan, P., Schütze, H.: Introduction to Information Retrieval. Cambridge University Press, New York (2008)

40. McCandless, M., Hatcher, E., Gospodnetić, O.: Lucene in Action. Manning Publications Co., Greenwich (2010)

41. Virginia, G., Nguyen, H.S.: An algorithm for tolerance value generator in tolerance rough sets model. In: Na, M.G., Toro, C., Posada, J., Howlett, R.J., Jain, L.C. (eds.) Advances in Knowledge-Based and Intelligent Information and Engineering Systems. KES 2012, Netherlands, pp. 595–604. IOS Press (2012)

42. Golub, G.H., Van Loan, C.F.: Matrix Computations, 3rd edn. Johns Hopkins University Press, Baltimore (1996)

43. Adriani, M., Nazief, B.: Confix-Stripping: Approach to Stemming Algorithm for Bahasa Indonesia. Internal Publication, Depok (1996)

44. Obadi, G., Dráždilová, P., Hlaváček, L., Martinovič, J., Snášel, V.: A tolerance rough set based overlapping clustering for the DBLP data. In: Proceedings of the 2010 IEEE/WIC/ACM International Conference on Web Intelligence and International Conference on Intelligent Agent Technology - Workshops. WI-IAT 2010, vol. 3, pp. 57–60. IEEE (2010)

45. Troester, M.: Big data meets big data analytics. http://www.sas.com/resources/whitepaper/wp_46345.pdf (2012). SAS Institute Inc. Accessed 22 Feb 2013

46. Ingwersen, P.: Information Retrieval Interaction, 1st edn. Taylor Graham, London (1992)

47. Salton, G., Buckley, C.: Term-weighting approaches in automatic text retrieval. Inf. Process. Manage. 24(5), 513–523 (1988)

48. Manola, F., Miller, E.: RDF primer. http://www.w3.org/TR/2004/REC-rdf-primer-20040210/ (2004). W3C. Accessed 12 Jan 2013

Title: A Semantic Text Retrieval for Indonesian Using Tolerance Rough Sets Models
Authors: Gloria Virginia, Hung Son Nguyen
Publisher: Springer Berlin Heidelberg
Book: Transactions on Rough Sets XIX
Print ISBN: 978-3-662-47814-1
Electronic ISBN: 978-3-662-47815-8
Copyright year: 2015
DOI: https://doi.org/10.1007/978-3-662-47815-8_9
