2014 | Original Paper | Book Chapter

Semantic Compression for Text Document Processing

Written by: Dariusz Ceglarek

Published in: Transactions on Computational Collective Intelligence XIV

Publisher: Springer Berlin Heidelberg

Abstract

Ongoing research on novel methods and tools for Natural Language Processing tasks has resulted in the design of a semantic compression mechanism. Semantic compression is a technique that allows for correct generalization of terms in a given context. Thanks to this generalization, a common thought can be detected. The rules governing the generalization process are based on a data structure referred to as a domain frequency dictionary. Once the domain of a given text fragment has been established, disambiguating among possibly many hypernyms becomes a feasible task. Semantic compression, that is, informed generalization, is made possible by using semantic networks as the knowledge representation structure. Semantic compression allows for a number of improvements over established Natural Language Processing techniques. These improvements, along with a detailed discussion of the algorithms and data structures needed to make semantic compression a viable solution, are the core of this work. Semantic compression can be applied in a variety of scenarios, e.g. in plagiarism detection. As more effort has been spent on developing semantic compression, new domains of application have been discovered. Moreover, semantic compression itself has evolved and has been refined by new solutions that improve disambiguation efficiency. Because existing data sources have been remodeled to suit the algorithms that enable semantic compression, it can now serve as a basis for automata that explore hypernym-hyponym and synonym relations to discover new concepts that may be included in the knowledge representation structures.
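
The idea of domain-informed generalization can be illustrated with a minimal Python sketch. It is not the chapter's algorithm: the toy semantic network `HYPERNYMS`, the toy domain frequency dictionary `DOMAIN_FREQ`, the threshold `min_freq`, and the function names `generalize` and `compress` are all assumptions made for illustration. The sketch replaces a rare term with a hypernym and, when several hypernyms are available, picks the one that is most frequent in the already established domain.

```python
from typing import Dict, List

# Toy semantic network: term -> candidate hypernyms (hypothetical data).
HYPERNYMS: Dict[str, List[str]] = {
    "jaguar": ["cat", "car"],
    "cat": ["animal"],
    "car": ["vehicle"],
    "sedan": ["car"],
}

# Toy domain frequency dictionary: domain -> term -> frequency (hypothetical data).
DOMAIN_FREQ: Dict[str, Dict[str, int]] = {
    "zoology": {"animal": 90, "cat": 60, "jaguar": 5, "car": 1, "vehicle": 1},
    "automotive": {"vehicle": 80, "car": 70, "sedan": 8, "jaguar": 6,
                   "cat": 1, "animal": 1},
}


def generalize(term: str, domain: str, min_freq: int) -> str:
    """Climb the hypernym relation until the term is frequent enough in the domain."""
    freq = DOMAIN_FREQ[domain]
    seen = {term}  # guards against cycles in the network
    while freq.get(term, 0) < min_freq:
        candidates = [h for h in HYPERNYMS.get(term, []) if h not in seen]
        if not candidates:
            break  # no more general concept is known; keep the current term
        # Disambiguation step: prefer the hypernym most frequent in this domain.
        term = max(candidates, key=lambda h: freq.get(h, 0))
        seen.add(term)
    return term


def compress(tokens: List[str], domain: str, min_freq: int = 50) -> List[str]:
    """Apply the generalization to every token of a text fragment."""
    return [generalize(t, domain, min_freq) for t in tokens]


if __name__ == "__main__":
    print(compress(["jaguar", "sedan"], "automotive"))
    print(compress(["jaguar"], "zoology"))
```

With this toy data, "jaguar" and "sedan" both generalize to "car" in the automotive domain, while "jaguar" generalizes to "cat" in the zoology domain. A real system would draw the hypernym relation from a semantic network such as WordNet and the frequencies from a domain corpus.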

Title: Semantic Compression for Text Document Processing
Author: Dariusz Ceglarek
Publisher: Springer Berlin Heidelberg
Book: Transactions on Computational Collective Intelligence XIV
Print ISBN: 978-3-662-44508-2
Electronic ISBN: 978-3-662-44509-9
Copyright year: 2014
DOI: https://doi.org/10.1007/978-3-662-44509-9_2
