2014 | OriginalPaper | Book Chapter

Concept Based Clustering of Documents with Missing Semantic Information

Authors: E. Anupriya, N. Ch. S. N. Iyengar

Published in: Intelligent Computing, Networking, and Informatics

Publisher: Springer India

Abstract

Today, every new document added to the Web is augmented with semantic information (i.e., information about the content) that identifies the class of the document. This information is added as keywords, is implicitly available from structural elements such as the title and body text, or is added as objects and their relationships (rich data formats). However, documents added to the Web five or ten years ago do not contain semantic information. The objective of this paper is to cluster documents with missing semantic information. This is done by adopting a frequent term-based method that exploits the lexical and structural relations between keywords in the document. The similarity histogram clustering algorithm is used to cluster the documents after deriving semantic information on the concepts that identify the class of each document. The results show that concept-based clustering performs well compared to the statistical clustering method k-means, but suffers from the problem of selecting a proper subset of frequent terms.
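
To make the clustering step named in the abstract more concrete, the following is a minimal sketch of incremental clustering driven by a similarity histogram. It is not the authors' implementation. It assumes that each document has already been reduced to a set of frequent terms, uses Jaccard overlap as a stand-in similarity measure, and assigns each document to the first cluster that accepts it. The function names and the values of `sim_threshold`, `hr_min`, and `eps` are hypothetical choices made for illustration.

```python
def jaccard(a, b):
    """Jaccard overlap between two term sets (assumed stand-in similarity)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def histogram_ratio(sims, threshold):
    """Fraction of pairwise similarities in a cluster at or above the threshold."""
    if not sims:
        return 1.0  # a single-document cluster is treated as fully coherent
    return sum(s >= threshold for s in sims) / len(sims)


def shc_cluster(docs, sim_threshold=0.3, hr_min=0.5, eps=0.1):
    """Assign each document to the first cluster whose coherence it preserves.

    docs: list of term sets. Returns a list of clusters, each a list of
    document indices. All parameter defaults are illustrative assumptions.
    """
    clusters = []  # each cluster: {"members": [indices], "sims": [pairwise sims]}
    for i, doc in enumerate(docs):
        placed = False
        for c in clusters:
            # Similarities between the candidate and every current member.
            new_sims = [jaccard(doc, docs[j]) for j in c["members"]]
            hr_old = histogram_ratio(c["sims"], sim_threshold)
            hr_new = histogram_ratio(c["sims"] + new_sims, sim_threshold)
            # Accept if coherence does not drop, or drops only slightly
            # while staying above the minimum allowed ratio.
            if hr_new >= hr_old or (hr_new >= hr_min and hr_old - hr_new < eps):
                c["members"].append(i)
                c["sims"].extend(new_sims)
                placed = True
                break
        if not placed:
            clusters.append({"members": [i], "sims": []})
    return [c["members"] for c in clusters]


if __name__ == "__main__":
    docs = [
        {"cluster", "document", "term"},
        {"cluster", "document", "concept"},
        {"engine", "motor", "fuel"},
        {"engine", "fuel", "car"},
    ]
    print(shc_cluster(docs))
```

In this sketch a cluster's coherence is summarized by the share of its pairwise similarities that reach `sim_threshold`, and a document joins a cluster only if that share does not fall, or falls by less than `eps` while remaining at least `hr_min`. Because the outcome depends on which terms represent each document, a poorly chosen subset of frequent terms directly changes the pairwise similarities, which is consistent with the limitation the abstract reports.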

Metadata
Title
Concept Based Clustering of Documents with Missing Semantic Information
Authors
E. Anupriya
N. Ch. S. N. Iyengar
Copyright Year
2014
Publisher
Springer India
DOI
https://doi.org/10.1007/978-81-322-1665-0_57