
2018 | OriginalPaper | Book chapter

A Novel Map-Reduce Based Augmented Clustering Algorithm for Big Text Datasets

Authors: K. V. Kanimozhi, M. Venkatesan

Published in: Data Engineering and Intelligent Computing

Publisher: Springer Singapore

Abstract

Text clustering is a well-known technique for improving quality in information retrieval. Real-world data is rarely organized in the way precise mining requires, so a large collection of unstructured text documents must first be organized into clusters of related documents. Extracting compact and meaningful insights from such collections remains a contemporary challenge. Although many frequent item mining algorithms have been proposed, most do not scale to “Big Data” and require long processing times. This paper presents a highly scalable, fast, and efficient map-reduce based augmented clustering algorithm that uses bivariate n-gram frequent items to reduce high dimensionality and derive high-quality clusters for big text documents. A comparative analysis on sample text datasets shows that the proposed algorithm performs better with stop word removal than without it.
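The abstract does not give the algorithm's details, so the following is only a minimal single-machine sketch of the general idea, not the authors' implementation. It assumes that "bivariate n-gram" means word bigrams, that support is counted as the number of documents containing a bigram, and that the map and reduce phases can be imitated with plain Python functions. The stop-word list, the function names, and the `min_support` parameter are all hypothetical.

```python
from collections import Counter
from itertools import chain

# Hypothetical stop-word list; the abstract does not specify one.
STOP_WORDS = {"a", "an", "the", "is", "of", "and", "to", "in", "for"}


def map_bigrams(doc, remove_stop_words=True):
    """Map step: emit (bigram, 1) once per distinct bigram in one document."""
    tokens = doc.lower().split()
    if remove_stop_words:
        tokens = [t for t in tokens if t not in STOP_WORDS]
    # A set ensures each bigram counts once per document (document frequency).
    return [(bigram, 1) for bigram in set(zip(tokens, tokens[1:]))]


def reduce_counts(pairs):
    """Reduce step: sum the emitted counts for each bigram key."""
    totals = Counter()
    for key, count in pairs:
        totals[key] += count
    return totals


def frequent_bigrams(docs, min_support, remove_stop_words=True):
    """Keep the bigrams that occur in at least min_support documents."""
    mapped = chain.from_iterable(
        map_bigrams(doc, remove_stop_words) for doc in docs
    )
    totals = reduce_counts(mapped)
    return {bigram: n for bigram, n in totals.items() if n >= min_support}


# Toy corpus, invented for illustration only.
docs = [
    "text clustering of big data",
    "clustering the big data with map reduce",
    "map reduce for text clustering",
]

with_removal = frequent_bigrams(docs, min_support=2)
without_removal = frequent_bigrams(docs, min_support=2, remove_stop_words=False)
```

On this toy corpus, removing stop words lets "clustering big" surface as a frequent bigram, because the intervening "of" and "the" no longer separate the two words. The retained frequent bigrams would then serve as a reduced feature set for grouping documents; that clustering step is not shown here, and the toy result says nothing about the cluster quality reported in the paper.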

Metadata
Title
A Novel Map-Reduce Based Augmented Clustering Algorithm for Big Text Datasets
Authors
K. V. Kanimozhi
M. Venkatesan
Copyright year
2018
Publisher
Springer Singapore
DOI
https://doi.org/10.1007/978-981-10-3223-3_41
