Published in: Cluster Computing 5/2019

4 September 2017

A hybrid approach for mismatch data reduction in datasets and guide data mining

Authors: R. Dhanalakshmi, T. Sethukarasi

Abstract

An outlier is a data element, or set of elements, that differs distinctly from the rest of the data in a dataset, which is taken to be normal. Outlier detection is an active area of research in data mining. When clustering methods are used, the elements lying outside the clusters are detected as outliers. This is not always sufficient, because a few unknown elements can become part of a cluster. To remove irrelevant data from the dataset completely, it is therefore necessary to identify and eliminate the data that have merged into the clusters. An efficient hybrid approach is proposed to reduce the number of outliers. Two data mining algorithms, multilayer neural networks (MLN) and weighted K-means, are employed in the proposed approach to identify outliers in a data group. The approach guides cluster formation and results in better clusters. Each element of the dataset is provided as input to the MLN after weights have been assigned by weighted K-means. The MLN is trained to reproduce the normal input data (inliers) and ensures that the groups formed by weighted K-means consist of inliers only. Among the outlier detection methods presented in the data mining literature, the proposed method is based on integrating semantic knowledge. It identifies a data point as an outlier when its behaviour differs from that of the other data elements belonging to the same cluster or class. The principal aim of this research is to reduce the number of outliers by enhancing the performance of clustering or classification techniques, which in turn improves accuracy and reduces the mean square error. The test results provide evidence of the superiority of the proposed strategy in reducing outliers.
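
The two-stage idea in the abstract can be illustrated with a small sketch. It is not the authors' implementation: the function names, the toy data, the fixed feature weights, the explicit seed points, and the learning settings are all assumptions made for illustration. In particular, a tied-weight linear replicator with a single hidden unit stands in for the MLN, and the feature weights are supplied by hand rather than learned by weighted K-means. The sketch clusters the data under a weighted distance, trains the replicator to reproduce the weighted inputs, and then ranks elements by reconstruction error so that an element absorbed into a cluster can still be singled out.

```python
from math import sqrt


def weighted_kmeans(points, weights, seeds, max_iter=100):
    """Lloyd-style K-means under the distance sum_j w_j * (x_j - c_j)**2."""
    def dist(x, c):
        return sum(w * (a - b) ** 2 for w, a, b in zip(weights, x, c))

    centers = [list(s) for s in seeds]
    labels = None
    for _ in range(max_iter):
        new_labels = [min(range(len(centers)), key=lambda k: dist(x, centers[k]))
                      for x in points]
        if new_labels == labels:      # assignments stable: converged
            break
        labels = new_labels
        for k in range(len(centers)):
            members = [x for x, lab in zip(points, labels) if lab == k]
            if members:               # keep the old centre if a cluster empties
                centers[k] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers


def train_replicator(rows, lr=0.1, epochs=300):
    """Fit x_hat = w * (w . x) by batch gradient descent on squared error.

    Assumption: one linear hidden unit with tied weights, a deliberately
    small stand-in for a multilayer network.
    """
    dim, n = len(rows[0]), len(rows)
    w = [0.3] + [0.1] * (dim - 1)     # fixed start so the run is reproducible
    for _ in range(epochs):
        grad = [0.0] * dim
        for x in rows:
            h = sum(wi * xi for wi, xi in zip(w, x))        # hidden activation
            r = [xi - wi * h for xi, wi in zip(x, w)]       # reconstruction residual
            wr = sum(wi * ri for wi, ri in zip(w, r))
            for j in range(dim):
                grad[j] += -2.0 * (h * r[j] + wr * x[j]) / n
        w = [wi - lr * g for wi, g in zip(w, grad)]
    return w


def reconstruction_errors(rows, w):
    """Squared reconstruction error of each row under the trained replicator."""
    errs = []
    for x in rows:
        h = sum(wi * xi for wi, xi in zip(w, x))
        errs.append(sum((xi - wi * h) ** 2 for xi, wi in zip(x, w)))
    return errs


# Toy data (hypothetical): two groups whose features move together, plus one
# planted element (last row) that sits near the second group but breaks the pattern.
points = [
    (0.0, 0.1), (0.2, 0.1), (-0.1, -0.2), (0.1, 0.2), (-0.2, -0.1),
    (2.0, 2.1), (2.2, 2.1), (1.9, 1.8), (2.1, 2.2), (1.8, 1.9),
    (1.5, 2.5),
]
weights = (0.5, 0.5)                  # assumed fixed feature weights, summing to 1

# Stage 1: weighted K-means, seeded with one point from each visible group.
labels, centers = weighted_kmeans(points, weights, seeds=[points[0], points[5]])

# Stage 2: feed the weighted (scaled by sqrt(weight)) and centred data to the replicator.
scaled = [[sqrt(w) * v for w, v in zip(weights, p)] for p in points]
means = [sum(col) / len(scaled) for col in zip(*scaled)]
rows = [[v - m for v, m in zip(p, means)] for p in scaled]

w = train_replicator(rows)
errors = reconstruction_errors(rows, w)

# The element that the replicator reproduces worst is the outlier candidate.
suspect = max(range(len(points)), key=errors.__getitem__)
print("cluster of suspect:", labels[suspect], "point:", points[suspect])
```

On this toy data, K-means alone assigns the planted element to the second cluster, which is the situation the abstract describes; ranking by reconstruction error is one simple way to surface it afterwards. A practical version would replace the ranking with a threshold and the linear replicator with a trained multilayer network.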

Metadata
Title
A hybrid approach for mismatch data reduction in datasets and guide data mining
Authors
R. Dhanalakshmi
T. Sethukarasi
Publication date
4 September 2017
Publisher
Springer US
Published in
Cluster Computing / Special Issue 5/2019
Print ISSN: 1386-7857
Electronic ISSN: 1573-7543
DOI
https://doi.org/10.1007/s10586-017-1137-4