Published in: Cluster Computing 5/2019

16-12-2017

Swarm intelligent based online feature selection (OFS) and weighted entropy frequent pattern mining (WEFPM) algorithm for big data analysis

Authors: S. Gayathri Devi, M. Sabrigiriraj


Abstract

Over the past two decades, frequent pattern mining (FPM) has attracted the interest of many researchers. It involves extracting frequently occurring itemsets from transactions and sequences from big datasets, and recognizing common subgraphs in molecular structures. In the big data era, the unpredictable flow and huge quantity of data bring new challenges to FPM, such as space and time complexity. Most research focuses on recognizing the patterns that occur frequently in a specific set of data, where the patterns within every transaction are known a priori. Of these frequent patterns, users are interested in only a small part. To tackle such problems, it is sometimes necessary to select only the important features, using appropriate FPM algorithms, in order to reduce complexity. The major objective of this work is to improve FPM results and the classification accuracy of big dataset samples. To tackle the first challenge, the Lévy flight bat algorithm (LFBA) combined with an online feature selection (OFS) approach is proposed, which filters low-quality features from the big data in an online manner. To address the second challenge, weighted entropy frequent pattern mining (WEFPM) is applied to achieve better computation time than methods such as direct discriminative pattern mining (DDPMine) and iterative sampling based frequent itemset mining (ISbFIM), in which all feature combinations are enumerated. The WEFPM algorithm employed in this paper therefore aims to identify only the specific frequent patterns required by the user. Iterating this procedure ensures, as supported by both theoretical and empirical analysis, that the acquired frequent patterns can be enumerated without the enumeration proceeding into a combinatorial explosion.
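To make the Lévy flight component of a bat-style search more concrete, the following is a minimal Python sketch of drawing Lévy-distributed step lengths with Mantegna's algorithm, a widely used sampling scheme. It is not the authors' LFBA–OFS implementation: the function names, the exponent `beta = 1.5`, the `scale` factor and the update rule in `levy_move` are all assumptions chosen for illustration.

```python
import math
import random


def mantegna_sigma(beta: float) -> float:
    """Scale of the numerator Gaussian in Mantegna's algorithm (0 < beta <= 2)."""
    num = math.gamma(1 + beta) * math.sin(math.pi * beta / 2)
    den = math.gamma((1 + beta) / 2) * beta * 2 ** ((beta - 1) / 2)
    return (num / den) ** (1 / beta)


def levy_step(beta: float = 1.5, rng=random) -> float:
    """Draw one heavy-tailed step length: u / |v|^(1/beta)."""
    u = rng.gauss(0.0, mantegna_sigma(beta))
    v = rng.gauss(0.0, 1.0)
    while v == 0.0:  # guard against division by zero (extremely unlikely)
        v = rng.gauss(0.0, 1.0)
    return u / abs(v) ** (1 / beta)


def levy_move(position, best, beta: float = 1.5, scale: float = 0.01):
    """Hypothetical update rule: perturb each coordinate by a Levy step
    proportional to its distance from the best-known solution."""
    return [x + scale * levy_step(beta) * (x - b) for x, b in zip(position, best)]


if __name__ == "__main__":
    random.seed(42)
    print(levy_move([0.2, 0.8, 0.4], [0.5, 0.5, 0.5]))
```

Most draws are small, but the heavy tail occasionally produces a long jump, which is the usual motivation for adding Lévy flights to a swarm search: it helps candidates escape local optima. How such a continuous position would be mapped to a selected feature subset is left open here, since the abstract does not specify it.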
In addition, using the LFBA–OFS approach and the WEFPM algorithm, frequent patterns that differ in nature are generated to build a high-quality learning model. To find the frequent patterns, the minimum support threshold is matched with entropy. As a final step, a multiple kernel learning support vector machine is employed as a classifier to evaluate performance on the big data samples in terms of efficiency and accuracy. The empirical study reveals considerable improvement in accuracy and computation time when the proposed approach is applied to UCI benchmark big datasets for efficient and effective FPM of the online features. WEFPM is reported as the most efficient method, producing higher average accuracies of 92.34, 93.218, 91.374 and 87.87% for the adult, chess, hybo and sick datasets, respectively. It outperforms other methods such as DDPMine and ISbFIM when using a LIBSVM classifier.
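The abstract does not give the exact rule by which support is "matched with entropy", so the following Python sketch only illustrates one plausible reading: keep an itemset when it is frequent enough and the class labels of the transactions containing it have low Shannon entropy. The names `support`, `class_entropy`, `select_patterns`, the thresholds and the toy data are all hypothetical and are not taken from the paper.

```python
import math
from collections import Counter

# Toy labelled transactions: (set of items, class label). Purely illustrative.
TOY = [
    (frozenset({"a", "b"}), 1),
    (frozenset({"a", "c"}), 1),
    (frozenset({"b", "c"}), 0),
    (frozenset({"a", "b", "c"}), 0),
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item of the itemset."""
    itemset = frozenset(itemset)
    hits = sum(1 for items, _ in transactions if itemset <= items)
    return hits / len(transactions)


def class_entropy(itemset, transactions):
    """Shannon entropy (bits) of class labels among covering transactions."""
    itemset = frozenset(itemset)
    labels = [label for items, label in transactions if itemset <= items]
    if not labels:
        return 0.0
    n = len(labels)
    entropy = 0.0
    for count in Counter(labels).values():
        p = count / n
        entropy -= p * math.log2(p)
    return entropy


def select_patterns(candidates, transactions, min_support, max_entropy):
    """Keep candidates that are frequent and have low class entropy."""
    return [
        frozenset(c)
        for c in candidates
        if support(c, transactions) >= min_support
        and class_entropy(c, transactions) <= max_entropy
    ]


if __name__ == "__main__":
    candidates = [{"a"}, {"b", "c"}, {"a", "b"}]
    print(select_patterns(candidates, TOY, min_support=0.5, max_entropy=0.5))
```

In this toy data the itemset {b, c} is kept because both transactions that contain it share one class label, whereas {a} is frequent but its labels are mixed. A low-entropy pattern is a more informative feature for a downstream classifier, which is the general idea behind discriminative pattern mining.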


Literature
1. Cai, C.H., Fu, A.W.C., Cheng, C.H., Kwong, W.W.: Mining association rules with weighted items. In: International Database Engineering and Applications Symposium, 1998 (IDEAS’98), pp. 68–77 (1998)
2. Zaki, M.J., Hsiao, C.: CHARM: an efficient algorithm for closed itemset mining. In: Proc. of SDM, pp. 457–473 (2002)
3. Pei, J., Han, J., Mortazavi-Asl, B., Pinto, H., Chen, Q., Dayal, U., Hsu, M.-C.: PrefixSpan: mining sequential patterns efficiently by prefix-projected pattern growth. In: Proc. of ICDE, pp. 215–226 (2001)
4. Washio, T., Motoda, H.: State of the art of graph-based data mining. ACM SIGKDD Explor. Newsl. 5(1), 59–68 (2003)
5. Yan, X., Yu, P.S., Han, J.: Graph indexing: a frequent structure-based approach. In: Proc. of SIGMOD, pp. 335–346 (2004)
6. Duda, R.O., Hart, P.E., Stork, D.G.: Pattern Classification. Wiley, New York (2012)
7. Dean, J., Ghemawat, S.: Mapreduce: simplified data processing on large clusters. Commun. ACM 51(1), 107–113 (2008)
8. Fan, W., Zhang, K., Cheng, H., Gao, J., Yan, X., Han, J., Yu, P., Verscheure, O.: Direct mining of discriminative and essential frequent patterns via model-based search tree. In: Proceeding of KDD ’08, pp. 230–238. ACM, New York (2008)
9. Shintani, T., Kitsuregawa, M.: Parallel mining algorithms for generalized association rules with classification hierarchy. ACM SIGMOD Record 27(2), 25–36 (1998)
10. Borgelt, C., Kruse, R.: Induction of association rules: a priori implementation. In: Compstat, pp. 395–400 (2002)
11. Pan, F., Cong, G., Tung, A.K.H., Yang, J., Zaki, M.J.: CARPENTER: finding closed patterns in long biological datasets. In: Proc. 2003 ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining (2003)
12. Pan, F., Tung, A.K.H., Cong, G., Xu, X.: COBBLER: combining column and row enumeration for closed pattern discovery. In: Proc. 2004 Int. Conf. on Scientific and Statistical Database Management (SSDBM’04), Santorini Island, Greece, pp. 21–30 (2004)
13. Cong, G., Tan, K.-L., Tung, A.K.H., Xu, X.: Mining top-k covering rule groups for gene expression data. In: 24th ACM International Conference on Management of Data (2005)
14. Lin, M.Y., Lee, P.Y., Hsueh, S.C.: Apriori-based frequent item set mining algorithms on mapreduce. In: Proceedings of the 6th International Conference on Ubiquitous Information Management and Communication, ICUIMC’12, pp. 76:1–76:8. ACM, New York (2012)
15. Zaki, M., Parthasarathy, S., Ogihara, M., Li, W.: Parallel algorithms for discovery of association rules. Data Min. Knowl. Discov. 1, 343–373 (1997)
16. Li, H., Wang, Y., Zhang, D., Zhang, M., Chang, E.Y.: PFP: parallel FP-growth for query recommendation. In: Proceedings of the 2008 ACM Conference on Recommender Systems, RecSys’08, pp. 107–114. ACM, New York (2008)
17. Yang, G.: Computational aspects of mining maximal frequent patterns. Theor. Comput. Sci. 362(1–3), 63–85 (2006)
18. Wang, J., Zhao, P., Hoi, S.C., Jin, R.: Online feature selection and its applications. IEEE Trans. Knowl. Data Eng. 26(3), 698–710 (2014)
19. Hoi, S.C., Wang, J., Zhao, P., Jin, R.: Online feature selection for mining big data. In: Proceedings of the 1st International Workshop on Big Data, Streams and Heterogeneous Source Mining: Algorithms, Systems, Programming Models and Applications, pp. 93–100 (2012)
20. Aridhi, S., d’Orazio, L., Maddouri, M., Nguifo, E.M.: Density based data partitioning strategy to approximate large-scale subgraph mining. Inf. Syst. 48, 213–223 (2015)
21. Qiu, H., Gu, R., Yuan, C., Huang, Y.: Yafim: a parallel frequent itemset mining algorithm with spark. In: IEEE International Parallel & Distributed Processing Symposium Workshops (IPDPSW), pp. 1664–1671 (2014)
22. Cheng, H., Yan, X., Han, J., Hsu, C.W.: Discriminative frequent pattern analysis for effective classification. In: International Conference on Data Engineering, pp. 716–725 (2007)
23. Cheng, H., Yan, X., Han, J., Yu, P.S.: Direct discriminative pattern mining for effective classification. In: Proceedings of ICDM ’08. IEEE Computer Society, Washington, DC, pp. 169–178 (2008)
24. Wu, X., Fan, W., Peng, J., Zhang, K., Yu, Y.: Iterative sampling based frequent itemset mining for big data. Int. J. Mach. Learn. Cybern. 6(6), 875–882 (2015)
25. Gole, S., Tidke, B.: ClustBIGFIM-frequent itemset mining of big data using pre-processing based on mapreduce framework. Int. J. Found. Comput. Sci. Technol. 5(3), 79–89 (2015)
26. Gawwad, M.A., Ahmed, M.F., Fayek, M.B.: Frequent itemset mining for big data using greatest common divisor technique. Data Sci. J. 16(25), 1–10 (2017)
27. Hasançebi, O., Teke, T., Pekcan, O.: A bat-inspired algorithm for structural optimization. Comput. Struct. 128, 77–90 (2013)
28. Xie, J., Zhou, Y., Chen, H.: A novel bat algorithm based on differential operator and Lévy flights trajectory. Comput. Intell. Neurosci. 2013, 1–13 (2013)
29. Yilmaz, S., Küçüksille, E.U.: A new modification approach on bat algorithm for solving optimization problems. Appl. Soft Comput. 28, 259–275 (2015)
30. Yang, X.-S., Deb, S.: Eagle strategy using Lévy walk and firefly algorithms for stochastic optimization. Stud. Comput. Intell. 284, 101–111 (2010)
31. Mantegna, R.N.: Fast, accurate algorithm for numerical simulation of Lévy stable stochastic processes. Phys. Rev. E 49(5), 4677–4683 (1994)
32. Schölkopf, B., Smola, A.J.: Learning with Kernels. MIT Press, Cambridge, MA (2002)
33. Cao, H., Naito, T., Ninomiya, Y.: Approximate RBF kernel SVM and its applications in pedestrian classification. In: The 1st International Workshop on Machine Learning for Vision-Based Motion Analysis-MLVMA’08 (2008)
34. Yekkehkhany, B., Safari, A., Homayouni, S., Hasanlou, M.: A comparison study of different Kernel functions for SVM-based classification of multi-temporal polarimetry SAR data. Int. Arch. Photogramm. Remote Sens. Spat. Inform. Sci. 40(2), 281–285 (2014)
35. Lanckriet, G., De Bie, T., Cristianini, N., Jordan, M.I., Stafford, Noble W.: A statistical framework for genomic data fusion. Bioinformatics 20(16), 2626–2635 (2004)
36. Tsochantaridis, I., Hoffmann, T., Joachims, T., Altun, Y.: Support vector machine learning for interdependent and structured output spaces. In: Proceedings of the 16th International Conference on Machine Learning (2004)
37. Chang, C.C., Lin, C.J.: LIBSVM: a library for support vector machines. ACM Trans. Intell. Syst. Technol. 2(3), 27–27 (2011)
Metadata
Title
Swarm intelligent based online feature selection (OFS) and weighted entropy frequent pattern mining (WEFPM) algorithm for big data analysis
Authors
S. Gayathri Devi
M. Sabrigiriraj
Publication date
16-12-2017
Publisher
Springer US
Published in
Cluster Computing / Special Issue 5/2019
Print ISSN: 1386-7857
Electronic ISSN: 1573-7543
DOI
https://doi.org/10.1007/s10586-017-1489-9
