Published in: Soft Computing 10/2021

08-03-2021 | Methodologies and Application

A distributed ensemble of relevance vector machines for large-scale data sets on Spark

Authors: Wangchen Qin, Fang Liu, Mi Tong, Zhengying Li

Abstract

The relevance vector machine (RVM) is a machine learning algorithm based on sparse Bayesian theory that shows good classification performance on small-scale data sets. However, because the RVM has a runtime complexity of \(O(n^{3})\) and a space complexity of \(O(n^{2})\), it is difficult to train models on medium-sized or large-scale data sets. Therefore, a distributed ensemble of relevance vector machines on the Spark framework (DE-RVM) is proposed. In this approach, a data set is divided into a number of disjoint subsets, and on each subset a set of RVM classifiers is trained using AdaBoostRVM based on sample type (STAB-RVM), following the concept of ensemble learning. The final classifier is generated by a combination method that uses a diversity measure for the RVM classifiers, and the smallest empirical loss of the combined classifier is obtained by solving a quadratic programming problem. The algorithm was applied to both artificial and real data sets. The experimental results show that the proposed method offers good classification performance and can effectively improve the ability of the RVM to process large-scale data sets.
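The partition-then-ensemble pattern described in the abstract can be illustrated with a minimal, single-machine sketch. It is not the authors' DE-RVM implementation. Several stand-ins are assumed here: a weighted decision stump replaces each RVM, plain AdaBoost replaces STAB-RVM, a uniform sum of the boosted votes replaces the diversity-aware quadratic-programming combination, and a Python list replaces the Spark data set. All function names (`partition`, `train_stump`, `adaboost`, `predict`) are hypothetical.

```python
import math
import random


def partition(data, k, seed=0):
    """Shuffle the data and split it into k disjoint subsets (round-robin)."""
    rng = random.Random(seed)
    shuffled = data[:]
    rng.shuffle(shuffled)
    return [shuffled[i::k] for i in range(k)]


def train_stump(samples, weights):
    """Weighted decision stump on 1-D inputs (assumed stand-in for one RVM)."""
    best = None
    xs = sorted({x for x, _ in samples})
    # Candidate thresholds: one below the minimum, plus midpoints of neighbours.
    thresholds = [xs[0] - 1.0] + [(a + b) / 2 for a, b in zip(xs, xs[1:])]
    for t in thresholds:
        for sign in (1, -1):
            # Weighted error of the rule "predict sign if x > t, else -sign".
            err = sum(w for (x, y), w in zip(samples, weights)
                      if (sign if x > t else -sign) != y)
            if best is None or err < best[0]:
                best = (err, t, sign)
    return best  # (weighted error, threshold, sign)


def adaboost(samples, rounds=5):
    """Plain AdaBoost (assumed stand-in for STAB-RVM) on one subset."""
    n = len(samples)
    weights = [1.0 / n] * n
    ensemble = []
    for _ in range(rounds):
        err, t, sign = train_stump(samples, weights)
        err = min(max(err, 1e-10), 1 - 1e-10)  # keep the logarithm finite
        alpha = 0.5 * math.log((1 - err) / err)
        ensemble.append((alpha, t, sign))
        # Misclassified samples gain weight, correct ones lose weight.
        weights = [w * math.exp(-alpha * y * (sign if x > t else -sign))
                   for (x, y), w in zip(samples, weights)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble


def predict(ensembles, x):
    """Uniform combination of all per-subset ensembles (no QP, no diversity term)."""
    score = sum(alpha * (sign if x > t else -sign)
                for ens in ensembles for alpha, t, sign in ens)
    return 1 if score >= 0 else -1


# Toy data: 200 points on [-1, 1] without 0, labelled by their sign.
data = [(i / 100, 1 if i > 0 else -1) for i in range(-100, 101) if i != 0]
subsets = partition(data, k=4)
ensembles = [adaboost(s) for s in subsets]
print(predict(ensembles, 0.5), predict(ensembles, -0.5))
```

The point of the split is the cost arithmetic: if training one model grows as \(n^{3}\), a model trained on a subset of size \(n/k\) costs roughly \(1/k^{3}\) of a model trained on the full set, and the subsets can be processed independently. In a Spark setting each subset would presumably map to a partition handled by a separate worker, but that mapping is an assumption of this sketch, not a detail given in the abstract.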

Metadata
Title: A distributed ensemble of relevance vector machines for large-scale data sets on Spark
Authors: Wangchen Qin, Fang Liu, Mi Tong, Zhengying Li
Publication date: 08-03-2021
Publisher: Springer Berlin Heidelberg
Published in: Soft Computing, Issue 10/2021
Print ISSN: 1432-7643
Electronic ISSN: 1433-7479
DOI: https://doi.org/10.1007/s00500-021-05671-y
