2013 | OriginalPaper | Chapter

16. Scaling up Data Mining Techniques to Large Datasets Using Parallel and Distributed Processing

Authors: Frederic Stahl, Mohamed Medhat Gaber, Max Bramer

Published in: Business Intelligence and Performance Management

Publisher: Springer London

Abstract

Advances in hardware and software technology enable us to collect, store and distribute data on a very large scale. Automatically discovering and extracting hidden knowledge in the form of patterns from these large data volumes is known as data mining. Data mining technology is not only a part of business intelligence, but is also used in many other application areas such as research, marketing and financial analytics. For example, medical scientists can use patterns extracted from historic patient data to determine whether a new patient is likely to respond positively to a particular treatment; marketing analysts can use patterns extracted from customer data to plan future advertising campaigns; and finance experts are interested in patterns that forecast the development of particular stock market shares in order to make investment recommendations. However, extracting knowledge in the form of patterns from massive data volumes poses a number of computational challenges in terms of processing time, memory, bandwidth and power consumption. These challenges have led to the development of parallel and distributed data analysis approaches and the utilisation of Grid and Cloud computing. This chapter gives an overview of parallel and distributed computing approaches and how they can be used to scale up data mining to large datasets.

Metadata

Title: Scaling up Data Mining Techniques to Large Datasets Using Parallel and Distributed Processing
Authors: Frederic Stahl, Mohamed Medhat Gaber, Max Bramer
Copyright Year: 2013
Publisher: Springer London
DOI: https://doi.org/10.1007/978-1-4471-4866-1_16
