2013 | OriginalPaper | Chapter

16. Scaling up Data Mining Techniques to Large Datasets Using Parallel and Distributed Processing

Authors: Frederic Stahl, Mohamed Medhat Gaber, Max Bramer

Published in: Business Intelligence and Performance Management

Publisher: Springer London

Abstract

Advances in hardware and software technology enable us to collect, store and distribute data on a very large scale. Automatically discovering and extracting hidden knowledge in the form of patterns from these large data volumes is known as data mining. Data mining technology is not only a part of business intelligence, but is also used in many other application areas such as research, marketing and financial analytics. For example, medical scientists can use patterns extracted from historic patient data to determine whether a new patient is likely to respond positively to a particular treatment; marketing analysts can use patterns extracted from customer data for future advertisement campaigns; and finance experts are interested in patterns that forecast the development of certain stock market shares for investment recommendations. However, extracting knowledge in the form of patterns from massive data volumes poses a number of computational challenges in terms of processing time, memory, bandwidth and power consumption. These challenges have led to the development of parallel and distributed data analysis approaches and the utilisation of Grid and Cloud computing. This chapter gives an overview of parallel and distributed computing approaches and how they can be used to scale up data mining to large datasets.



Print ISBN: 978-1-4471-4865-4

Electronic ISBN: 978-1-4471-4866-1

Copyright Year: 2013
DOI: https://doi.org/10.1007/978-1-4471-4866-1_16
