Skip to main content
Erschienen in: Cluster Computing 3/2019

26.10.2018

Toward a new approach for sorting extremely large data files in the big data era

verfasst von: Ali Shatnawi, Yathrip AlZahouri, Mohammed A. Shehab, Yaser Jararweh, Mahmoud Al-Ayyoub

Erschienen in: Cluster Computing | Ausgabe 3/2019

Einloggen

Aktivieren Sie unsere intelligente Suche, um passende Fachinhalte oder Patente zu finden.

search-config
loading …

Abstract

The extensive amount of data and contents generated today will require a paradigm shift in processing and management techniques for these data. One of the important data processing operations is the data sorting. Using multiple passes in external merge sort has a great influence on speeding up the sorting of extremely large data files. Since in large files, the swapping time is dominant in many applications, algorithms that minimize the swapping operations are normally superior to those which only focus on CPU time optimizations. In sorting extremely large files, external algorithms, such as the merge sort, are normally used. It is shown that using multiple passes over the data set, as proposed in our algorithm, has resulted in a great improvement in the number of swaps, thus, reducing the overall sorting time. Moreover, the proposed technique is suitable to be used with the emerging parallelization techniques such as GPUs. The reported results show the superiority of the proposed technique for “CPU only” and hybrid CPU–GPU implementations.

Sie haben noch keine Lizenz? Dann Informieren Sie sich jetzt über unsere Produkte:

Springer Professional "Wirtschaft+Technik"

Online-Abonnement

Mit Springer Professional "Wirtschaft+Technik" erhalten Sie Zugriff auf:

  • über 102.000 Bücher
  • über 537 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Maschinenbau + Werkstoffe
  • Versicherung + Risiko

Jetzt Wissensvorsprung sichern!

Springer Professional "Technik"

Online-Abonnement

Mit Springer Professional "Technik" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 390 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Maschinenbau + Werkstoffe




 

Jetzt Wissensvorsprung sichern!

Springer Professional "Wirtschaft"

Online-Abonnement

Mit Springer Professional "Wirtschaft" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 340 Zeitschriften

aus folgenden Fachgebieten:

  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Versicherung + Risiko




Jetzt Wissensvorsprung sichern!

Fußnoten
1
The GitHub repository link will be provided here once the paper is accepted.
 
Literatur
1.
2.
Zurück zum Zitat Knuth, D.E.: The Art of Computer Programming, vol 3. Sorting and Searching, 2nd edn. Addison Wesley, Massachusetts (1998) Knuth, D.E.: The Art of Computer Programming, vol 3. Sorting and Searching, 2nd edn. Addison Wesley, Massachusetts (1998)
3.
Zurück zum Zitat Graefe, G.: Query evaluation techniques for large databases. ACM Comput. Surv. 25(2), 73–170 (1993)CrossRef Graefe, G.: Query evaluation techniques for large databases. ACM Comput. Surv. 25(2), 73–170 (1993)CrossRef
4.
Zurück zum Zitat John, L.H., David, A.P.: Computer Organization and Design (3rd): The Hardware/Software Interface. Morgan Kaufmann Publishers Inc, San Francisco, CA (2004) John, L.H., David, A.P.: Computer Organization and Design (3rd): The Hardware/Software Interface. Morgan Kaufmann Publishers Inc, San Francisco, CA (2004)
6.
Zurück zum Zitat Shatnawi, A., Alzahouri, Y.: A multi-pass algorithm for sorting extremely large data files. In: 2015 6th International Conference on Information and Communication Systems (ICICS), pp. 79–82. IEEE (2015) Shatnawi, A., Alzahouri, Y.: A multi-pass algorithm for sorting extremely large data files. In: 2015 6th International Conference on Information and Communication Systems (ICICS), pp. 79–82. IEEE (2015)
7.
Zurück zum Zitat Manber, U.: Introduction to Algorithms: A Creative Approach. Addison-Wesley, Reading, MA (1989)MATH Manber, U.: Introduction to Algorithms: A Creative Approach. Addison-Wesley, Reading, MA (1989)MATH
8.
Zurück zum Zitat Thomas, H.C., Charles, E.L., Ronald, L.R.: Introduction to Algorithms. McGraw-Hill, New York (1989) Thomas, H.C., Charles, E.L., Ronald, L.R.: Introduction to Algorithms. McGraw-Hill, New York (1989)
9.
10.
Zurück zum Zitat Shehab, M.A., Yaseen, Q., Al-Ayyoub, M., Albalas, F., Jararweh, Y.: Accelerating FCM-based text classification algorithm using GPUs. In: 2016 IEEE High Performance Extreme Computing Conference (HPEC-2016), Boston, USA (2016) Shehab, M.A., Yaseen, Q., Al-Ayyoub, M., Albalas, F., Jararweh, Y.: Accelerating FCM-based text classification algorithm using GPUs. In: 2016 IEEE High Performance Extreme Computing Conference (HPEC-2016), Boston, USA (2016)
11.
Zurück zum Zitat Shehab, M.A., Ghadawi, A.A., Alawneh, L., Al-Ayyoub, M., Jararweh, Y.: A hybrid CPU-GPU implementation to accelerate multiple pairwise protein sequence alignment. In: The 8th International Conference on Information and Communication Systems, Irbid (2017) Shehab, M.A., Ghadawi, A.A., Alawneh, L., Al-Ayyoub, M., Jararweh, Y.: A hybrid CPU-GPU implementation to accelerate multiple pairwise protein sequence alignment. In: The 8th International Conference on Information and Communication Systems, Irbid (2017)
12.
Zurück zum Zitat Shehab, M.A., Al-Ayyoub, M., Jararweh, Y., Jarrah, M.: Accelerating compute-intensive image segmentation algorithms using GPUs. J. Supercomput. 1, 1–23 (2016) Shehab, M.A., Al-Ayyoub, M., Jararweh, Y., Jarrah, M.: Accelerating compute-intensive image segmentation algorithms using GPUs. J. Supercomput. 1, 1–23 (2016)
13.
Zurück zum Zitat Cook, S., Programming, C.U.D.A.: A Developer’s Guide to Parallel Computing with GPUs. Morgan Kaufmann, San Francisco, CA (2012) Cook, S., Programming, C.U.D.A.: A Developer’s Guide to Parallel Computing with GPUs. Morgan Kaufmann, San Francisco, CA (2012)
14.
Zurück zum Zitat Sintorn, E., Assarsson, U.: Fast parallel GPU-sorting using a hybrid algorithm. J. Parallel Distrib. Comput. 68(10), 1381–1388 (2008)CrossRefMATH Sintorn, E., Assarsson, U.: Fast parallel GPU-sorting using a hybrid algorithm. J. Parallel Distrib. Comput. 68(10), 1381–1388 (2008)CrossRefMATH
15.
Zurück zum Zitat Satish, N., Harris, M., Garland, M.: Designing efficient sorting algorithms for many core GPUs. In IEEE International Symposium on Parallel & Distributed Processing, 2009 (IPDPS 2009), pp. 1–10. IEEE (2009) Satish, N., Harris, M., Garland, M.: Designing efficient sorting algorithms for many core GPUs. In IEEE International Symposium on Parallel & Distributed Processing, 2009 (IPDPS 2009), pp. 1–10. IEEE (2009)
17.
Zurück zum Zitat Ye, Y., Du, Z., Bader, D.A., Yang, Q., Huo, W.: GPUMemSort: a high performance graphics co-processors sorting algorithm for large scale in-memory data. GSTF J. Comput. 1(2), 23–27 (2018) Ye, Y., Du, Z., Bader, D.A., Yang, Q., Huo, W.: GPUMemSort: a high performance graphics co-processors sorting algorithm for large scale in-memory data. GSTF J. Comput. 1(2), 23–27 (2018)
18.
Zurück zum Zitat Singh, D., Reddy, C.K.: A survey on platforms for big data analytics. J. Big Data 2(1), 8 (2015)CrossRef Singh, D., Reddy, C.K.: A survey on platforms for big data analytics. J. Big Data 2(1), 8 (2015)CrossRef
19.
Zurück zum Zitat Jiang, H., Chen, Y., Qiao, Z., Weng, T.H., Li, K.C.: Scaling up MapReduce-based big data processing on multi-GPU systems. Clust. Comput. 18(1), 369–383 (2015)CrossRef Jiang, H., Chen, Y., Qiao, Z., Weng, T.H., Li, K.C.: Scaling up MapReduce-based big data processing on multi-GPU systems. Clust. Comput. 18(1), 369–383 (2015)CrossRef
20.
Zurück zum Zitat O’Driscoll, A., Daugelaite, J., Sleator, R.D.: ‘Big data’, Hadoop and cloud computing in genomics. J. Biomed. Inform. 46(5), 774–781 (2013)CrossRefMATH O’Driscoll, A., Daugelaite, J., Sleator, R.D.: ‘Big data’, Hadoop and cloud computing in genomics. J. Biomed. Inform. 46(5), 774–781 (2013)CrossRefMATH
21.
Zurück zum Zitat Islam, M.R., Uddin, S.M.R., Roy, C.: Computational complexities of the external sorting algorithm with no additional disk space. Int. J. Comput. Internet Manag. (IJCIM) 13(3), 60–68 (2005) Islam, M.R., Uddin, S.M.R., Roy, C.: Computational complexities of the external sorting algorithm with no additional disk space. Int. J. Comput. Internet Manag. (IJCIM) 13(3), 60–68 (2005)
22.
Zurück zum Zitat Islam, M.R., Nusrat, W., Hossain, M., Rana, S.M.M.: A new external sorting algorithm with no additional disk space with special in-place merging technique. In: International Conference on Computer and Information Technology (ICCIT), 26–28 December 2004; Dhaka, Bangladesh (2004) Islam, M.R., Nusrat, W., Hossain, M., Rana, S.M.M.: A new external sorting algorithm with no additional disk space with special in-place merging technique. In: International Conference on Computer and Information Technology (ICCIT), 26–28 December 2004; Dhaka, Bangladesh (2004)
23.
Zurück zum Zitat Islam, M.R., Adnan, N., Islam, N., Hossen, S.: A new external sorting algorithm with no additional disk space. Inf. Process. Lett. 86, 229–233 (2003)MathSciNetCrossRefMATH Islam, M.R., Adnan, N., Islam, N., Hossen, S.: A new external sorting algorithm with no additional disk space. Inf. Process. Lett. 86, 229–233 (2003)MathSciNetCrossRefMATH
24.
Zurück zum Zitat Agarwal, A., Vitter, J.: The input/output complexity of sorting and related problems. Commun. ACM 31(8), 1116–1127 (1988)MathSciNetCrossRef Agarwal, A., Vitter, J.: The input/output complexity of sorting and related problems. Commun. ACM 31(8), 1116–1127 (1988)MathSciNetCrossRef
25.
Zurück zum Zitat Dufrene, W.R., Lin, F.C.: An efficient sorting algorithm with no additional space. Comput. J. 35(3), 308–310 (1992)CrossRef Dufrene, W.R., Lin, F.C.: An efficient sorting algorithm with no additional space. Comput. J. 35(3), 308–310 (1992)CrossRef
27.
Zurück zum Zitat Zheng, L., Larson, P.-Å.: Speeding up external mergesort. IEEE Trans. Knowl. Data Eng. 8(2), 322–332 (1996)CrossRef Zheng, L., Larson, P.-Å.: Speeding up external mergesort. IEEE Trans. Knowl. Data Eng. 8(2), 322–332 (1996)CrossRef
28.
Zurück zum Zitat Zheng, L., Larson, P.-Å.: Buffering and read-ahead strategies for external merge sort. In: Proceedings of the International Conference on Very Large Databases, vol. 24, pp. 523–533 (1998) Zheng, L., Larson, P.-Å.: Buffering and read-ahead strategies for external merge sort. In: Proceedings of the International Conference on Very Large Databases, vol. 24, pp. 523–533 (1998)
29.
Zurück zum Zitat Yiannis, J., Zobel, J.: Compression techniques for fast external sorting. VLDB J. 16(2), 269–291 (2007)CrossRef Yiannis, J., Zobel, J.: Compression techniques for fast external sorting. VLDB J. 16(2), 269–291 (2007)CrossRef
30.
Zurück zum Zitat Verkamo, A.I.: Performance comparison of distributive and merge sort as external sorting algorithms. J. Syst. Softw. 10(3), 187–200 (1989)CrossRef Verkamo, A.I.: Performance comparison of distributive and merge sort as external sorting algorithms. J. Syst. Softw. 10(3), 187–200 (1989)CrossRef
31.
Zurück zum Zitat Nodine, M.H., Vitter, J.S.: Deterministic distribution sort in shared and distributed memory multiprocessors. In: Proceedings of the Fifth Annual ACM Symposium on Parallel Algorithms and Architectures 1993, Velen, June 30–July 02. Germany (1993) Nodine, M.H., Vitter, J.S.: Deterministic distribution sort in shared and distributed memory multiprocessors. In: Proceedings of the Fifth Annual ACM Symposium on Parallel Algorithms and Architectures 1993, Velen, June 30–July 02. Germany (1993)
32.
Zurück zum Zitat Cunto, W., Gonnet, G.H., Munro, J.I., Poblete, P.V.: Fringe analysis for extquick: an in situ distributive external sorting algorithm. Inf. Comput. 92(2), 141–160 (1991)MathSciNetCrossRefMATH Cunto, W., Gonnet, G.H., Munro, J.I., Poblete, P.V.: Fringe analysis for extquick: an in situ distributive external sorting algorithm. Inf. Comput. 92(2), 141–160 (1991)MathSciNetCrossRefMATH
34.
Zurück zum Zitat Wegner, L.M., Teuhola, J.I.: The external heapsort. IEEE Trans. Softw. Eng. 15(7), 917–925 (1989)CrossRef Wegner, L.M., Teuhola, J.I.: The external heapsort. IEEE Trans. Softw. Eng. 15(7), 917–925 (1989)CrossRef
35.
Zurück zum Zitat Samet, H.: Applications of Spatial Data Structures: Computer Graphics, Image Processing, and GIS. Addison Wesley, Reading, MA (1990) Samet, H.: Applications of Spatial Data Structures: Computer Graphics, Image Processing, and GIS. Addison Wesley, Reading, MA (1990)
36.
Zurück zum Zitat Arge, L., Vengroff, D.E., Vitter, J.S.: External-memory algorithms for processing line segments in geographic information systems. Algorithmica 47(1), 1–25 (2007)MathSciNetCrossRefMATH Arge, L., Vengroff, D.E., Vitter, J.S.: External-memory algorithms for processing line segments in geographic information systems. Algorithmica 47(1), 1–25 (2007)MathSciNetCrossRefMATH
37.
Zurück zum Zitat Lars, A.: External-memory algorithms with applications in GIS. In: van Kreveld, M., Nievergelt, J., Roos, T., Widmayer, P. (eds.) Algorithmic Foundations of Geographic Information Systems, pp. 213–254. Springer, Berlin (1996) (this book originated from the CISM Advanced School on the Algorithmic Foundations of Geographic Information Systems) Lars, A.: External-memory algorithms with applications in GIS. In: van Kreveld, M., Nievergelt, J., Roos, T., Widmayer, P. (eds.) Algorithmic Foundations of Geographic Information Systems, pp. 213–254. Springer, Berlin (1996) (this book originated from the CISM Advanced School on the Algorithmic Foundations of Geographic Information Systems)
38.
Zurück zum Zitat Won, K.: Introduction to Object-Oriented Databases. MIT Press, Cambridge (1990) Won, K.: Introduction to Object-Oriented Databases. MIT Press, Cambridge (1990)
39.
Zurück zum Zitat Funkhouser, T.A., Sequin, C.H., Teller, S.J.: Management of large amounts of data in interactive building walkthroughs. In: Proceedings of the 1992 Symposium on Interactive 3D Graphics, Cambridge, MA, I3D ‘92, pp. 11–20. ACM, New York (1992) Funkhouser, T.A., Sequin, C.H., Teller, S.J.: Management of large amounts of data in interactive building walkthroughs. In: Proceedings of the 1992 Symposium on Interactive 3D Graphics, Cambridge, MA, I3D ‘92, pp. 11–20. ACM, New York (1992)
43.
Zurück zum Zitat Paul, W.: Data Ware Housing. Elsevier, Amsterdam (2000) Paul, W.: Data Ware Housing. Elsevier, Amsterdam (2000)
44.
Zurück zum Zitat Matsumoto, K., Nakasato, N., Sedukhin, S.G.: Blocked all-pairs shortest paths algorithm for hybrid CPU-GPU system. In: 2011 IEEE 13th International Conference on High Performance Computing and Communications (HPCC), pp. 145–152. IEEE (2011) Matsumoto, K., Nakasato, N., Sedukhin, S.G.: Blocked all-pairs shortest paths algorithm for hybrid CPU-GPU system. In: 2011 IEEE 13th International Conference on High Performance Computing and Communications (HPCC), pp. 145–152. IEEE (2011)
45.
Zurück zum Zitat Souza, D.S., Santos, H.G., Coelho, I.M., Araujo, J.A.: A hybrid CPU-GPU scatter search for large-sized generalized assignment problems. In: International Conference on Computational Science and Its Applications, pp. 133–147. Springer, Cham (2017) Souza, D.S., Santos, H.G., Coelho, I.M., Araujo, J.A.: A hybrid CPU-GPU scatter search for large-sized generalized assignment problems. In: International Conference on Computational Science and Its Applications, pp. 133–147. Springer, Cham (2017)
46.
Zurück zum Zitat Shehab, M.A., Ghadawi, A.A., Alawneh, L., Al-Ayyoub, M., Jararweh, Y.: A hybrid CPU-GPU implementation to accelerate multiple pairwise protein sequence alignment. In: 2017 8th International Conference on Information and Communication Systems (ICICS), pp. 12–17. IEEE (2017) Shehab, M.A., Ghadawi, A.A., Alawneh, L., Al-Ayyoub, M., Jararweh, Y.: A hybrid CPU-GPU implementation to accelerate multiple pairwise protein sequence alignment. In: 2017 8th International Conference on Information and Communication Systems (ICICS), pp. 12–17. IEEE (2017)
47.
Zurück zum Zitat Nvidia.: “Nvidia Kepler GK110, Next-Generation Cuda Compute Architecture. Nvidia (2017) Nvidia.: “Nvidia Kepler GK110, Next-Generation Cuda Compute Architecture. Nvidia (2017)
48.
Zurück zum Zitat Alandoli, M., Al-Ayyoub, M., Al-Smadi, M., Jararweh, Y., Benkhelifa, E.: Using dynamic parallelism to speed up clustering-based community detection in social networks. In: IEEE International Conference on Future Internet of Things and Cloud Workshops (FiCloudW), pp. 240–245. IEEE (2016) Alandoli, M., Al-Ayyoub, M., Al-Smadi, M., Jararweh, Y., Benkhelifa, E.: Using dynamic parallelism to speed up clustering-based community detection in social networks. In: IEEE International Conference on Future Internet of Things and Cloud Workshops (FiCloudW), pp. 240–245. IEEE (2016)
Metadaten
Titel
Toward a new approach for sorting extremely large data files in the big data era
verfasst von
Ali Shatnawi
Yathrip AlZahouri
Mohammed A. Shehab
Yaser Jararweh
Mahmoud Al-Ayyoub
Publikationsdatum
26.10.2018
Verlag
Springer US
Erschienen in
Cluster Computing / Ausgabe 3/2019
Print ISSN: 1386-7857
Elektronische ISSN: 1573-7543
DOI
https://doi.org/10.1007/s10586-018-2860-1

Weitere Artikel der Ausgabe 3/2019

Cluster Computing 3/2019 Zur Ausgabe

Premium Partner