ABSTRACT
We design and implement Mars, a MapReduce framework, on graphics processors (GPUs). MapReduce is a distributed programming framework originally proposed by Google for the ease of development of web search applications on a large number of commodity CPUs. Compared with CPUs, GPUs have an order of magnitude higher computation power and memory bandwidth, but are harder to program since their architectures are designed as a special-purpose co-processor and their programming interfaces are typically for graphics applications. As the first attempt to harness GPU's power for MapReduce, we developed Mars on an NVIDIA G80 GPU, which contains over one hundred processors, and evaluated it in comparison with Phoenix, the state-of-the-art MapReduce framework on multi-core CPUs. Mars hides the programming complexity of the GPU behind the simple and familiar MapReduce interface. It is up to 16 times faster than its CPU-based counterpart for six common web applications on a quad-core machine.
- A. Ailamaki, N. K. Govindaraju, S. Harizopoulos, and D. Manocha. Query co-processing on commodity processors. In VLDB '06: Proceedings of the 32nd international conference on Very large data bases, pages 1267--1267. VLDB Endowment, 2006. Google ScholarDigital Library
- AMD CTM. http://ati.amd.com/products/streamprocessor/, 2007.Google Scholar
- Apache Hadoop. http://lucene.apache.org/hadoop/, 2006.Google Scholar
- D. Blythe. The direct3d 10 system. ACM Trans. Graph., 25(3):724--734, 2006. Google ScholarDigital Library
- I. Buck, T. Foley, D. Horn, J. Sugerman, K. Fatahalian, M. Houston, and P. Hanrahan. Brook for gpus: stream computing on graphics hardware. ACM Trans. Graph., 23(3):777--786, 2004. Google ScholarDigital Library
- B. Catanzaro, N. Sundaram, and K. Keutzer. A map reduce framework for programming graphics processors. In Workshop on Software Tools for MultiCore Systems, 2008.Google Scholar
- H. chih Yang, A. Dasdan, R.-L. Hsiao, and D. S. Parker. Map-reduce-merge: simplified relational data processing on large clusters. In SIGMOD '07: Proceedings of the 2007 ACM SIGMOD international conference on Management of data, pages 1029--1040, New York, NY, USA, 2007. ACM. Google ScholarDigital Library
- C. Chu, S. Kim, Y. Lin, Y. Yu, G. Bradski, A. Y. Ng, and K. Olukotun. Map-reduce for machine learning on multicore. In NIPS '07: Proceedings of Twenty-First Annual Conference on Neural Information Processing Systems. Neural Information Processing Systems Foundation, 2007.Google Scholar
- J. Dean and S. Ghemawat. Mapreduce: simplified data processing on large clusters. Commun. ACM, 51(1):107--113, 2008. Google ScholarDigital Library
- J. Feng, S. Chakraborty, B. Schmidt, W. Liu, and U. D. Bordoloi. Fast schedulability analysis using commodity graphics hardware. 13th IEEE International Conference on Embedded and Real-Time Computing Systems and Applications, 2007. Google ScholarDigital Library
- Folding@home. http://www.scei.co.jp/folding, 2007.Google Scholar
- N. Govindaraju, J. Gray, R. Kumar, and D. Manocha. GPUTeraSort: high performance graphics co-processor sorting for large database management. In SIGMOD '06: Proceedings of the 2006 ACM SIGMOD international conference on Management of data, pages 325--336, New York, NY, USA, 2006. ACM. Google ScholarDigital Library
- N. Govindaraju, B. Lloyd, W. Wang, M. Lin, and D. Manocha. Fast computation of database operations using graphics processors. In SIGMOD '04: Proceedings of the 2004 ACM SIGMOD international conference on Management of data, pages 215--226, New York, NY, USA, 2004. ACM. Google ScholarDigital Library
- M. Harris, J. Owens, S. Sengupta, Y. Zhang, and A. Davidson. Cudpp: Cuda data parallel primitives library. 2007.Google Scholar
- B. He, N. K. Govindaraju, Q. Luo, and B. Smith. Efficient gather and scatter operations on graphics processors. In SC '07: Proceedings of the 2007 ACM/IEEE conference on Supercomputing, pages 1--12, New York, NY, USA, 2007. ACM. Google ScholarDigital Library
- B. He, K. Yang, R. Fang, M. Lu, N. Govindaraju, Q. Luo, and P. Sander. Relational joins on graphics processors. In SIGMOD '08: Proceedings of the 2008 ACM SIGMOD international conference on Management of data, pages 511--524, New York, NY, USA, 2008. ACM. Google ScholarDigital Library
- D. Horn. Lib GPU FFT, 2006.Google Scholar
- C. Jiang and M. Snir. Automatic tuning matrix multiplication performance on graphics hardware. In PACT '05: Proceedings of the 14th International Conference on Parallel Architectures and Compilation Techniques, pages 185--196, Washington, DC, USA, 2005. IEEE Computer Society. Google ScholarDigital Library
- E. S. Larsen and D. McAllister. Fast matrix multiplies using graphics hardware. In Supercomputing '01: Proceedings of the 2001 ACM/IEEE conference on Supercomputing (CDROM), pages 55--55, New York, NY, USA, 2001. ACM. Google ScholarDigital Library
- M. D. Linderman, J. D. Collins, H. Wang, and T. H. Meng. Merge: a programming model for heterogeneous multi-core systems. In ASPLOS XIII: Proceedings of the 13th international conference on Architectural support for programming languages and operating systems, pages 287--296, New York, NY, USA, 2008. ACM. Google ScholarDigital Library
- W. Liu, B. Schmidt, G. Voss, and W. Wittig. Streaming algorithms for biological sequence alignment on gpus. IEEE Transactions on Parallel and Distributed Systems, 18:1270--1281, 2007. Google ScholarDigital Library
- H. Nguyen. GPU gems 3. Addison-Wesley, 2008. Google ScholarDigital Library
- NVIDIA Corp. . CUDA Occupancy Calculator, 2007.Google Scholar
- NVIDIA CUDA. http://developer.nvidia.com/object/cuda.html, 2007.Google Scholar
- OpenGL. http://www.opengl.org/, 2007.Google Scholar
- J. D. Owens, D. Luebke, N. Govindaraju, M. Harris, J. Kruger, A. E. Lefohn, and T. J. Purcell. A survey of general-purpose computation on graphics hardware. Computer Graphics Forum, 26, 2007.Google Scholar
- C. Ranger, R. Raghuraman, A. Penmetsa, G. Bradski, and C. Kozyrakis. Evaluating mapreduce for multi-core and multiprocessor systems. In HPCA '07: Proceedings of the 2007 IEEE 13th International Symposium on High Performance Computer Architecture, pages 13--24, Washington, DC, USA, 2007. IEEE Computer Society. Google ScholarDigital Library
- S. Sengupta, M. Harris, Y. Zhang, and J. D. Owens. Scan primitives for gpu computing. In GH '07: Proceedings of the 22nd ACM SIGGRAPH/EUROGRAPHICS symposium on Graphics hardware, pages 97--106, Aire-la-Ville, Switzerland, Switzerland, 2007. Eurographics Association. Google ScholarDigital Library
- SETI@home. http://setiathome.berkeley.edu/, 2007.Google Scholar
- D. Tarditi, S. Puri, and J. Oglesby. Accelerator: using data parallelism to program gpus for general-purpose uses. SIGOPS Oper. Syst. Rev., 40(5):325--335, 2006. Google ScholarDigital Library
- R. Yates and B. Neto. Modern information retrieval. Addison Wesley, 1 edition, 1999. Google ScholarDigital Library
Index Terms
- Mars: a MapReduce framework on graphics processors
Recommendations
Mars: Accelerating MapReduce with Graphics Processors
We design and implement Mars, a MapReduce runtime system accelerated with graphics processing units (GPUs). MapReduce is a simple and flexible parallel programming paradigm originally proposed by Google, for the ease of large-scale data processing on ...
Clustering billions of data points using GPUs
UCHPC-MAW '09: Proceedings of the combined workshops on UnConventional high performance computing workshop plus memory access workshopIn this paper, we report our research on using GPUs to accelerate clustering of very large data sets, which are common in today's real world applications. While many published works have shown that GPUs can be used to accelerate various general purpose ...
GPU-accelerated predicate evaluation on column store
WAIM'10: Proceedings of the 11th international conference on Web-age information managementColumn scan, or predicate evaluation and filtering over a column of data in a database table, is an important primitive for data mining and data warehousing. In this paper, we present our study on accelerating column scan using a massively parallel ...
Comments