research-article

Efficient big data processing in Hadoop MapReduce

Published: 01 August 2012

Abstract

This tutorial is motivated by the clear need of many organizations, companies, and researchers to deal with big data volumes efficiently. Examples include web analytics applications, scientific applications, and social networks. A popular data processing engine for big data is Hadoop MapReduce. Early versions of Hadoop MapReduce suffered from severe performance problems. Today, this is becoming history. There are many techniques that can be used with Hadoop MapReduce jobs to boost performance by orders of magnitude. In this tutorial we teach such techniques. First, we will briefly familiarize the audience with Hadoop MapReduce and motivate its use for big data processing. Then, we will focus on different data management techniques, going from job optimization to physical data organization like data layouts and indexes. Throughout this tutorial, we will highlight the similarities and differences between Hadoop MapReduce and Parallel DBMS. Furthermore, we will point out unresolved research problems and open issues.

Published in

Proceedings of the VLDB Endowment, Volume 5, Issue 12
August 2012
340 pages

Publisher: VLDB Endowment