ABSTRACT
MapReduce has become increasingly popular as a simple and efficient paradigm for large-scale data processing. One of the main reasons for its popularity is the availability of a production-level open source implementation, Hadoop, written in Java. Because Python is itself widely used, there is considerable interest in tools that give Python programmers access to the framework. Here we present a Python package that provides an API for both the MapReduce and the distributed file system components of Hadoop, and show its advantages over the other available solutions for Hadoop Python programming, Jython and Hadoop Streaming.
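For readers unfamiliar with the paradigm, the following is a minimal, single-process sketch of a MapReduce word count in plain Python. It is not the API of the package presented here and it does not use Hadoop; the function names (`map_words`, `reduce_counts`, `run_word_count`) are hypothetical and chosen only for illustration. The grouping step stands in for the shuffle that a real framework performs across the cluster.

```python
from collections import defaultdict


def map_words(line):
    """Map phase: emit a (word, 1) pair for every word in one input line."""
    for word in line.split():
        yield word.lower(), 1


def reduce_counts(word, counts):
    """Reduce phase: sum all the counts collected for a single word."""
    return word, sum(counts)


def run_word_count(lines):
    """Run map, shuffle (group by key) and reduce in a single process."""
    groups = defaultdict(list)
    for line in lines:
        for word, count in map_words(line):
            # Shuffle: collect every value emitted under the same key.
            groups[word].append(count)
    return dict(reduce_counts(word, counts) for word, counts in groups.items())


result = run_word_count(["the quick brown fox", "The lazy dog"])
```

In a distributed setting the programmer would still write only the map and reduce functions, while the framework would handle input splitting, the shuffle, and fault tolerance.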
Index Terms
- Pydoop: a Python MapReduce and HDFS API for Hadoop