ABSTRACT
We survey the recent wave of extensions to the popular map-reduce systems, including those that have begun to address the implementation of recursive queries using the same computing environment as map-reduce. A central problem is that recursive tasks cannot deliver their output only at the end, which makes recovery from failures much more complicated than in map-reduce and its nonrecursive extensions. We propose several algorithmic ideas for efficient implementation of recursions in the map-reduce environment and discuss several alternatives for supporting recovery from failures without restarting the entire job.
- F. Afrati, V. Borkar, M. Carey, N. Polyzotis, and J. Ullman. Cluster computing, recursion and datalog. To appear in a book based on the Datalog 2.0 Workshop (March, 2010), Oxford GB, to be published by Springer in 2011. Google ScholarDigital Library
- F. N. Afrati and C. H. Papadimitriou. The parallel complexity of simple chain queries. In PODS, 1987. Google ScholarDigital Library
- F. N. Afrati and J. D. Ullman. Optimizing joins in a map-reduce environment. In EDBT, 2010. Google ScholarDigital Library
- Apache. Hadoop. http://hadoop.apache.org/, 2006.Google Scholar
- Apache. Hdfs. http://hadoop.apache.org/hdfs/, 2008.Google Scholar
- Apache. Hive. http://wiki.apache.org/hadoop/Hive, 2008.Google Scholar
- D. Battré, S. Ewen, F. Hueske, O. Kao, V. Markl, and D. Warneke. Nephele/pacts: a programming model and execution framework for web-scale analytical processing. In SoCC '10: Proceedings of the 1st ACM symposium on Cloud computing, pages 119--130, New York, NY, USA, 2010. ACM. Google ScholarDigital Library
- V. Borkar, M. Carey, R. Grover, N. Onose, and R. Vernica. Hyracks: A flexible and extensible foundation for data-intensive computing. In Proceedings of the IEEE International Conference on Data Engineering, to appear, 2011. Google ScholarDigital Library
- A. Borodin and R. El-Yaniv. Online Computation and Competitive Analysis. Cambridge University Press, 1998. Google ScholarDigital Library
- A. Z. Broder, R. Kumar, F. Maghoul, P. Raghavan, S. Rajagopalan, R. Stata, A. Tomkins, and J. L. Wiener. Graph structure in the web. Computer Networks, 33(1--6):309--320, 2000. Google ScholarDigital Library
- Y. Bu, B. Howe, M. Balazinska, and M. Ernst. Haloop: efficient iterative data processing on large clusters. In VLDB Conference, 2010. Google ScholarDigital Library
- R. Chaiken, B. Jenkins, P. Larson, B. Ramsey, D. Shakib, S. Weaver, and J. Zhou. Scope: easy and efficient parallel processing of massive data sets. In VLDB, 2008. Google ScholarDigital Library
- S. Dar and R. Ramakrishnan. A performance study of transitive closure algorithms. In SIGMOD Conference, pages 454--465, 1994. Google ScholarDigital Library
- J. Dean and S. Ghemawat. Mapreduce: simplified data processing on large clusters. In OSDI, pages 137--150, 2004. Google ScholarDigital Library
- D. J. DeWitt, E. Paulson, E. Robinson, J. F. Naughton, J. Royalty, S. Shankar, and A. Krioukov. Clustera: an integrated computation and data management system. PVLDB, 1(1):28--41, 2008. Google ScholarDigital Library
- S. Ganguly, A. Silberschatz, and S. Tsur. A framework for the parallel processing of datalog queries. SIGMOD Rec., 19:143--152, May 1990. Google ScholarDigital Library
- S. Ghemawat, H. Gobioff, and S.-T. Leung. The google file system. In 19th ACM Symposium on Operating Systems Principles, 2003. Google ScholarDigital Library
- J. M. Hellerstein. Datalog redux: experience and conjecture. In PODS, pages 1--2, 2010. Google ScholarDigital Library
- Y. E. Ioannidis. On the computation of the transitive closure of relational operators. In Proceedings of the 12th International Conference on Very Large Data Bases, VLDB '86, pages 403--411, San Francisco, CA, USA, 1986. Morgan Kaufmann Publishers Inc. Google ScholarDigital Library
- M. Isard, M. Budiu, Y. Yu, A. Birrell, and D. Fetterly. Dryad: distributed data-parallel programs from sequential building blocks. In EuroSys '07, 2007. Google ScholarDigital Library
- R. Kabler, Y. E. Ioannidis, and M. J. Carey. Performance evaluation of algorithms for transitive closure. Inf. Syst., 17(5):415--441, 1992. Google ScholarDigital Library
- M. Lam and et al. Bdd-based deductive database. bddbddb.sourceforge.net, 2008.Google Scholar
- G. Malewicz, M. Austern, A. Bik, J. Dehnert, I. horn, N. Leiser, and G. Czajkowski. Pregel: A system for large-scale graph processing. In SIGMOD Conference, 2010. Google ScholarDigital Library
- C. Olston, B. Reed, U. Srivastava, R. Kumar, and A. Tomkins. Pig latin: a not-so-foreign language for data processing. In SIGMOD Conference, 2008. Google ScholarDigital Library
- S.-W. Seong, M. Nasielski, J. Seo, D. Sengupta, S. Hangal, S. K. Teh, R. Chu, B. Dodson, and M. S. Lam. The architecture and implementation of a decentralized social networking platform. http://prpl.stanford.edu/papers/prpl09.pdf, 2009.Google Scholar
- T. Suen and J. Wong. Efficient task migration algorithm for distributed systems. Parallel and Distributed Systems, IEEE Transactions on, 3(4):488--499, jul. 1992. Google ScholarDigital Library
- J. D. Ullman. Principles of Database and Knowledge-Base Systems, Volume II. Computer Science Press, 1989. Google ScholarDigital Library
- J. D. Ullman and A. V. Gelder. Parallel complexity of logical query programs. In FOCS, 1986. Google ScholarDigital Library
- J. D. Ullman and J. Widom. A first course in database systems. Prentice-Hall, Inc., Upper Saddle River, NJ, USA, 1997. Google ScholarDigital Library
- P. Valduriez and H. Boral. Evaluation of recursive queries using join indices. In Expert Database Conf., pages 271--293, 1986.Google Scholar
- Y. Yu, M. Isard, D. Fetterly, M. Budiu, lfar Erlingsson, P. K. Gunda, and J. Currey. Dryadlinq: A system for general-purpose distributed data-parallel computing using a high-level language. In R. Draves and R. van Renesse, editors, OSDI, pages 1--14. USENIX Association, 2008. Google ScholarDigital Library
Index Terms
- Map-reduce extensions and recursive queries
Recommendations
Scale-out beyond map-reduce
KDD '13: Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data miningThe amount and variety of data being collected in the enterprise is growing at a staggering pace. The default now is to capture and store any and all data, in anticipation of potential future strategic value, and vast amounts of data are being generated ...
Map-reduce-merge: simplified relational data processing on large clusters
SIGMOD '07: Proceedings of the 2007 ACM SIGMOD international conference on Management of dataMap-Reduce is a programming model that enables easy development of scalable parallel applications to process a vast amount of data on large clusters of commodity machines. Through a simple interface with two functions, map and reduce, this model ...
Decidable containment of recursive queries
Database theoryOne of the most important reasoning tasks on queries is checking containment, i.e., verifying whether one query yields necessarily a subset of the result of another one. Query containment is crucial in several contexts, such as query optimization, query ...
Comments