demonstration

Shark: fast data analysis using coarse-grained distributed memory

Authors:
Cliff Engle, University of California Berkeley, Berkeley, CA, USA
Antonio Lupher, University of California Berkeley, Berkeley, CA, USA
Reynold Xin, University of California Berkeley, Berkeley, CA, USA
Matei Zaharia, University of California Berkeley, Berkeley, CA, USA
Michael J. Franklin, University of California Berkeley, Berkeley, CA, USA
Scott Shenker, University of California Berkeley, Berkeley, CA, USA
Ion Stoica, University of California Berkeley, Berkeley, CA, USA

SIGMOD '12: Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data, May 2012, Pages 689–692
https://doi.org/10.1145/2213836.2213934

Published: 20 May 2012

ABSTRACT

Shark is a research data analysis system built on a novel coarse-grained distributed shared-memory abstraction. Shark marries query processing with deep data analysis, providing a unified system for easy data manipulation using SQL and pushing sophisticated analysis closer to data. It scales to thousands of nodes in a fault-tolerant manner. Shark can answer queries 40X faster than Apache Hive and run machine learning programs 25X faster than MapReduce programs in Apache Hadoop on large datasets.

Index Terms

Shark: fast data analysis using coarse-grained distributed memory
1. Information systems
  1. Data management systems
    1. Database management system engines
      1. Parallel and distributed DBMSs

Published in
SIGMOD '12: Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data
May 2012
886 pages
ISBN:9781450312479
DOI:10.1145/2213836
General Chairs: K. Selçuk Candan (Arizona State University), Yi Chen (Arizona State University), Richard Snodgrass (University of Arizona)
Program Chair: Luis Gravano (Columbia University)
Publications Chair: Ariel Fuxman (Microsoft Research)
Copyright © 2012 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
data warehouse
databases
machine learning
resilient distributed dataset
shark
spark
Acceptance Rates
SIGMOD '12 paper acceptance rate: 48 of 289 submissions, 17%
Overall acceptance rate: 785 of 4,003 submissions, 20%