
Data Lifecycle Challenges in Production Machine Learning: A Survey

Authors:
Neoklis Polyzotis, Google Research, Mountain View, CA, USA
Sudip Roy, Google Research, Mountain View, CA, USA
Steven Euijong Whang, KAIST, Daejeon, South Korea
Martin Zinkevich, Google Research, Mountain View, CA, USA

ACM SIGMOD Record, Volume 47, Issue 2, June 2018, pp. 17–28. https://doi.org/10.1145/3299887.3299891

Published: 11 December 2018


Abstract

Machine learning has become an essential tool for gleaning knowledge from data and tackling a diverse set of computationally hard tasks. However, the accuracy of a machine-learned model is deeply tied to the data it is trained on. Designing and building robust processes and tools that make it easier to analyze, validate, and transform the data fed into large-scale machine learning systems poses data management challenges. Drawing on our experience developing data-centric infrastructure for a production machine learning platform at Google, we summarize some of the interesting research challenges we encountered and survey relevant literature from the data management and machine learning communities. Specifically, we explore challenges in three main areas: data understanding, data validation and cleaning, and data preparation. In each area, we explore how the constraints imposed on solutions depend on where in the lifecycle of a model the problems are encountered and who encounters them.


Index Terms

Data Lifecycle Challenges in Production Machine Learning: A Survey
1. Computing methodologies
  1. Machine learning



Published in
ACM SIGMOD Record, Volume 47, Issue 2, June 2018, 68 pages
ISSN: 0163-5808
DOI: 10.1145/3299887
Editors: Yanlei Diao (University of Massachusetts Amherst), Vanessa Braganholo (Universidade Federal Fluminense), Marco Brambilla (Politecnico di Milano), Chee Yong Chan (National University of Singapore), Rada Chirkova (North Carolina State University), Zackary Ives (University of Pennsylvania), Anastasios Kementsietsidis (Google Research), Jeffrey Naughton (University of Wisconsin-Madison), Frank Neven (Hasselt University), Olga Papaemmanouil (Brandeis University), Aditya Parameswaran (University of Illinois), Alkis Simitsis (HP Labs), Wang-Chiew Tan (University of California Santa Cruz), Pınar Tözün (IBM Almaden Research Center), Marianne Winslett (University of Illinois), Jun Yang (Duke University)
Copyright © 2018 Authors
Publisher: Association for Computing Machinery, New York, NY, United States