Data Lifecycle Challenges in Production Machine Learning: A Survey

Published: 11 December 2018

Abstract

Machine learning has become an essential tool for gleaning knowledge from data and tackling a diverse set of computationally hard tasks. However, the accuracy of a machine-learned model is deeply tied to the data that it is trained on. Designing and building robust processes and tools that make it easier to analyze, validate, and transform the data fed into large-scale machine learning systems poses data management challenges. Drawing on our experience developing data-centric infrastructure for a production machine learning platform at Google, we summarize some of the interesting research challenges that we encountered and survey relevant literature from the data management and machine learning communities. Specifically, we explore challenges in three main areas: data understanding, data validation and cleaning, and data preparation. In each area, we explore how the constraints imposed on solutions depend on where in a model's lifecycle the problems are encountered and who encounters them.
