research-article

Data Civilizer 2.0: a holistic framework for data preparation and analytics

Published: 01 August 2019

Abstract

Data scientists spend over 80% of their time (1) parameter-tuning machine learning models and (2) iterating between data cleaning and machine learning model execution. While there are existing efforts to support the first requirement, there is currently no integrated workflow system that couples data cleaning and machine learning development. The previous version of Data Civilizer was geared towards data cleaning and discovery using a set of pre-defined tools. In this paper, we introduce Data Civilizer 2.0, an end-to-end workflow system satisfying both requirements. The system also supports a sophisticated data debugger and a workflow visualization system. In this demo, we will show how we used Data Civilizer 2.0 to help scientists at the Massachusetts General Hospital build their cleaning and machine learning pipeline on their 30 TB brain activity dataset.

Published in

Proceedings of the VLDB Endowment, Volume 12, Issue 12
August 2019
547 pages

Publisher

VLDB Endowment