research-article

Data lake management: challenges and opportunities

Published: 01 August 2019

Abstract

The ubiquity of data lakes has created fascinating new challenges for data management research. In this tutorial, we review the state of the art in data management for data lakes. We consider how data lakes are introducing new problems, including dataset discovery, and how they are changing the requirements for classic problems, including data extraction, data cleaning, data integration, data versioning, and metadata management.


Published in

Proceedings of the VLDB Endowment, Volume 12, Issue 12
August 2019
547 pages

Publisher: VLDB Endowment
