Abstract
The ubiquity of data lakes has created fascinating new challenges for data management research. In this tutorial, we review the state of the art in data management for data lakes. We consider how data lakes are introducing new problems, such as dataset discovery, and how they are changing the requirements for classic problems, including data extraction, data cleaning, data integration, data versioning, and metadata management.
Index Terms
- Data lake management: challenges and opportunities