Information Systems Frontiers

Published: 24.05.2020

An Approach to Extracting Topic-guided Views from the Sources of a Data Lake

Authors: Claudia Diamantini, Paolo Lo Giudice, Domenico Potena, Emanuele Storti, Domenico Ursino

Published in: Information Systems Frontiers | Issue 1/2021

Abstract

In recent years, data lakes have emerged as an effective and efficient support for extracting information and knowledge from huge amounts of highly heterogeneous and rapidly changing data sources. Data lake management requires new techniques that are very different from those adopted for data warehouses in the past. In this scenario, one of the most challenging issues is the extraction of topic-guided (i.e., thematic) views from the (very heterogeneous and often unstructured) sources of a data lake. In this paper, we propose a new network-based model to uniformly represent the structured, semi-structured and unstructured sources of a data lake. Then, we present a new approach to, at least partially, “structuring” unstructured data. Finally, we define a technique to extract topic-guided views from the sources of a data lake, based on similarity and other semantic relationships among source metadata.

http://dbpedia.org/

https://www.zaloni.com/

Recall that, in the database context, a view is the result of a query or of a more complex extraction process that users can exploit for further computations.

http://www.opencalais.com

Here and in the following, to make the presentation smoother, we use the term “source” (resp., “keyword”) to denote both the source (resp., a keyword) and the corresponding node associated with it.

In this paper, we use the term “lemma” according to the meaning it has in BabelNet (Navigli and Ponzetto 2012). Here, given a term, its lemmas are other objects (terms, emoticons, etc.) that contribute to specifying its meaning.

Note that Phases 2 and 4 could be merged into a single one, which would avoid defining arcs with the label “lemmaOf”. Here, we maintain these arcs and both phases to keep the information about similarity between nodes for future use.

Whenever this does not happen, the mapping can be automatically provided by the DBpedia Lookup Service (http://wiki.dbpedia.org/projects/dbpedia-lookup).

Here, two nodes are assumed to be equal if the corresponding names coincide.

In Figs. 3 and 4, we do not show the arc labels for the sources C, W and E because all of them are “contains” and their presence would have complicated the layout unnecessarily.

Hereafter, we use the notation S.o to indicate the object o of the source S.
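The notation S.o and the name-based node equality mentioned in these notes can be illustrated with a small sketch. It is only an illustration under stated assumptions, not the authors' model: the class `SourceGraph`, the function `equal_nodes`, and the object names (`climate`, `weather`, `temperature`) are hypothetical; only the source names C and W, the arc label “contains”, and the equality-by-name rule come from the notes above.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class SourceGraph:
    """Hypothetical container: one data lake source as nodes plus labeled arcs."""

    name: str
    nodes: set[str] = field(default_factory=set)
    # Each arc is stored as (from_node, label, to_node).
    arcs: set[tuple[str, str, str]] = field(default_factory=set)

    def add_arc(self, src: str, label: str, dst: str) -> None:
        """Add a labeled arc, registering both endpoints as nodes."""
        self.nodes.update((src, dst))
        self.arcs.add((src, label, dst))

    def qualified(self, obj: str) -> str:
        """Return the S.o form for an object o of this source S."""
        if obj not in self.nodes:
            raise KeyError(f"{obj!r} is not an object of source {self.name!r}")
        return f"{self.name}.{obj}"


def equal_nodes(a: SourceGraph, b: SourceGraph) -> set[str]:
    """Nodes treated as equal across two sources: those whose names coincide."""
    return a.nodes & b.nodes


# Two toy sources; the object names are invented for illustration.
c = SourceGraph("C")
c.add_arc("climate", "contains", "temperature")
w = SourceGraph("W")
w.add_arc("weather", "contains", "temperature")

print(c.qualified("temperature"))  # C.temperature
print(equal_nodes(c, w))           # {'temperature'}
```

Comparing names exactly is the simplest reading of the equality rule; a real implementation might normalize case or whitespace first, which is not stated in the text.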

In this figure, for layout reasons, we do not show the arc labels because they are the same as the ones of the corresponding arcs of Figs. 3, 4 and 5.

Prefixes dbo and dbr stand for http://dbpedia.org/ontology/ and http://dbpedia.org/resource/, respectively.
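As a minimal sketch of how these two prefixes expand, the following helper turns a prefixed name into a full IRI. The function name `expand` and the example identifiers `dbr:Data_lake` and `dbo:abstract` are assumptions made for illustration; only the two prefix-to-namespace bindings are taken from the note above.

```python
# Prefix bindings as stated in the note above.
PREFIXES = {
    "dbo": "http://dbpedia.org/ontology/",
    "dbr": "http://dbpedia.org/resource/",
}


def expand(prefixed_name: str) -> str:
    """Expand 'prefix:local' into a full IRI using the bindings above."""
    prefix, sep, local = prefixed_name.partition(":")
    if not sep or prefix not in PREFIXES:
        raise ValueError(f"unknown or missing prefix in {prefixed_name!r}")
    return PREFIXES[prefix] + local


print(expand("dbr:Data_lake"))  # http://dbpedia.org/resource/Data_lake
print(expand("dbo:abstract"))   # http://dbpedia.org/ontology/abstract
```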

Consider that, since we have 20 real sources in the data lakes adopted in our experimental campaign, the value of H_j can range in the real interval [0.05, 20].

As a matter of fact, a topic set with 8 keywords would encompass a great number of different concepts and, as such, it would not be generally able to capture a clear and specific desire of a user.

Abiteboul, S., & Duschka, O. (1998). Complexity of answering queries using materialized views. In Proc. of the International Symposium on Principles of Database Systems (SIGMOD/PODS’98) (pp. 254–263). Seattle: ACM.

Aversano, L., Intonti, R., Quattrocchi, C., & Tortorella, M. (2010). Building a virtual view of heterogeneous data source views. In Proc. of the International Conference on Software and Data Technologies (ICSOFT’10) (pp. 266–275). Athens: INSTICC Press.

Bachtarzi, C., & Bachtarzi, F. (2015). A model-driven approach for materialized views definition over heterogeneous databases. In Proc. of the International Conference on New Technologies of Information and Communication (NTIC’15) (pp. 1–5). Mila: IEEE.

Bergamaschi, S., Castano, S., Vincini, M., & Beneventano, D. (2001). Semantic integration and query of heterogeneous information sources. Data & Knowledge Engineering, 36(3), 215–249.

Bidoit, N., Colazzo, D., Malla, N., & Sartiani, C. (2018). Evaluating queries and updates on big xml documents. Information Systems Frontiers, 20(1), 63–90.

Bilalli, B., Abelló, A., Aluja-Banet, T., & Wrembel, R. (2016). Towards intelligent data analysis: the metadata challenge. In Proc. of the International Conference on Internet of Things and Big Data (ioTBD’16) (pp. 331–338). Rome, Italy.

Biskup, J., & Embley, D. (2003). Extracting information from heterogeneous information sources using ontologically specified target views. Information Systems, 28(3), 169–212. Elsevier.

Blei, D., Ng, A., & Jordan, M. (2003). Latent dirichlet allocation. Journal of Machine Learning Research, 3, 993–1022. Microtone Publishing.

Bouadjenek, M.R., Hacid, H., & Bouzeghoub, M. (2016). Social networks and information retrieval, how are they converging? A survey, a taxonomy and an analysis of social information retrieval approaches and platforms. Information Systems, 56, 1–18.

Bougouin, A., Boudin, F., & Daille, B. (2013). Topicrank: Graph-based topic ranking for keyphrase extraction. In Proc. of the International Joint Conference on Natural Language Processing (IJCNLP’13) (pp. 543–551). Nagoya: Asian Federation of Natural Language Processing.

Brackenbury, W., Liu, R., Mondal, M., Elmore, A., Ur, B., Chard, K., & Franklin, M. (2018). Draining the data swamp: A similarity-based approach. In Proc. of the International Workshop on Human-in-the-loop Data Analytics (HILDA’18) (p. 13). Houston: ACM.

Campos, R., Mangaravite, V., Pasquali, A., Jorge, A., Nunes, C., & Jatowt, A. (2020). YAKE! Keyword extraction from single documents using multiple local features. Information Sciences, 509, 257–289. Elsevier.

Castano, S., & Antonellis, V.D. (1999). Building views over semistructured data sources. In Proc. of the International Conference on Conceptual Modeling (ER’99) (pp. 146–160). Paris: Springer.

Chen, C., Shyu, M.-L., & Chen, S.-C. (2016). Weighted subspace modeling for semantic concept retrieval using gaussian mixture models. Information Systems Frontiers, 18(5), 877–889.

Corbellini, A., Mateos, C., Zunino, A., Godoy, D., & Schiaffino, S. (2017). Persisting big-data: The NoSQL landscape. Information Systems, 63, 1–23. Elsevier.

De Meo, P., Quattrone, G., Terracina, G., & Ursino, D. (2006). Integration of XML Schemas at various “severity” levels. Information Systems, 31(6), 397–434.

Debattista, J., Lange, C., & Auer, S. (2014). Representing dataset quality metadata using multi-dimensional views. In Proc. of the International Conference on Semantic Systems (SEM’14) (pp. 92–99). Leipzig: ACM.

Dessi, A., & Atzori, M. (2016). A machine-learning approach to ranking rdf properties. Future Generation Computer Systems, 54, 366–377.

Dublin Core Metadata Initiative. (2012). DCMI Metadata Terms. Technical report.

Fan, W., Wang, X., & Wu, Y. (2016). Answering pattern queries using views. IEEE Transactions on Knowledge and Data Engineering, 28(2), 326–341. IEEE.

Fang, H. (2015). Managing data lakes in big data era: What’s a data lake and why has it became popular in data management ecosystem. In Proc. of the International Conference on Cyber Technology in Automation (CYBER’15) (pp. 820–824). Shenyang: IEEE.

Farid, M., Roatis, A., Ilyas, I., Hoffmann, H., & Chu, X. (2016). CLAMS: bringing quality to data lakes. In Proc. of the International Conference on Management of Data (SIGMOD/PODS’16) (pp. 2089–2092). San Francisco: ACM.

García-Moya, L., Kudama, S., Aramburu, M., & Berlanga, R. (2013). Storing and analysing voice of the market data in the corporate data warehouse. Information Systems Frontiers, 15(3), 331–349.

Hai, R., Geisler, S., & Quix, C. (2016). Constance: an intelligent data lake system. In Proc. of the International Conference on Management of Data (SIGMOD 2016) (pp. 2097–2100). San Francisco: ACM.

Hai, R., Quix, C., & Zhou, C. (2018). Query rewriting for heterogeneous data lakes. In Proc. of the International Conference on European Conference on Advances in Databases and Information Systems (ADBIS’18) (pp. 35–49). Budapest: Springer.

Halevy, A. (2001). Answering queries using views: A survey. The VLDB Journal, 10(4), 270–294. Springer.

Hamadou, H., & Ghozzi, F. (2018). Querying heterogeneous document stores. In Proc. of the International Conference on Enterprise Information Systems (ICEIS’18) (pp. 58–68). Madeira, Portugal.

Heath, T., & Bizer, C. (2011). Linked data: Evolving the web into a global data space. Synthesis lectures on the semantic web: theory and technology, 1(1), 1–136.

Hirschman, A. (1964). The paternity of an index. The American Economic Review, 54(5), 761–762.

Hitzler, P., & Janowicz, K. (2013). Linked data, big data, and the 4th paradigm. Semantic Web, 4(3), 233–235.

Janjua, N., Hussain, F., & Hussain, O. (2013). Semantic information and knowledge integration through argumentative reasoning to support intelligent decision making. Information Systems Frontiers, 15(2), 167–192.

Keith, A., Cyganiak, R., Hausenblas, M., & Zhao, J. (2011). Describing linked datasets with the void vocabulary. Technical report.

Klettke, M., Awolin, H., Storl, U., Muller, D., & Scherzinger, S. (2017). Uncovering the evolution history of data lakes. In Proc. of the International Conference on Big data (IEEE bigdata 2017) (pp. 2462–2471). Boston: IEEE.

Kondrak, G. (2005). N-gram similarity and distance. In String processing and Information Retrieval (pp. 115–126): Springer.

Konstantinou, N., Koehler, M., Abel, E., Civili, C., Neumayr, B., Sallinger, E., Fernandes, A., Gottlob, G., Keane, J., & Libkin, L. (2017). The VADA architecture for cost-effective data wrangling. In Proc. of the International Conference on Management of Data (SIGMOD’17) (pp. 1599–1602). Chicago: ACM.

Lassila, O., Swick, R.R., et al. (1998). Resource description framework (RDF) model and syntax specification.

Maccioni, A., & Torlone, R. (2018). KAYAK: a framework for just-in-time data preparation in a data lake. In Proc. of the international Conference on Advanced information Systems Engineering (CAiSE’18) (pp. 474–489). Tallinn: Springer.

Madhavan, J., Bernstein, P., & Rahm, E. (2001). Generic schema matching with Cupid. In Proc. of the international conference on very large data bases (VLDB 2001) (pp. 49–58). Rome: Morgan Kaufmann.

McPherson, M., Smith-Lovin, L., & Cook, J. (2001). Birds of a feather: Homophily in social networks. Annual Review of Sociology, 27, 415–444. JSTOR.

Mouttham, A., Kuziemsky, C., Langayan, D., Peyton, L., & Pereira, J. (2012). Interoperable support for collaborative, mobile, and accessible health care. Information Systems Frontiers, 14(1), 73–85.

Mouzakitis, S., Papaspyros, D., Petychakis, M., Koussouris, S., Zafeiropoulos, A., Fotopoulou, E., Farid, L., Orlandi, F., Attard, J., & Psarras, J. (2017). Challenges and opportunities in renovating public sector information by enabling linked data and analytics. Information Systems Frontiers, 19(2), 321–336.

Tsvetovat, M., & Kouznetsov, A. (2011). Social Network Analysis for startups: Finding connections on the social web. O’Reilly Media Inc.

Navigli, R., & Ponzetto, S. (2012). BabelNet: The automatic construction, evaluation and application of a wide-coverage multilingual semantic network. Artificial Intelligence, 193, 217–250. Elsevier.

Oram, A. (2015). Managing the Data Lake. Sebastopol, USA: O’Reilly.

Palopoli, L., Pontieri, L., Terracina, G., & Ursino, D. (2000). Intensional and extensional integration and abstraction of heterogeneous databases. Data & Knowledge Engineering, 35(3), 201–237.

Palopoli, L., Saccà, D., Terracina, G., & Ursino, D. (2003a). Uniform techniques for deriving similarities of objects and subschemes in heterogeneous databases. IEEE Transactions on Knowledge and Data Engineering, 15(2), 271–294.

Palopoli, L., Terracina, G., & Ursino, D. (2001). A graph-based approach for extracting terminological properties of elements of XML documents. In Proc. of the International Conference on Data Engineering (ICDE 2001) (pp. 330–337). Heidelberg: IEEE Computer Society.

Palopoli, L., Terracina, G., & Ursino, D. (2003b). DIKE: A system supporting the semi-automatic construction of Cooperative Information Systems from heterogeneous databases. Software Practice & Experience, 33(9), 847–884.

Palopoli, L., Terracina, G., & Ursino, D. (2003c). Experiences using DIKE, a system for supporting cooperative information system and data warehouse design. Information Systems, 28(7), 835–865.

Rose, S., Engel, D., Cramer, N., & Cowley, W. (2010). Automatic keyword extraction from individual documents. Text Mining: Applications and Theory, 1, 1–20. Wiley, New York.

Singh, K., & Singh, V. (2016). Answering graph pattern query using incremental views. In Proc. of the international conference on computing (ICCCA’16) (pp. 54–59). Greater Noida: IEEE.

Spink, A., Wolfram, D., Jansen, M.B.J., & Saracevic, T. (2001). Searching the web: the public and their queries. Journal of the American Society for Information Science and Technology, 52(3), 226–234.

Wang, J., Li, J., & Yu, J. (2011). Answering tree pattern queries using views: a revisit. In Proc. of the international conference on extending database technology (EDBT/ICDT’11) (pp. 153–164). Uppsala: ACM.

Wang, J., & Yu, J. (2012). Revisiting answering tree pattern queries using views. ACM Transactions on Database Systems, 37(3), 18. ACM.

Wu, X., Theodoratos, D., & Wang, W. (2009). Answering XML queries using materialized views revisited. In Proc. of the International Conference on Information and Knowledge Management (CIKM ’09) (pp. 475–484). Hong Kong: ACM.

Yi, J., Maghoul, F., & Pedersen, J. (2008). Deciphering mobile search patterns: a study of yahoo! mobile search queries. In Proceedings of the 17th International Conference on World Wide Web, WWW ’08 (pp. 257–266). New York: ACM.

Title: An Approach to Extracting Topic-guided Views from the Sources of a Data Lake
Authors: Claudia Diamantini, Paolo Lo Giudice, Domenico Potena, Emanuele Storti, Domenico Ursino
Publication date: 24.05.2020
Publisher: Springer US
Published in: Information Systems Frontiers / Issue 1/2021
Print ISSN: 1387-3326
Electronic ISSN: 1572-9419
DOI: https://doi.org/10.1007/s10796-020-10010-x