Distributed and Parallel Databases

Published in:

16-10-2017

Executable schema mappings for statistical data processing

Authors: Paolo Atzeni, Luigi Bellomarini, Francesca Bugiotti, Marco De Leonardis

Published in: Distributed and Parallel Databases | Issue 2/2018

Abstract

Data processing is the core of any statistical information system. Statisticians are interested in specifying transformations and manipulations of data at a high level, in terms of entities of statistical models. We illustrate here a proposal where a high-level language, EXL, is used for the declarative specification of statistical programs, and a translation into executable form in various target systems is available. The language is based on the theory of schema mappings, in particular those defined by a specific class of tgds, which we actually use to optimize user programs and facilitate the translation towards several target systems. The characteristics of this class guarantee good tractability properties and applicability in Big Data settings. A concrete implementation, EXLEngine, has been carried out and is currently used at the Bank of Italy.

Seasonal decomposition is an operator that splits a time series into several components. One of them is the trend, which, roughly speaking, captures medium- or long-term “variations” and ignores seasonal, cyclic (and stochastic) ones [12, 33].

Note that for the first quarter the PCHNG is not meaningful.
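
The following minimal sketch illustrates why the first value is undefined. It assumes that PCHNG denotes the percentage change with respect to the previous period, which this note does not state explicitly; the function name and the sample figures are hypothetical and are not taken from EXL or EXLEngine.

```python
def pct_change(values):
    """Percentage change of each value with respect to the previous one.

    Assumption: this is what PCHNG computes. The first position has no
    predecessor, so its result is None rather than a number.
    """
    if not values:
        return []
    changes = [None]  # no previous period for the first observation
    for prev, cur in zip(values, values[1:]):
        # A zero predecessor also leaves the change undefined.
        changes.append(None if prev == 0 else 100.0 * (cur - prev) / prev)
    return changes

# Hypothetical quarterly figures.
quarterly = [200.0, 210.0, 189.0]
changes = pct_change(quarterly)
```

With these sample figures, only the second and third quarters receive a numeric value.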

As we will see in Sect. 5, we also have some egds, which enforce the functional nature of EXL relations.

That is, repeated elements are meaningful.

We could say “at most” one operator, but it is easy to assume that there are no statements that just copy a relation with no additional operations.

The case with several relations is indeed possible and we will discuss it in Sect. 5.4.

This total order is not strictly necessary: the only requirement is that the rules involving these general operators are applied only after their operands have been fully computed.
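
As a rough illustration of this requirement, any order that respects the dependencies between relations is acceptable. The sketch below uses the standard-library `graphlib` module (Python 3.9 or later); the relation names and the dependency map are invented for the example and do not come from the paper.

```python
from graphlib import TopologicalSorter

# Hypothetical dependencies: each derived relation maps to the set of
# relations it reads, i.e. the operands that must be computed first.
depends_on = {
    "R2": {"R1"},
    "R3": {"R2"},
    "R4": {"R3", "R1"},
}

# static_order() yields every relation after all of its operands.
order = list(TopologicalSorter(depends_on).static_order())

# Check the property stated in the note: operands always come first.
position = {name: i for i, name in enumerate(order)}
operands_first = all(
    position[source] < position[target]
    for target, sources in depends_on.items()
    for source in sources
)
```

Several different orders may satisfy the check, which is the sense in which a single total order is not needed.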

As in the rest of the paper, we refer to tgds with one atom in the rhs.

Indeed, the case where the two operators are multi-tuple and have different grouping dimensions requires a slight extension of the syntax, in which the grouping dimensions are specified as an argument of the operator itself. For example, \(R(x,y,z) \rightarrow Q(x, \hbox{max}(\hbox{avg}(z, \hbox{group by}~x, y)))\) calculates the maximum, grouped by \(x\), of the averages of \(z\), grouped by \(x\) and \(y\).
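
The nested aggregation in this example can be sketched in plain Python. The relation R is modeled as a list of (x, y, z) tuples with made-up values, and the helper names are illustrative only; this is not how EXLEngine evaluates the rule.

```python
from collections import defaultdict

def avg_by_xy(rows):
    """Inner aggregation: average of z grouped by (x, y)."""
    groups = defaultdict(list)
    for x, y, z in rows:
        groups[(x, y)].append(z)
    return {key: sum(zs) / len(zs) for key, zs in groups.items()}

def max_of_avg_by_x(rows):
    """Outer aggregation: maximum, grouped by x, of the (x, y) averages."""
    result = {}
    for (x, _y), mean in avg_by_xy(rows).items():
        result[x] = mean if x not in result else max(result[x], mean)
    return result

# Hypothetical sample data for R(x, y, z).
R = [
    ("a", 1, 10.0), ("a", 1, 20.0),  # average for (a, 1) is 15.0
    ("a", 2, 40.0),                  # average for (a, 2) is 40.0
    ("b", 1, 5.0), ("b", 1, 7.0),    # average for (b, 1) is 6.0
]
Q = max_of_avg_by_x(R)
```

The two functions group on different dimension sets, which is why the grouping has to be stated per operator.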

Notice that there is the residual, and indeed remote, possibility that the repeated dimension tuple has the identity element as its measure for the aggregation under consideration or that, more generally, the repeated tuples compensate for the error. However, this condition depends on the values and on the aggregation, and should be regarded as a case of “correctness by chance”.
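
A small sketch of this situation, using sum as the aggregation (its identity element is 0). The tuples and names are invented for illustration.

```python
def total(rows):
    """Sum the measure over a bag of (dimension, measure) tuples."""
    return sum(measure for _dim, measure in rows)

# Hypothetical relation without repeated tuples.
clean = [("q1", 5), ("q2", 0), ("q3", 7)]

# The repeated tuple carries the identity element of sum as its measure,
# so the aggregate is unaffected: correct, but only by chance.
dup_identity = clean + [("q2", 0)]

# The repeated tuple carries a nonzero measure, so the aggregate is wrong.
dup_nonzero = clean + [("q3", 7)]

same_by_chance = total(dup_identity) == total(clean)
differs = total(dup_nonzero) != total(clean)
```

Whether the duplicate is harmless depends entirely on the values involved and on the chosen aggregation.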

There is the residual possibility that two EXL statements share a subexpression, resulting in two tgds sharing one or more atoms of the lhs. Since we break down all statements into elementary statements, we could end up with tgds that have coinciding premises, which we detect and simplify in the system.

http://kettle.pentaho.com/.

http://mahout.apache.org.

http://spark.apache.org.

1. Arenas, M., Fagin, R., Nash, A.: Composition with target constraints. Logical Methods Comput. Sci. 7(3) (2011)

2. Arenas, M., Gottlob, G., Pieris, A.: Expressive languages for querying the semantic web. In: PODS, pp. 14–26 (2014)

3. Atzeni, P., Cappellari, P., Torlone, R., Bernstein, P., Gianforme, G.: Model-independent schema translation. VLDB J. 17, 1347–1370 (2008)

4. Atzeni, P., Bellomarini, L., Bugiotti, F., Gianforme, G.: MISM: a platform for model-independent solutions to model management problems. J. Data Semant. 14, 133–161 (2009)

5. Atzeni, P., Bellomarini, L., Bugiotti, F., Celli, F., Gianforme, G.: A runtime approach to model-generic translation of schema and data. Inf. Syst. 37, 269–287 (2012)

6. Atzeni, P., Bellomarini, L., Bugiotti, F.: EXLEngine: executable schema mappings for statistical data processing. In: EDBT, pp. 672–682 (2013)

7. Bellomarini, L., Gottlob, G., Pieris, A., Sallinger, E.: Swift logic for big data and knowledge graphs. In: IJCAI, pp. 2–10 (2017)

8. Bernstein, P.A., Melnik, S.: Model management 2.0: manipulating richer mappings. In: SIGMOD Conference, pp. 1–12 (2007)

9. Boehm, M., Tatikonda, S., Reinwald, B., Sen, P., Tian, Y., Burdick, D., Vaithyanathan, S.: Hybrid parallelization strategies for large-scale machine learning in SystemML. PVLDB 7(7), 553–564 (2014)

10. Bonifati, A., Chang, E.Q., Ho, T., Lakshmanan, L.V.S., Pottinger, R.: Heptox: Marrying XML and heterogeneity in your P2P databases. In: VLDB, pp. 1267–1270 (2005)

11. Brockwell, P.J., Davis, R.A. (eds.): Introduction to Time Series and Forecasting. Springer, New York (2002)

12. Calì, A., Gottlob, G., Lukasiewicz, T.: A general datalog-based framework for tractable query answering over ontologies. In: PODS, pp. 77–86 (2009)

13. Calvanese, D., De Giacomo, G., Lembo, D., Lenzerini, M., Rosati, R.: Data complexity of query answering in description logics (extended abstract). In: IJCAI, pp. 4163–4167 (2015)

14. Chaudhuri, S.: An overview of query optimization in relational systems. In: PODS, pp. 34–43. ACM, New York (1998)

15. Chaudhuri, S., Shim, K.: Including group-by in query optimization. In: VLDB, pp. 354–366. Morgan Kaufmann, Burlington (1994)

16. Cohen, J., Dolan, B., Dunlap, M., Hellerstein, J.M., Welton, C.: MAD skills: new analysis practices for big data. PVLDB 2(2), 1481–1492 (2009)

17. Das, S., Sismanis, Y., Beyer, K.S., Gemulla, R., Haas, P.J., McPherson, J.: Ricardo: integrating R and Hadoop. In: SIGMOD, pp. 987–998 (2010)

18. Del Vecchio, V.: Statistical data and concepts representation. Bank of Italy (1997). http://goo.gl/YIAqDp

19. Del Vecchio, V., Di Giovanni, F., Pambianco, S.: The “matrix” model. Bank of Italy (2007). http://goo.gl/Dj2XT0

20. Dessloch, S., Hernández, M., Wisnesky, R., Radwan, A., Zhou, J.: Orchid: integrating schema mapping and ETL. In: ICDE, pp. 1307–1316 (2008)

21. Di Giovanni, F., Piazza, D.: Processing and managing statistical data: a national central bank experience. Bank of Italy (2009). http://goo.gl/ZNi5zh

22. Fagin, R., Kolaitis, P.G., Miller, R.J., Popa, L.: Data exchange: semantics and query answering. In: ICDT, pp. 207–224 (2003)

23. Fagin, R., Kolaitis, P.G., Popa, L.: Data exchange: getting to the core. ACM Trans. Database Syst. 30(1), 174–210 (2005)

24. Fagin, R., Kolaitis, P.G., Popa, L., Tan, W.C.: Composing schema mappings: second-order dependencies to the rescue. ACM Trans. Database Syst. 30(4), 994–1055 (2005)

25. Fagin, R., Haas, L., Hernández, M., Miller, R., Popa, L., Velegrakis, Y.: Clio: schema mapping creation and data exchange. In: Conceptual Modeling: Foundations and Applications, pp. 198–236 (2009)

26. Fagin, R., Kolaitis, P.G., Popa, L., Tan, W.C.: Schema mapping evolution through composition and inversion. In: Schema Matching and Mapping, pp. 191–222 (2011)

27. Gottlob, G., Pichler, R., Savenkov, V.: Normalization and optimization of schema mappings. PVLDB 2(1), 1102–1113 (2009)

28. Haas, L.M., Hernández, M.A., Ho, H., Popa, L., Roth, M.: Clio grows up: from research prototype to industrial tool. In: SIGMOD, pp. 805–810. ACM (2005)

29. Kolaitis, P.: Schema mappings, data exchange, and metadata management. In: PODS, pp. 61–75 (2005)

30. Kolaitis, P.G., Panttaja, J., Tan, W.C.: The complexity of data exchange. In: SIGMOD, pp. 30–39 (2006)

31. Mahdi, E.: A survey of R software for parallel computing. Am. J. Appl. Math. Stat. 2(4), 224–230 (2014)

32. Mecca, G., Papotti, P., Raunich, S.: Core schema mappings: scalable core computations in data exchange. Inf. Syst. 37(7), 677–711 (2012)

33. Mumick, I.S., Pirahesh, H., Ramakrishnan, R.: The magic of duplicates and aggregates. In: VLDB, pp. 264–277 (1990)

34. Ramsay, J.O., Hooker, G., Graves, S. (eds.): Functional Data Analysis with R and Matlab. Springer, New York (2009)

35. Sallinger, E.: Reasoning about schema mappings. In: Data Exchange, Integration, and Streams, pp. 97–127 (2013)

36. Schmidberger, M., Morgan, M., Eddelbuettel, D., Yu, H., Tierney, L., Mansmann, U.: State of the art in parallel computing with R. J. Stat. Softw. 31(1), 1–27 (2009)

37. Stonebraker, M., Becla, J., DeWitt, D.J., Lim, K., Maier, D., Ratzesberger, O., Zdonik, S.B.: Requirements for science data bases and SCIDB. In: CIDR (2009)

Title: Executable schema mappings for statistical data processing
Authors: Paolo Atzeni, Luigi Bellomarini, Francesca Bugiotti, Marco De Leonardis
Publication date: 16-10-2017
Publisher: Springer US
Published in: Distributed and Parallel Databases / Issue 2/2018
Print ISSN: 0926-8782
Electronic ISSN: 1573-7578
DOI: https://doi.org/10.1007/s10619-017-7212-2