Skip to main content
Top
Published in: Distributed and Parallel Databases 2/2018

16-10-2017

Executable schema mappings for statistical data processing

Authors: Paolo Atzeni, Luigi Bellomarini, Francesca Bugiotti, Marco De Leonardis

Published in: Distributed and Parallel Databases | Issue 2/2018

Log in

Activate our intelligent search to find suitable subject content or patents.

search-config
loading …

Abstract

Data processing is the core of any statistical information system. Statisticians are interested in specifying transformations and manipulations of data at a high level, in terms of entities of statistical models. We illustrate here a proposal where a high-level language, EXL, is used for the declarative specification of statistical programs, and a translation into executable form in various target systems is available. The language is based on the theory of schema mappings, in particular those defined by a specific class of tgds, which we actually use to optimize user programs and facilitate the translation towards several target systems. The characteristics of such class guarantee good tractability properties and the applicability in Big Data settings. A concrete implementation, EXLEngine, has been carried out and is currently used at the Bank of Italy.

Dont have a licence yet? Then find out more about our products and how to get one now:

Springer Professional "Wirtschaft+Technik"

Online-Abonnement

Mit Springer Professional "Wirtschaft+Technik" erhalten Sie Zugriff auf:

  • über 102.000 Bücher
  • über 537 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Maschinenbau + Werkstoffe
  • Versicherung + Risiko

Jetzt Wissensvorsprung sichern!

Springer Professional "Technik"

Online-Abonnement

Mit Springer Professional "Technik" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 390 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Maschinenbau + Werkstoffe




 

Jetzt Wissensvorsprung sichern!

Springer Professional "Wirtschaft"

Online-Abonnement

Mit Springer Professional "Wirtschaft" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 340 Zeitschriften

aus folgenden Fachgebieten:

  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Versicherung + Risiko




Jetzt Wissensvorsprung sichern!

Footnotes
1
The seasonal decomposition is an operator that decomposes a time series into various components, one of which is the trend, which, roughly speaking considers medium- or long-term “variations”, ignoring seasonal, cyclic (and stochastic) ones [12, 33].
 
2
Note that for the first quarter the PCHNG is not meaningful.
 
3
As we will see in Sect. 5, we also have some egds, which enforce the functional nature of EXL relations.
 
4
That is, repeated elements are meaningful.
 
5
We could say “at most” one operator, but it is easy to assume that there are no statements that just copy a relation with no additional operations.
 
6
The case with several relations is indeed possible and we will discuss it in Sect. 5.4.
 
7
This total order is not strictly necessary, the only thing that is needed is that the rules that involve these general operators are applied only after their operands have been fully computed.
 
8
As in the rest of the paper, we refer to tgds with one atom in the rhs.
 
9
Indeed, the case where the two operators are multi-tuple and have different grouping dimensions requires a slight extension of the syntax, where the grouping dimensions would be specified as an argument of the operator itself, so, for example \(R(x,y,z) \rightarrow Q(x, \hbox {max}(\hbox {avg}(z, \hbox {group by}~x, y)))\), calculates the maximum, grouped by \(x\), of the averages of z, grouped by \(x\) and \(y\).
 
10
Notice that there is the residual, and indeed remote, possibility in which the repeated dimension tuple has the identity element as its measure for the aggregation under consideration or that, in general, the repeated tuples compensate the error. However this condition is value and aggregation dependent and should be considered as a case of “correctness by chance”.
 
11
There is the residual possibility that two EXL statements share a subsexpression, resulting in two tgds sharing one ore more atoms of the lhs. Since we break down all the statements into elementary statements, we could end up having tgds with coinciding premises, which indeed we detect and simplify in the system.
 
Literature
1.
go back to reference Arenas, M., Fagin, R., Nash, A.: Composition with target constraints. Logical Methods Comput. Sci. 7(3) (2011) Arenas, M., Fagin, R., Nash, A.: Composition with target constraints. Logical Methods Comput. Sci. 7(3) (2011)
2.
go back to reference Arenas, M., Gottlob, G., Pieris, A.: Expressive languages for querying the semantic web. In: PODS, pp. 14–26 (2014) Arenas, M., Gottlob, G., Pieris, A.: Expressive languages for querying the semantic web. In: PODS, pp. 14–26 (2014)
3.
go back to reference Atzeni, P., Cappellari, P., Torlone, R., Bernstein, P., Gianforme, G.: Model-independent schema translation. VLDB J. 17, 1347–1370 (2008)CrossRef Atzeni, P., Cappellari, P., Torlone, R., Bernstein, P., Gianforme, G.: Model-independent schema translation. VLDB J. 17, 1347–1370 (2008)CrossRef
4.
go back to reference Atzeni, P., Bellomarini, L., Bugiotti, F., Gianforme, G.: MISM: a platform for model-independent solutions to model management problems. J. Data Semant. 14, 133–161 (2009)CrossRef Atzeni, P., Bellomarini, L., Bugiotti, F., Gianforme, G.: MISM: a platform for model-independent solutions to model management problems. J. Data Semant. 14, 133–161 (2009)CrossRef
5.
go back to reference Atzeni, P., Bellomarini, L., Bugiotti, F., Celli, F., Gianforme, G.: A runtime approach to model-generic translation of schema and data. Inf. Syst. 37, 269–287 (2012)CrossRef Atzeni, P., Bellomarini, L., Bugiotti, F., Celli, F., Gianforme, G.: A runtime approach to model-generic translation of schema and data. Inf. Syst. 37, 269–287 (2012)CrossRef
6.
go back to reference Atzeni, P., Bellomarini, L., Bugiotti, F.: Exlengine: executable schema mappings for statistical data processing. In: EDBT, pp. 672–682 (2013) Atzeni, P., Bellomarini, L., Bugiotti, F.: Exlengine: executable schema mappings for statistical data processing. In: EDBT, pp. 672–682 (2013)
7.
go back to reference Bellomarini, L., Gottlob, G., Pieris, A., Sallinger, E.: Swift logic for big data and knowledge graphs. In: IJCAI, pp. 2–10 (2017) Bellomarini, L., Gottlob, G., Pieris, A., Sallinger, E.: Swift logic for big data and knowledge graphs. In: IJCAI, pp. 2–10 (2017)
8.
go back to reference Bernstein, P.A., Melnik, S.: Model management 2.0: manipulating richer mappings. In: SIGMOD Conference, pp. 1–12 (2007) Bernstein, P.A., Melnik, S.: Model management 2.0: manipulating richer mappings. In: SIGMOD Conference, pp. 1–12 (2007)
9.
go back to reference Boehm, M., Tatikonda, S., Reinwald, B., Sen, P., Tian, Y., Burdick, D., Vaithyanathan, S.: Hybrid parallelization strategies for large-scale machine learning in systemml. PVLDB 7(7), 553–564 (2014) Boehm, M., Tatikonda, S., Reinwald, B., Sen, P., Tian, Y., Burdick, D., Vaithyanathan, S.: Hybrid parallelization strategies for large-scale machine learning in systemml. PVLDB 7(7), 553–564 (2014)
10.
go back to reference Bonifati, A., Chang, E.Q., Ho, T., Lakshmanan, L.V.S., Pottinger, R.: Heptox: Marrying XML and heterogeneity in your P2P databases. In: VLDB, pp. 1267–1270 (2005) Bonifati, A., Chang, E.Q., Ho, T., Lakshmanan, L.V.S., Pottinger, R.: Heptox: Marrying XML and heterogeneity in your P2P databases. In: VLDB, pp. 1267–1270 (2005)
11.
go back to reference Brockwell, P.J., Davis, R.A. (eds.): Introduction to Time Series and Forecasting. Springer, New York (2002)MATH Brockwell, P.J., Davis, R.A. (eds.): Introduction to Time Series and Forecasting. Springer, New York (2002)MATH
12.
go back to reference Calì, A., Gottlob, G., Lukasiewicz, T.: A general datalog-based framework for tractable query answering over ontologies. In: PODS, pp. 77–86 (2009) Calì, A., Gottlob, G., Lukasiewicz, T.: A general datalog-based framework for tractable query answering over ontologies. In: PODS, pp. 77–86 (2009)
13.
go back to reference Calvanese, D., De Giacomo, G., Lembo, D., Lenzerini, M., Rosati, R.: Data complexity of query answering in description logics (extended abstract). In: IJCAI, pp. 4163–4167 (2015) Calvanese, D., De Giacomo, G., Lembo, D., Lenzerini, M., Rosati, R.: Data complexity of query answering in description logics (extended abstract). In: IJCAI, pp. 4163–4167 (2015)
14.
go back to reference Chaudhuri, S.: An overview of query optimization in relational systems. In: PODS, PODS ’98, pp. 34–43, New York, NY, USA, (1998). ACM Chaudhuri, S.: An overview of query optimization in relational systems. In: PODS, PODS ’98, pp. 34–43, New York, NY, USA, (1998). ACM
15.
go back to reference Chaudhuri, S., Shim, K.: Including group-by in query optimization. In: VLDB, pp. 354–366. Morgan Kaufmann, Burlington (1994) Chaudhuri, S., Shim, K.: Including group-by in query optimization. In: VLDB, pp. 354–366. Morgan Kaufmann, Burlington (1994)
16.
go back to reference Cohen, J., Dolan, B., Dunlap, M., Hellerstein, J.M., Welton, C.: MAD skills: new analysis practices for big data. PVLDB 2(2), 1481–1492 (2009) Cohen, J., Dolan, B., Dunlap, M., Hellerstein, J.M., Welton, C.: MAD skills: new analysis practices for big data. PVLDB 2(2), 1481–1492 (2009)
17.
go back to reference Das, S., Sismanis, Y., Beyer, K.S., Gemulla, R., Haas, P.J., McPherson, J.: Ricardo: integrating R and hadoop. In: SIGMOD, pp. 987–998 (2010) Das, S., Sismanis, Y., Beyer, K.S., Gemulla, R., Haas, P.J., McPherson, J.: Ricardo: integrating R and hadoop. In: SIGMOD, pp. 987–998 (2010)
20.
go back to reference Dessloch, S., Hernández, M., Wisnesky, R., Radwan, A., Zhou, J.: Orchid: integrating schema mapping and ETL. In: ICDE, pp. 1307–1316 (2008) Dessloch, S., Hernández, M., Wisnesky, R., Radwan, A., Zhou, J.: Orchid: integrating schema mapping and ETL. In: ICDE, pp. 1307–1316 (2008)
21.
22.
go back to reference Fagin, R., Kolaitis, P.G., Miller, R.J., Popa, L.: Data exchange: semantics and query answering. In: ICDT, pp. 207–224 (2003) Fagin, R., Kolaitis, P.G., Miller, R.J., Popa, L.: Data exchange: semantics and query answering. In: ICDT, pp. 207–224 (2003)
23.
go back to reference Fagin, R., Kolaitis, P.G., Popa, L.: Data exchange: getting to the core. ACM Trans. Database Syst. 30(1), 174–210 (2005)CrossRefMATH Fagin, R., Kolaitis, P.G., Popa, L.: Data exchange: getting to the core. ACM Trans. Database Syst. 30(1), 174–210 (2005)CrossRefMATH
24.
go back to reference Fagin, R., Kolaitis, P.G., Popa, L., Tan, W.C.: Composing schema mappings: second-order dependencies to the rescue. ACM Trans. Database Syst. 30(4), 994–1055 (2005)CrossRef Fagin, R., Kolaitis, P.G., Popa, L., Tan, W.C.: Composing schema mappings: second-order dependencies to the rescue. ACM Trans. Database Syst. 30(4), 994–1055 (2005)CrossRef
25.
go back to reference Fagin, R., Haas, L., Hernández, M., Miller, R., Popa, L., Velegrakis, Y.: Clio: schema mapping creation and data exchange. In: Conceptual Modeling: Foundations and Applications, pp. 198–236 (2009) Fagin, R., Haas, L., Hernández, M., Miller, R., Popa, L., Velegrakis, Y.: Clio: schema mapping creation and data exchange. In: Conceptual Modeling: Foundations and Applications, pp. 198–236 (2009)
26.
go back to reference Fagin, R., Kolaitis, P.G., Popa, L., Tan, W.C.: Schema mapping evolution through composition and inversion. In: Schema Matching and Mapping, pp. 191–222 (2011) Fagin, R., Kolaitis, P.G., Popa, L., Tan, W.C.: Schema mapping evolution through composition and inversion. In: Schema Matching and Mapping, pp. 191–222 (2011)
27.
go back to reference Gottlob, G., Pichler, R., Savenkov, V.: Normalization and optimization of schema mappings. PVLDB 2(1), 1102–1113 (2009) Gottlob, G., Pichler, R., Savenkov, V.: Normalization and optimization of schema mappings. PVLDB 2(1), 1102–1113 (2009)
28.
go back to reference Haas, L.M., Hernández, M.A., Ho, H., Popa, L., Roth, M.: Clio grows up: from research prototype to industrial tool. In: SIGMOD, pp. 805–810. ACM (2005) Haas, L.M., Hernández, M.A., Ho, H., Popa, L., Roth, M.: Clio grows up: from research prototype to industrial tool. In: SIGMOD, pp. 805–810. ACM (2005)
29.
go back to reference Kolaitis, P.: Schema mappings, data exchange, and metadata management. In: PODS, pp. 61–75 (2005) Kolaitis, P.: Schema mappings, data exchange, and metadata management. In: PODS, pp. 61–75 (2005)
30.
go back to reference Kolaitis, P.G., Panttaja, J., Tan, W.C.: The complexity of data exchange. In: SIGMOD, pp. 30–39 (2006) Kolaitis, P.G., Panttaja, J., Tan, W.C.: The complexity of data exchange. In: SIGMOD, pp. 30–39 (2006)
31.
go back to reference Mahdi, E.: A survey of r software for parallel computing. Am. J. Appl. Math. Stat. 2(4), 224–230 (2014)CrossRef Mahdi, E.: A survey of r software for parallel computing. Am. J. Appl. Math. Stat. 2(4), 224–230 (2014)CrossRef
32.
go back to reference Mecca, G., Papotti, P., Raunich, S.: Core schema mappings: scalable core computations in data exchange. Inf. Syst. 37(7), 677–711 (2012)CrossRef Mecca, G., Papotti, P., Raunich, S.: Core schema mappings: scalable core computations in data exchange. Inf. Syst. 37(7), 677–711 (2012)CrossRef
33.
go back to reference Mumick, I.S., Pirahesh, H., Ramakrishnan, R.: The magic of duplicates and aggregates. In: VLDB, pp. 264–277 (1990) Mumick, I.S., Pirahesh, H., Ramakrishnan, R.: The magic of duplicates and aggregates. In: VLDB, pp. 264–277 (1990)
34.
go back to reference Ramsay, J.O., Hooker, G., Graves, S. (eds.): Functional Data Analysis with R and Matlab. Springer, New York (2009)MATH Ramsay, J.O., Hooker, G., Graves, S. (eds.): Functional Data Analysis with R and Matlab. Springer, New York (2009)MATH
35.
go back to reference Sallinger, E.: Reasoning about schema mappings. In: Data Exchange, Integration, and Streams, pp. 97–127 (2013) Sallinger, E.: Reasoning about schema mappings. In: Data Exchange, Integration, and Streams, pp. 97–127 (2013)
36.
go back to reference Schmidberger, M., Morgan, M., Eddelbuettel, D., Yu, H., Tierney, L., Mansmann, U.: State of the art in parallel computing with r. J. Stat. Softw. 31(1), 1–27 (2009). 8CrossRef Schmidberger, M., Morgan, M., Eddelbuettel, D., Yu, H., Tierney, L., Mansmann, U.: State of the art in parallel computing with r. J. Stat. Softw. 31(1), 1–27 (2009). 8CrossRef
37.
go back to reference Stonebraker, M., Becla, J., DeWitt, D.J., Lim, K., Maier, D., Ratzesberger, O., Zdonik, S.B.: Requirements for science data bases and SCIDB. In: CIDR (2009) Stonebraker, M., Becla, J., DeWitt, D.J., Lim, K., Maier, D., Ratzesberger, O., Zdonik, S.B.: Requirements for science data bases and SCIDB. In: CIDR (2009)
Metadata
Title
Executable schema mappings for statistical data processing
Authors
Paolo Atzeni
Luigi Bellomarini
Francesca Bugiotti
Marco De Leonardis
Publication date
16-10-2017
Publisher
Springer US
Published in
Distributed and Parallel Databases / Issue 2/2018
Print ISSN: 0926-8782
Electronic ISSN: 1573-7578
DOI
https://doi.org/10.1007/s10619-017-7212-2

Other articles of this Issue 2/2018

Distributed and Parallel Databases 2/2018 Go to the issue

Premium Partner