Abstract
We are often thrilled by the abundance of information surrounding us and wish to integrate data from as many sources as possible. However, understanding, analyzing, and using these data are often hard. Too much data can introduce a huge integration cost, such as expenses for purchasing data and resources for integration and cleaning. Furthermore, including low-quality data can even deteriorate the quality of integration results instead of bringing the desired quality gain. Thus, "the more the better" does not always hold for data integration and often "less is more".
In this paper, we study how to select a subset of sources before integration so that we can balance the quality of the integrated data against the integration cost. Inspired by the marginalism principle in economic theory, we wish to integrate a new source only if its marginal gain, often a function of improved integration quality, is higher than its marginal cost, which is associated with data-purchase expense and integration resources. As a first step towards this goal, we focus on data fusion tasks, where the goal is to resolve conflicts from different sources. We propose a randomized solution for selecting sources for fusion and empirically show its effectiveness and scalability on both real-world and synthetic data.
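The marginalism rule described above can be illustrated with a minimal sketch. This is not the randomized solution the paper proposes; it is a plain greedy loop that adds a source only while its marginal gain exceeds its marginal cost. The function name `select_sources`, the toy `COVERAGE` and `COSTS` tables, and the coverage-based gain function are all hypothetical and chosen only for illustration; a real gain function for data fusion would instead estimate the quality of the fused result.

```python
from typing import Callable, Dict, FrozenSet, List


def select_sources(
    costs: Dict[str, float],
    gain: Callable[[FrozenSet[str]], float],
) -> List[str]:
    """Greedily add the source with the largest positive marginal profit.

    Marginal profit = (gain with the source - gain without it) - source cost.
    Selection stops once no remaining source has a positive marginal profit.
    """
    selected: List[str] = []
    remaining = set(costs)
    while remaining:
        current = frozenset(selected)
        base = gain(current)
        best_name, best_profit = None, 0.0
        for name in sorted(remaining):  # sorted for deterministic tie-breaking
            marginal_gain = gain(current | {name}) - base
            profit = marginal_gain - costs[name]
            if profit > best_profit:
                best_name, best_profit = name, profit
        if best_name is None:
            break  # marginal cost now meets or exceeds marginal gain
        selected.append(best_name)
        remaining.remove(best_name)
    return selected


# Hypothetical toy data: the items each source covers and the cost of each source.
COVERAGE = {"A": {1, 2, 3, 4}, "B": {3, 4, 5}, "C": {1, 2}, "D": {6}}
COSTS = {"A": 1.0, "B": 0.5, "C": 0.5, "D": 2.0}


def coverage_gain(chosen: FrozenSet[str]) -> float:
    """Assumed gain: number of distinct items covered by the chosen sources."""
    covered = set()
    for name in chosen:
        covered |= COVERAGE[name]
    return float(len(covered))


chosen = select_sources(COSTS, coverage_gain)
print(chosen)
```

In this toy setting, source C adds nothing once A is selected and source D costs more than the single item it contributes, so neither is integrated even though both hold valid data.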
Index Terms
- Less is more: selecting sources wisely for integration