Abstract
Web technologies have enabled data sharing between sources but have also simplified copying (and, often, publishing without proper attribution). The copying relationships can be complex: some sources copy different subsets of data from multiple sources, some co-copy from the same source, and some copy transitively from another. Understanding such copying relationships is desirable both for business purposes and for improving many key components in data integration, such as resolving conflicts across sources, reconciling distinct references to the same real-world entity, and efficiently answering queries over multiple sources. Recent work has studied how to detect copying between a pair of sources, but these techniques can fall short in the presence of complex copying relationships.
In this paper we describe techniques that discover global copying relationships between a set of structured sources. Towards this goal we make two contributions. First, we propose a global detection algorithm that identifies co-copying and transitive copying, returning only source pairs with direct copying. Second, because global detection requires accurate decisions on copying direction, we significantly improve on previous techniques for deciding direction by considering several types of evidence for copying and the correlation of copying across different data items. Experimental results on real-world and synthetic data show that our techniques are both effective and efficient.