Global detection of complex copying relationships between sources

Published: 01 September 2010
Abstract

Web technologies have enabled data sharing between sources but also simplified copying (and often publishing without proper attribution). The copying relationships can be complex: some sources copy from multiple sources on different subsets of data; some co-copy from the same source, and some transitively copy from another. Understanding such copying relationships is desirable both for business purposes and for improving many key components in data integration, such as resolving conflicts across various sources, reconciling distinct references to the same real-world entity, and efficiently answering queries over multiple sources. Recent works have studied how to detect copying between a pair of sources, but the techniques can fall short in the presence of complex copying relationships.

In this paper we describe techniques that discover global copying relationships between a set of structured sources. Towards this goal we make two contributions. First, we propose a global detection algorithm that identifies co-copying and transitive copying, returning only source pairs with direct copying. Second, global detection requires accurate decisions on copying direction; we significantly improve over previous techniques on this by considering various types of evidence for copying and correlation of copying on different data items. Experimental results on real-world data and synthetic data show high effectiveness and efficiency of our techniques.

Published in

Proceedings of the VLDB Endowment, Volume 3, Issue 1-2
September 2010
1658 pages

Publisher: VLDB Endowment
