DOI: 10.1145/775047.775099
Article

Learning domain-independent string transformation weights for high accuracy object identification

Published: 23 July 2002

ABSTRACT

The task of object identification occurs when integrating information from multiple websites. The same data objects can exist in inconsistent text formats across sites, making it difficult to identify matching objects using exact text matching. Previous methods of object identification have required manual construction of domain-specific string transformations or manual setting of general transformation parameter weights for recognizing format inconsistencies. This manual process can be time-consuming and error-prone. We have developed an object identification system called Active Atlas [18], which applies a set of domain-independent string transformations to compare the objects' shared attributes in order to identify matching objects. In this paper, we discuss extensions to the Active Atlas system, which allow it to learn to tailor the weights of a set of general transformations to a specific application domain through limited user input. The experimental results demonstrate that this approach achieves higher accuracy and requires less user involvement than previous methods across various application domains.
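As a rough illustration of the idea described in the abstract, the Python sketch below compares two attribute strings with a small set of general token-level transformations and estimates one weight per transformation from a few user-labeled pairs. Everything in it is an assumption made for illustration: the two transformations (`equality`, `abbreviation`), the averaging score, and the frequency-based weight estimate are hypothetical and are not taken from the Active Atlas system.

```python
from collections import defaultdict


def equal(a, b):
    # The two tokens are identical.
    return a == b


def prefix(a, b):
    # One token is a proper prefix of the other, e.g. "cal" vs "california".
    return a != b and (a.startswith(b) or b.startswith(a))


# Hypothetical transformation set, tried in insertion order.
TRANSFORMATIONS = {"equality": equal, "abbreviation": prefix}


def tokenize(s):
    # Lowercase, drop periods and commas, split on whitespace.
    return s.lower().replace(".", "").replace(",", "").split()


def fired(a, b):
    """For each token of `a`, name the first transformation that links it
    to some token of `b`, or None if no transformation applies."""
    b_tokens = tokenize(b)
    names = []
    for t in tokenize(a):
        name = next(
            (n for n, f in TRANSFORMATIONS.items()
             if any(f(t, u) for u in b_tokens)),
            None,
        )
        names.append(name)
    return names


def estimate_weights(labeled_pairs):
    """Assumed estimate: weight = fraction of a transformation's firings
    that occurred in pairs the user labeled as matches."""
    hits, total = defaultdict(int), defaultdict(int)
    for a, b, is_match in labeled_pairs:
        for n in fired(a, b):
            if n is not None:
                total[n] += 1
                hits[n] += int(is_match)
    return {n: hits[n] / total[n] for n in total}


def score(a, b, weights):
    """Average the weights of the transformations fired by the tokens of `a`;
    tokens with no applicable transformation contribute 0."""
    names = fired(a, b)
    if not names:
        return 0.0
    return sum(weights.get(n, 0.0) for n in names) / len(names)
```

In this sketch, calling `estimate_weights` on a handful of labeled pairs such as `("Cal Pizza Kitchen", "California Pizza Kitchen", True)` yields a weight per transformation, and `score` then ranks candidate pairs; a threshold on that score would be a further assumption.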

References

  1. N. Abe and H. Mamitsuka. Query learning strategies using boosting and bagging. In Proceedings of the Fifteenth International Conference on Machine Learning, 1998.
  2. Y. Arens, C. Y. Chee, C.-N. Hsu, and C. A. Knoblock. Retrieving and integrating data from multiple information sources. International Journal on Intelligent and Cooperative Information Systems, 2(2):127--158, 1993.
  3. D. Bitton and D. J. DeWitt. Duplicate record elimination in large data files. ACM Transactions on Database Systems, 8(2):255--265, June 1983.
  4. K. W. Church and W. A. Gale. Probability scoring for spelling correction. Statistics and Computing, 1:93--103, 1991.
  5. W. W. Cohen. Integration of heterogeneous databases without common domains using queries based on textual similarity. In SIGMOD Conference, pages 201--212, Seattle, WA, 1998.
  6. I. P. Fellegi and A. B. Sunter. A theory for record-linkage. Journal of the American Statistical Association, 64:1183--1210, 1969.
  7. W. Frakes and R. Baeza-Yates. Information retrieval: Data structures and algorithms. Prentice Hall, 1992.
  8. M. Ganesh, J. Srivastava, and T. Richardson. Mining entity-identification rules for database integration. In Proceedings of the Second International Conference on Data Mining and Knowledge Discovery, pages 291--294, Portland, OR, 1996.
  9. M. Hernandez and S. J. Stolfo. Real-world data is dirty: Data cleansing and the merge/purge problem. In Data Mining and Knowledge Discovery, pages 1--31, New York, NY, 1998.
  10. J. A. Hylton. Identifying and merging related bibliographic records. M.S. thesis. MIT Laboratory for Computer Science Technical Report 678, 1996.
  11. C. A. Knoblock, S. Minton, J. L. Ambite, N. Ashish, I. Muslea, A. G. Philpot, and S. Tejada. The Ariadne approach to web-based information integration. International Journal on Cooperative Information Systems (IJCIS), Special Issue on Intelligent Information Agents: Theory and Applications, 10(1):145--169, 2001.
  12. K. Kukich. Techniques for automatically correcting words in text. ACM Computing Surveys, 24(4):377--439, 1992.
  13. S. Lawrence, K. Bollacker, and C. L. Giles. Autonomous citation matching. In Proceedings of the Third International Conference on Autonomous Agents, New York, 1999.
  14. A. McCallum, K. Nigam, and L. Ungar. Efficient clustering of high-dimensional data sets with application to reference matching. In Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD-2000), 2000.
  15. A. Monge and C. P. Elkan. An efficient domain-independent algorithm for detecting approximately duplicate database records. In Proceedings of the SIGMOD 1997 Workshop on Data Mining and Knowledge Discovery, Tucson, AZ, 1997.
  16. K. Nigam, A. McCallum, S. Thrun, and T. Mitchell. Learning to classify text from labeled and unlabeled documents. In Proceedings of the Fifteenth National Conference on Artificial Intelligence (AAAI-98), 1998.
  17. J. C. Pinheiro and D. X. Sun. Methods for linking and mining massive heterogeneous databases. In Fourth International Conference on Knowledge Discovery and Data Mining, New York, NY, 1998.
  18. S. Tejada, C. A. Knoblock, and S. Minton. Learning object identification rules for information integration. Special Issue on Data Extraction, Cleaning, and Reconciliation, Information Systems Journal, 26(8), 2001.
  19. G. Wiederhold. Intelligent integration of information. In Proceedings of ACM SIGMOD Conference on Management of Data, pages 434--437, Washington, DC, May 1993.
  20. W. Winkler. Record Linkage Software and Methods for Merging Administrative Lists. Statistical Research Division Technical Report RR01---03, U.S. Bureau of Census, 2001.
  21. T. W. Yan and H. Garcia-Molina. Duplicate removal in information dissemination. In Proceedings of VLDB, Zurich, Switzerland, 1995.

Published in

KDD '02: Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining
July 2002
719 pages
ISBN: 158113567X
DOI: 10.1145/775047

Copyright © 2002 ACM

Publisher

Association for Computing Machinery
New York, NY, United States

Acceptance Rates

KDD '02 Paper Acceptance Rate: 44 of 307 submissions, 14%
Overall Acceptance Rate: 1,133 of 8,635 submissions, 13%
