Article

Learning domain-independent string transformation weights for high accuracy object identification

Authors:
Sheila Tejada, University of Southern California, Marina del Rey, CA
Craig A. Knoblock, University of Southern California, Marina del Rey, CA
Steven Minton, Fetch Technologies, Marina del Rey, CA

KDD '02: Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining, July 2002, Pages 350–359
https://doi.org/10.1145/775047.775099

Published: 23 July 2002

ABSTRACT

The task of object identification occurs when integrating information from multiple websites. The same data objects can exist in inconsistent text formats across sites, making it difficult to identify matching objects using exact text match. Previous methods of object identification have required manual construction of domain-specific string transformations or manual setting of general transformation parameter weights for recognizing format inconsistencies. This manual process can be time-consuming and error-prone. We have developed an object identification system called Active Atlas [18], which applies a set of domain-independent string transformations to compare the objects' shared attributes in order to identify matching objects. In this paper, we discuss extensions to the Active Atlas system, which allow it to learn to tailor the weights of a set of general transformations to a specific application domain through limited user input. The experimental results demonstrate that this approach achieves higher accuracy and requires less user involvement than previous methods across various application domains.
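
The general idea in the abstract, generic string transformations whose weights are tuned to a domain from a few labeled examples, can be illustrated with a small sketch. This is not the Active Atlas implementation: the three transformations (`equality`, `initial`, `prefix`), the frequency-based weight estimate in `learn_weights`, the normalisation in `score`, and the labeled pairs are all hypothetical choices made only for illustration.

```python
from collections import Counter

# Hypothetical token-level transformations (assumed, not the paper's set).
def equality(a, b):
    return a == b

def initial(a, b):
    # One token is a single letter equal to the first letter of the other.
    short, long_ = sorted((a, b), key=len)
    return len(short) == 1 and len(long_) > 1 and long_[0] == short

def prefix(a, b):
    # One token is a proper prefix of the other, e.g. "deli" / "delicatessen".
    return a != b and (a.startswith(b) or b.startswith(a))

# Order matters: the first transformation that applies to a token is recorded.
TRANSFORMATIONS = [("equality", equality), ("initial", initial), ("prefix", prefix)]

def tokens(text):
    return text.lower().replace(".", " ").replace(",", " ").split()

def applied(s, t):
    """For each token of s, the first transformation linking it to some token of t."""
    names = []
    t_tokens = tokens(t)
    for a in tokens(s):
        for name, fn in TRANSFORMATIONS:
            if any(fn(a, b) for b in t_tokens):
                names.append(name)
                break
    return names

def learn_weights(labeled_pairs):
    """Assumed estimate: share of a transformation's firings that fall in matching pairs."""
    fired, fired_in_match = Counter(), Counter()
    for s, t, is_match in labeled_pairs:
        for name in applied(s, t):
            fired[name] += 1
            if is_match:
                fired_in_match[name] += 1
    return {name: fired_in_match[name] / fired[name] for name in fired}

def score(s, t, weights):
    """Sum of weights over linked tokens, divided by the longer token count."""
    total = sum(weights.get(name, 0.0) for name in applied(s, t))
    return total / max(len(tokens(s)), len(tokens(t)), 1)

# Made-up labeled pairs standing in for the "limited user input".
labeled = [
    ("Art's Deli", "Arts Delicatessen", True),
    ("Craig A. Knoblock", "C. Knoblock", True),
    ("Cafe Roma", "Cafe Luna", False),
    ("S. Minton", "S. Tejada", False),
]
weights = learn_weights(labeled)
```

In this toy data, exact token equality often fires on non-matching pairs, so it receives a lower weight than the initial and prefix transformations; with other labeled pairs the ranking could differ, which is the point of tailoring weights to a domain.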

References

[1] N. Abe and H. Mamitsuka. Query learning strategies using boosting and bagging. In Proceedings of the Fifteenth International Conference on Machine Learning, 1998.
[2] Y. Arens, C. Y. Chee, C.-N. Hsu, and C. A. Knoblock. Retrieving and integrating data from multiple information sources. International Journal on Intelligent and Cooperative Information Systems, 2(2):127–158, 1993.
[3] D. Bitton and D. J. DeWitt. Duplicate record elimination in large data files. ACM Transactions on Database Systems, 8(2):255–265, June 1983.
[4] K. W. Church and W. A. Gale. Probability scoring for spelling correction. Statistics and Computing, 1:93–103, 1991.
[5] W. W. Cohen. Integration of heterogeneous databases without common domains using queries based on textual similarity. In SIGMOD Conference, pages 201–212, Seattle, WA, 1998.
[6] I. P. Fellegi and A. B. Sunter. A theory for record-linkage. Journal of the American Statistical Association, 64:1183–1210, 1969.
[7] W. Frakes and R. Baeza-Yates. Information retrieval: Data structures and algorithms. Prentice Hall, 1992.
[8] M. Ganesh, J. Srivastava, and T. Richardson. Mining entity-identification rules for database integration. In Proceedings of the Second International Conference on Data Mining and Knowledge Discovery, pages 291–294, Portland, OR, 1996.
[9] M. Hernandez and S. J. Stolfo. Real-world data is dirty: Data cleansing and the merge/purge problem. In Data Mining and Knowledge Discovery, pages 1–31, New York, NY, 1998.
[10] J. A. Hylton. Identifying and merging related bibliographic records. M.S. thesis. MIT Laboratory for Computer Science Technical Report 678, 1996.
[11] C. A. Knoblock, S. Minton, J. L. Ambite, N. Ashish, I. Muslea, A. G. Philpot, and S. Tejada. The Ariadne approach to web-based information integration. International Journal on Cooperative Information Systems (IJCIS), Special Issue on Intelligent Information Agents: Theory and Applications, 10(1):145–169, 2001.
[12] K. Kukich. Techniques for automatically correcting words in text. ACM Computing Surveys, 24(4):377–439, 1992.
[13] S. Lawrence, K. Bollacker, and C. L. Giles. Autonomous citation matching. In Proceedings of the Third International Conference on Autonomous Agents, New York, 1999.
[14] A. McCallum, K. Nigam, and L. Ungar. Efficient clustering of high-dimensional data sets with application to reference matching. In Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD-2000), 2000.
[15] A. Monge and C. P. Elkan. An efficient domain-independent algorithm for detecting approximately duplicate database records. In Proceedings of the SIGMOD 1997 Workshop on Data Mining and Knowledge Discovery, Tucson, AZ, 1997.
[16] K. Nigam, A. McCallum, S. Thrun, and T. Mitchell. Learning to classify text from labeled and unlabeled documents. In Proceedings of the Fifteenth National Conference on Artificial Intelligence (AAAI-98), 1998.
[17] J. C. Pinheiro and D. X. Sun. Methods for linking and mining massive heterogeneous databases. In Fourth International Conference on Knowledge Discovery and Data Mining, New York, NY, 1998.
[18] S. Tejada, C. A. Knoblock, and S. Minton. Learning object identification rules for information integration. Special Issue on Data Extraction, Cleaning, and Reconciliation, Information Systems Journal, 26(8), 2001.
[19] G. Wiederhold. Intelligent integration of information. In Proceedings of ACM SIGMOD Conference on Management of Data, pages 434–437, Washington, DC, May 1993.
[20] W. Winkler. Record Linkage Software and Methods for Merging Administrative Lists. Statistical Research Division Technical Report RR01-03, U.S. Bureau of Census, 2001.
[21] T. W. Yan and H. Garcia-Molina. Duplicate removal in information dissemination. In Proceedings of VLDB, Zurich, Switzerland, 1995.


Published in
KDD '02: Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining
July 2002
719 pages
ISBN: 158113567X
DOI: 10.1145/775047
Conference Chair: Osmar R. Zaïane, University of Alberta, Canada
General Chair: Randy Goebel, University of Alberta, Canada
Program Chairs: David Hand, Imperial College, UK; Daniel Keim, AT&T; Raymond Ng, University of British Columbia, Canada
Copyright © 2002 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Acceptance Rates
KDD '02 Paper Acceptance Rate: 44 of 307 submissions, 14%
Overall Acceptance Rate: 1,133 of 8,635 submissions, 13%