ABSTRACT
The task of object identification occurs when integrating information from multiple websites. The same data objects can exist in inconsistent text formats across sites, making it difficult to identify matching objects using exact text match. Previous methods of object identification have required manual construction of domain-specific string transformations or manual setting of general transformation parameter weights for recognizing format inconsistencies. This manual process can be time-consuming and error-prone. We have developed an object identification system called Active Atlas [18], which applies a set of domain-independent string transformations to compare the objects' shared attributes in order to identify matching objects. In this paper, we discuss extensions to the Active Atlas system, which allow it to learn to tailor the weights of a set of general transformations to a specific application domain through limited user input. The experimental results demonstrate that this approach achieves higher accuracy and requires less user involvement than previous methods across various application domains.
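To make the idea of weighted, domain-independent string transformations concrete, the following is a minimal sketch and not the Active Atlas implementation. The two transformations (`equal`, `prefix`), the weight values, and the scoring rule in `similarity` are all assumptions chosen for illustration; how the weights would be learned from user input is not shown.

```python
from typing import Callable, Dict, List

# Hypothetical token-level transformations (illustrative only; not the
# transformation set used by Active Atlas). Each returns True when it
# can relate token `a` to token `b`.
def equal(a: str, b: str) -> bool:
    return a == b

def prefix(a: str, b: str) -> bool:
    # One token is a proper prefix of the other, e.g. "deli" / "delicatessen".
    return a != b and (a.startswith(b) or b.startswith(a))

TRANSFORMATIONS: Dict[str, Callable[[str, str], bool]] = {
    "equal": equal,
    "prefix": prefix,
}

def similarity(x: str, y: str, weights: Dict[str, float]) -> float:
    """Score two attribute values with weighted transformations.

    Assumed scoring rule: each token of `x` contributes the weight of the
    best transformation that relates it to some token of `y`; the sum is
    normalized by the longer token count.
    """
    xs: List[str] = x.lower().split()
    ys: List[str] = y.lower().split()
    if not xs or not ys:
        return 0.0
    total = 0.0
    for a in xs:
        best = 0.0
        for b in ys:
            for name, applies in TRANSFORMATIONS.items():
                if applies(a, b):
                    best = max(best, weights.get(name, 0.0))
        total += best
    return total / max(len(xs), len(ys))

# Two hypothetical weight settings: the same pair of strings scores
# differently depending on how much the domain trusts prefix matches.
trusting = {"equal": 1.0, "prefix": 0.8}
skeptical = {"equal": 1.0, "prefix": 0.2}
score_trusting = similarity("Art's Deli", "Art's Delicatessen", trusting)
score_skeptical = similarity("Art's Deli", "Art's Delicatessen", skeptical)
```

In this sketch the weights are plain parameters, so tailoring them to a domain changes the match score without touching the transformations themselves.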
- N. Abe and H. Mamitsuka. Query learning strategies using boosting and bagging. In Proceedings of the Fifteenth International Conference on Machine Learning, 1998.
- Y. Arens, C. Y. Chee, C.-N. Hsu, and C. A. Knoblock. Retrieving and integrating data from multiple information sources. International Journal on Intelligent and Cooperative Information Systems, 2(2):127--158, 1993.
- D. Bitton and D. J. DeWitt. Duplicate record elimination in large data files. ACM Transactions on Database Systems, 8(2):255--265, June 1983.
- K. W. Church and W. A. Gale. Probability scoring for spelling correction. Statistics and Computing, 1:93--103, 1991.
- W. W. Cohen. Integration of heterogeneous databases without common domains using queries based on textual similarity. In SIGMOD Conference, pages 201--212, Seattle, WA, 1998.
- I. P. Fellegi and A. B. Sunter. A theory for record-linkage. Journal of the American Statistical Association, 64:1183--1210, 1969.
- W. Frakes and R. Baeza-Yates. Information retrieval: Data structures and algorithms. Prentice Hall, 1992.
- M. Ganesh, J. Srivastava, and T. Richardson. Mining entity-identification rules for database integration. In Proceedings of the Second International Conference on Data Mining and Knowledge Discovery, pages 291--294, Portland, OR, 1996.
- M. Hernandez and S. J. Stolfo. Real-world data is dirty: Data cleansing and the merge/purge problem. In Data Mining and Knowledge Discovery, pages 1--31, New York, NY, 1998.
- J. A. Hylton. Identifying and merging related bibliographic records. M.S. thesis. MIT Laboratory for Computer Science Technical Report 678, 1996.
- C. A. Knoblock, S. Minton, J. L. Ambite, N. Ashish, I. Muslea, A. G. Philpot, and S. Tejada. The Ariadne approach to web-based information integration. International Journal on Cooperative Information Systems (IJCIS), Special Issue on Intelligent Information Agents: Theory and Applications, 10(1):145--169, 2001.
- K. Kukich. Techniques for automatically correcting words in text. ACM Computing Surveys, 24(4):377--439, 1992.
- S. Lawrence, K. Bollacker, and C. L. Giles. Autonomous citation matching. In Proceedings of the Third International Conference on Autonomous Agents, New York, 1999.
- A. McCallum, K. Nigam, and L. Ungar. Efficient clustering of high-dimensional data sets with application to reference matching. In Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD-2000), 2000.
- A. Monge and C. P. Elkan. An efficient domain-independent algorithm for detecting approximately duplicate database records. In Proceedings of the SIGMOD 1997 Workshop on Data Mining and Knowledge Discovery, Tucson, AZ, 1997.
- K. Nigam, A. McCallum, S. Thrun, and T. Mitchell. Learning to classify text from labeled and unlabeled documents. In Proceedings of the Fifteenth National Conference on Artificial Intelligence (AAAI-98), 1998.
- J. C. Pinheiro and D. X. Sun. Methods for linking and mining massive heterogeneous databases. In Fourth International Conference on Knowledge Discovery and Data Mining, New York, NY, 1998.
- S. Tejada, C. A. Knoblock, and S. Minton. Learning object identification rules for information integration. Special Issue on Data Extraction, Cleaning, and Reconciliation, Information Systems Journal, 26(8), 2001.
- G. Wiederhold. Intelligent integration of information. In Proceedings of ACM SIGMOD Conference on Management of Data, pages 434--437, Washington, DC, May 1993.
- W. Winkler. Record Linkage Software and Methods for Merging Administrative Lists. Statistical Research Division Technical Report RR01-03, U.S. Bureau of Census, 2001.
- T. W. Yan and H. Garcia-Molina. Duplicate removal in information dissemination. In Proceedings of VLDB, Zurich, Switzerland, 1995.
Index Terms
- Learning domain-independent string transformation weights for high accuracy object identification