ABSTRACT
We propose efficient techniques for processing various Top-K count queries on data with noisy duplicates. Our method differs from existing work on duplicate elimination in two significant ways: First, we dedup on the fly only the part of the data needed for the answer --- a requirement in massive and evolving sources where batch deduplication is expensive. The non-local nature of the problem of partitioning data into duplicate groups, makes it challenging to filter only those tuples forming the K largest groups. We propose a novel method of successively collapsing and pruning records which yield an order of magnitude reduction in running time compared to deduplicating the entire data first.
Second, we return multiple high scoring answers to handle situations where it is impossible to resolve if two records are indeed duplicates of each other. Since finding even the highest scoring deduplication is NP-hard, the existing approach is to deploy one of many variants of score-based clustering algorithms which do not easily generalize to finding multiple groupings. We model deduplication as a segmentation of a linear embedding of records and present a polynomial time algorithm for finding the R highest scoring answers. This method closely matches the accuracy of an exact exponential time algorithm on several datasets.
- R. Ananthakrishna, S. chaudhuri, and V. Ganti. Eliminating fuzzy duplicates in data warehouses. In VLDB, 2002. Google ScholarDigital Library
- P. Andritsos, A. Fuxman, and R. J. Miller. Clean answers over dirty databases: A probabilistic approach. In ICDE, 2006. Google ScholarDigital Library
- A. Arasu, V. Ganti, and R. Kaushik. Efficient exact set-similarity joins. In VLDB, 2006. Google ScholarDigital Library
- N. Bansal, A. Blum, and S. Chawla. Correlation clustering. In FOCS '02: Proceedings of the 43rd Symposium on Foundations of Computer Science, page 238, Washington, DC, USA, 2002. IEEE Computer Society. Google ScholarDigital Library
- I. Bhattacharya and L. Getoor. Collective entity resolution in relational data. TKDD, 1(1), 2007. Google ScholarDigital Library
- M. Bilenko. Learnable similarity functions and their applications to clustering and record linkage. In Proceedings of the Nineteenth National Conference on Artificial Intelligence, Sixteenth Conference on Innovative Applications of Artificial Intelligence, July 25--29, 2004, San Jose, California, USA, pages 981--982. AAAI Press / The MIT Press, 2004. Google ScholarDigital Library
- M. Bilenko, S. Basu, and M. Sahami. Adaptive product normalization: Using online learning for record linkage in comparison shopping. In ICDM, 2005. Google ScholarDigital Library
- M. Bilenko, B. Kamath, and R. J. Mooney. Adaptive blocking: Learning to scale up record linkage. In ICDM, 2006. Google ScholarDigital Library
- M. Bilenko, R. Mooney, W. Cohen, P. Ravikumar, and S. Fienberg. Adaptive name-matching in information integration. IEEE Intelligent Systems, 2003. Google ScholarDigital Library
- M. Charikar, V. Guruswami, and A. Wirth. Clustering with qualitative information. J. Comput. Syst. Sci., 71(3):360--383, 2005. Google ScholarDigital Library
- S. Chaudhuri, B.-C. Chen, V. Ganti, and R. Kaushik. Example-driven design of efficient record matching queries. In VLDB, pages 327--338, 2007. Google ScholarDigital Library
- S. Chaudhuri, K. Ganjam, V. Ganti, and R. Motwani. Robust and efficient fuzzy match for online data cleaning. In SIGMOD, 2003. Google ScholarDigital Library
- S. Chaudhuri, V. Ganti, and R. Motwani. Robust identification of fuzzy duplicates. In ICDE, 2005. Google ScholarDigital Library
- D. Cheng, R. Kannan, S. Vempala, and G. Wang. A divide-and-merge methodology for clustering. ACM Trans. Database Syst., 31(4):1499--1525, 2006. Google ScholarDigital Library
- W. Cohen and J. Richman. Learning to match and cluster entity names. In ACM SIGIR' 01 Workshop on Mathematical/Formal Methods in Information Retrieval, 2001.Google Scholar
- W. W. Cohen. Data integration using similarity joins and a word-based information representation language. ACM Transactions on Information Systems, 18(3):288--321, July 2000. Google ScholarDigital Library
- X. Dong, A. Y. Halevy, and C. Yu. Data integration with uncertainty. In VLDB '07: Proceedings of the 33rd international conference on Very large data bases, pages 687--698, 2007. Google ScholarDigital Library
- I. P. Fellegi and A. B. Sunter. A theory for record linkage. Journal of the American Statistical Society, 64:1183--1210, 1969.Google ScholarCross Ref
- L. Gravano, P. Ipeirotis, H. V. Jagadish, N. Koudas, S. Muthukrishnan, and D. Srivastava. Approximate string joins in a database (almost) for free. In Proc. of the 27th Int'l Conference on Very Large Databases (VLDB), Rome, Italy, 2001. Google ScholarDigital Library
- I. F. Ilyas, G. Beskales, and M. A. Soliman. Survey of top-k query processing techniques in relational database systems,. To Appear in the ACM Computing Surveys, 2008, 2008. Google ScholarDigital Library
- A. K. Jain and R. C. Dubes. Algorithms for Clustering Data. Prentice Hall, 1988. Google ScholarDigital Library
- J. Ko, T. Mitamura, and E. Nyberg. Language-independent probabilistic answer ranking for question answering. In ACL, 2007.Google Scholar
- D. Koller and N. Friedman. Structured probabilistic models. Under preparation, 2007.Google Scholar
- Y. Koren and D. Harel. A multi-scale algorithm for the linear arrangement problem. In WG, 2002. Google ScholarDigital Library
- C. Li, K. C.-C. Chang, and I. F. Ilyas. Supporting ad-hoc ranking aggregates. In SIGMOD Conference, 2006. Google ScholarDigital Library
- A. McCallum, K. Nigam, and L. H. Ungar. Efficient clustering of high-dimensional data sets with application to reference matching. In Knowledge Discovery and Data Mining, pages 169--178, 2000. Google ScholarDigital Library
- A. McCallum and B. Wellner. Toward conditional models of identity uncertainty with application to proper noun coreference. In Proceedings of the IJCAI-2003 Workshop on Information Integration on the Web, pages 79--86, Acapulco, Mexico, Aug. 2003.Google Scholar
- A. E. Monge and C. P. Elkan. The field matching problem: Algorithms and applications. In Proceedings of the Second International Conference on Knowledge Discovery and Data Mining (KDD-96), 1996.Google Scholar
- H. Pasula, B. Marthi, B. Milch, S. Russell, and I. Shpitser. Identity uncertainty and citation matching. In Advances in Neural Processing Systems 15, Vancouver, British Columbia, 2002. MIT Press.Google Scholar
- S. Sarawagi. The crf project: a java implementation. http://crf.sourceforge.net, 2004.Google Scholar
- S. Sarawagi and A. Bhamidipaty. Interactive deduplication using active learning. In Proc. of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining(KDD-2002), Edmonton, Canada, July 2002. Google ScholarDigital Library
- S. Sarawagi and A. Kirpal. Scaling up the alias duplicate elimination system: A demostration. In Proc. of the 19th IEEE Int'l Conference on Data Engineering (ICDE), Bangalore, March 2003.Google ScholarCross Ref
- S. Sarawagi and A. Kirpal. Efficient set joins on similarity predicates. In Proceedings of the ACM SIGMOD International Conference on Management of Data, 2004. Google ScholarDigital Library
- A. D. Sarma, X. Dong, and A. Halevy. Bootstrapping pay-as-you-go data integration systems. In SIGMOD '08: Proceedings of the 2008 ACM SIGMOD international conference on Management of data, 2008. Google ScholarDigital Library
- M. A. Soliman, I. F. Ilyas, and K. C.-C. Chang. Probabilistic top- and ranking-aggregate queries. ACM Trans. Database Syst., 33(3), 2008. Google ScholarDigital Library
- M. L. Wick, K. Rohanimanesh, K. Schultz, and A. McCallum. A unified approach for schema matching, coreference and canonicalization. In KDD, 2008. Google ScholarDigital Library
- X. Yang, B. Wang, and C. Li. Cost-based variable-length-gram selection for string collections to support approximate queries efficiently. In SIGMOD Conference, pages 353--364, 2008. Google ScholarDigital Library
- Efficient top-k count queries over imprecise duplicates
Recommendations
Scalable and efficient processing of top-k multiple-type integrated queries
AbstractIn this paper, we define a new class of queries, the top-k multiple-type integrated query (simply, top-k MULTI query). It deals with multiple data types and finds the information in the order of relevance between the query and the object. Various ...
Efficient execution of top-k SPARQL queries
ISWC'12: Proceedings of the 11th international conference on The Semantic Web - Volume Part ITop-k queries, i.e. queries returning the top k results ordered by a user-defined scoring function, are an important category of queries. Order is an important property of data that can be exploited to speed up query processing. State-of-the-art SPARQL ...
Efficient top-k retrieval for user preference queries
SAC '11: Proceedings of the 2011 ACM Symposium on Applied ComputingEfficient retrieval of the most relevant (i.e. top-k) tuples is an important requirement in information systems which access large amounts of data. In general answering a top-k query request means to retrieve the k-objects which score best for an ...
Comments