ABSTRACT
In this paper, we explore the power of randomized algorithm to address the challenge of working with very large amounts of data. We apply these algorithms to generate noun similarity lists from 70 million pages. We reduce the running time from quadratic to practically linear in the number of elements to be computed.
- Banko, M. and Brill, E. 2001. Mitigating the paucity of dataproblem. In Proceedings of HLT. 2001. San Diego, CA. Google ScholarDigital Library
- Box, G. E. P. and M. E. Muller 1958. Ann. Math. Stat. 29, 610--611.Google ScholarCross Ref
- Broder, Andrei 1997. On the Resemblance and Containment of Documents. Proceedings of the Compression and Complexity of Sequences. Google ScholarDigital Library
- Cavnar, W. B. and J. M. Trenkle 1994. N-Gram-Based Text Categorization. In Proceedings of Third Annual Symposium on Document Analysis and Information Retrieval, Las Vegas, NV, UNLV Publications/Reprographics, 161--175.Google Scholar
- Charikar, Moses 2002. Similarity Estimation Techniques from Rounding Algorithms In Proceedings of the 34th Annual ACM Symposium on Theory of Computing. Google ScholarDigital Library
- Church, K. and Hanks, P. 1989. Word association norms, mutual information, and lexicography. In Proceedings of ACL-89. pp. 76--83. Vancouver, Canada. Google ScholarDigital Library
- Curran, J. and Moens, M. 2002. Scaling context space. In Proceedings of ACL-02 pp 231--238, Philadelphia, PA. Google ScholarDigital Library
- Goemans, M. X. and D. P. Williamson 1995. Improved Approximation Algorithms for Maximum Cut and Satisfiability Problems Using Semidefinite Programming. JACM 42(6):1115--1145. Google ScholarDigital Library
- Hindle, D. 1990. Noun classification from predicate-argument structures. In Proceedings of ACL-90. pp. 268--275. Pittsburgh, PA. Google ScholarDigital Library
- Lin, D. 1998. Automatic retrieval and clustering of similar words. In Proceedings of COLING/ACL-98. pp. 768--774. Montreal, Canada. Google ScholarDigital Library
- Indyk, P., Motwani, R. 1998. Approximate nearest neighbors: towards removing the curse of dimensionality Proceedings of 30thSTOC, 604--613. Google ScholarDigital Library
- A. Kolcz, A. Chowdhury, J. Alspector 2004. Improved robustness of signature-based near-replica detection via lexicon randomization. Proceedings of ACM-SIGKDD (2004). Google ScholarDigital Library
- Lin, D. 1994 Principar - an efficient, broad-coverage, principle-based parser. Proceedings of COLING-94, pp. 42--48. Kyoto, Japan. Google ScholarDigital Library
- Pantel, Patrick and Dekang Lin 2002. Discovering Word Senses from Text. In Proceedings of SIGKDD-02, pp. 613--619. Edmonton, Canada Google ScholarDigital Library
- Rabin, M. O. 1981. Fingerprinting by random polynomials. Center for research in Computing technology, Harvard University, Report TR-15-81.Google Scholar
- Salton, G. and McGill, M. J. 1983. Introduction to Modern Information Retrieval. McGraw Hill. Google ScholarDigital Library
Recommendations
Randomized algorithms
This tutorial will illustrate the power of randomization in algorithms. The history of algorithms begins with deterministic algorithms. Deterministic algorithms are more natural, are usually easier to analyze, and are usually easier to understand by ...
IR meets NLP: On the Semantic Similarity between Subject-Verb-Object Phrases
ICTIR '15: Proceedings of the 2015 International Conference on The Theory of Information RetrievalMeasuring the semantic similarity between phrases and sentences is an important task in natural language processing (NLP) and information retrieval (IR). We compare the quality of the distributional semantic NLP models against phrase-based semantic IR. ...
Randomized algorithms for comparison-based search
NIPS'11: Proceedings of the 24th International Conference on Neural Information Processing SystemsThis paper addresses the problem of finding the nearest neighbor (or one of the R-nearest neighbors) of a query object q in a database of n objects, when we can only use a comparison oracle. The comparison oracle, given two reference objects and a query ...
Comments