skip to main content
10.3115/1219840.1219917dlproceedingsArticle/Chapter ViewAbstractPublication PagesaclConference Proceedingsconference-collections
Article
Free Access

Randomized algorithms and NLP: using locality sensitive hash function for high speed noun clustering

Published:25 June 2005Publication History

ABSTRACT

In this paper, we explore the power of randomized algorithm to address the challenge of working with very large amounts of data. We apply these algorithms to generate noun similarity lists from 70 million pages. We reduce the running time from quadratic to practically linear in the number of elements to be computed.

References

  1. Banko, M. and Brill, E. 2001. Mitigating the paucity of dataproblem. In Proceedings of HLT. 2001. San Diego, CA. Google ScholarGoogle ScholarDigital LibraryDigital Library
  2. Box, G. E. P. and M. E. Muller 1958. Ann. Math. Stat. 29, 610--611.Google ScholarGoogle ScholarCross RefCross Ref
  3. Broder, Andrei 1997. On the Resemblance and Containment of Documents. Proceedings of the Compression and Complexity of Sequences. Google ScholarGoogle ScholarDigital LibraryDigital Library
  4. Cavnar, W. B. and J. M. Trenkle 1994. N-Gram-Based Text Categorization. In Proceedings of Third Annual Symposium on Document Analysis and Information Retrieval, Las Vegas, NV, UNLV Publications/Reprographics, 161--175.Google ScholarGoogle Scholar
  5. Charikar, Moses 2002. Similarity Estimation Techniques from Rounding Algorithms In Proceedings of the 34th Annual ACM Symposium on Theory of Computing. Google ScholarGoogle ScholarDigital LibraryDigital Library
  6. Church, K. and Hanks, P. 1989. Word association norms, mutual information, and lexicography. In Proceedings of ACL-89. pp. 76--83. Vancouver, Canada. Google ScholarGoogle ScholarDigital LibraryDigital Library
  7. Curran, J. and Moens, M. 2002. Scaling context space. In Proceedings of ACL-02 pp 231--238, Philadelphia, PA. Google ScholarGoogle ScholarDigital LibraryDigital Library
  8. Goemans, M. X. and D. P. Williamson 1995. Improved Approximation Algorithms for Maximum Cut and Satisfiability Problems Using Semidefinite Programming. JACM 42(6):1115--1145. Google ScholarGoogle ScholarDigital LibraryDigital Library
  9. Hindle, D. 1990. Noun classification from predicate-argument structures. In Proceedings of ACL-90. pp. 268--275. Pittsburgh, PA. Google ScholarGoogle ScholarDigital LibraryDigital Library
  10. Lin, D. 1998. Automatic retrieval and clustering of similar words. In Proceedings of COLING/ACL-98. pp. 768--774. Montreal, Canada. Google ScholarGoogle ScholarDigital LibraryDigital Library
  11. Indyk, P., Motwani, R. 1998. Approximate nearest neighbors: towards removing the curse of dimensionality Proceedings of 30thSTOC, 604--613. Google ScholarGoogle ScholarDigital LibraryDigital Library
  12. A. Kolcz, A. Chowdhury, J. Alspector 2004. Improved robustness of signature-based near-replica detection via lexicon randomization. Proceedings of ACM-SIGKDD (2004). Google ScholarGoogle ScholarDigital LibraryDigital Library
  13. Lin, D. 1994 Principar - an efficient, broad-coverage, principle-based parser. Proceedings of COLING-94, pp. 42--48. Kyoto, Japan. Google ScholarGoogle ScholarDigital LibraryDigital Library
  14. Pantel, Patrick and Dekang Lin 2002. Discovering Word Senses from Text. In Proceedings of SIGKDD-02, pp. 613--619. Edmonton, Canada Google ScholarGoogle ScholarDigital LibraryDigital Library
  15. Rabin, M. O. 1981. Fingerprinting by random polynomials. Center for research in Computing technology, Harvard University, Report TR-15-81.Google ScholarGoogle Scholar
  16. Salton, G. and McGill, M. J. 1983. Introduction to Modern Information Retrieval. McGraw Hill. Google ScholarGoogle ScholarDigital LibraryDigital Library

Recommendations

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Sign in
  • Published in

    cover image DL Hosted proceedings
    ACL '05: Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics
    June 2005
    657 pages
    • General Chair:
    • Kevin Knight

    Publisher

    Association for Computational Linguistics

    United States

    Publication History

    • Published: 25 June 2005

    Qualifiers

    • Article

    Acceptance Rates

    ACL '05 Paper Acceptance Rate77of423submissions,18%Overall Acceptance Rate85of443submissions,19%

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader