DOI: 10.3115/1117729.1117730

Comparing corpora using frequency profiling

Published: 7 October 2000

ABSTRACT

This paper describes a method of comparing corpora that uses frequency profiling. The method can be used to discover key words that differentiate one corpus from another. Applied to annotated corpora, it can also discover key grammatical or word-sense categories. It offers a quick way in to the differences between corpora and is shown to have applications in the study of social differentiation in the use of English vocabulary, the profiling of learner English, and document analysis in the software engineering process.
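
As a rough illustration of frequency profiling, the sketch below counts word frequencies in two tokenised corpora and ranks words by how strongly their frequencies differ. The abstract does not name the statistic used to rank key words; the log-likelihood ratio (G2) here is an assumption, chosen because it is a common keyness measure. The function names `log_likelihood` and `key_words` and the toy corpora are hypothetical.

```python
import math
from collections import Counter


def log_likelihood(a, b, n1, n2):
    """G2 score for an item seen a times in corpus 1 (n1 tokens)
    and b times in corpus 2 (n2 tokens).

    Assumption: G2 is used as the keyness statistic; the abstract
    itself does not specify one.
    """
    # Expected counts if the item were equally frequent in both corpora.
    e1 = n1 * (a + b) / (n1 + n2)
    e2 = n2 * (a + b) / (n1 + n2)
    ll = 0.0
    # A zero observed count contributes nothing (limit of x*log(x) is 0).
    if a > 0:
        ll += a * math.log(a / e1)
    if b > 0:
        ll += b * math.log(b / e2)
    return 2 * ll


def key_words(tokens1, tokens2, top=10):
    """Return the `top` items whose frequencies differ most between corpora.

    Each result is a tuple (score, item, count_in_1, count_in_2).
    """
    f1, f2 = Counter(tokens1), Counter(tokens2)
    n1, n2 = sum(f1.values()), sum(f2.values())
    scored = [
        (log_likelihood(f1[w], f2[w], n1, n2), w, f1[w], f2[w])
        for w in set(f1) | set(f2)
    ]
    # Highest score first; ties broken alphabetically for stable output.
    scored.sort(key=lambda t: (-t[0], t[1]))
    return scored[:top]


if __name__ == "__main__":
    corpus_a = "the cat sat on the mat".split()
    corpus_b = "the dog sat on the log".split()
    for score, word, a, b in key_words(corpus_a, corpus_b, top=4):
        print(f"{word}\t{a}\t{b}\t{score:.3f}")
```

The same functions work on any sequence of labels, so passing part-of-speech or semantic tags instead of word tokens would profile key grammatical or word-sense categories in annotated corpora.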

Published in: WCC '00: Proceedings of the workshop on Comparing corpora - Volume 9, October 2000, 49 pages

Publisher: Association for Computational Linguistics, United States
