ABSTRACT
This paper describes a method of comparing corpora which uses frequency profiling. The method can be used to discover key words that differentiate one corpus from another. Applied to annotated corpora, it can also discover key grammatical or word-sense categories. It offers a quick way in to the differences between corpora and is shown to have applications in the study of social differentiation in the use of English vocabulary, the profiling of learner English, and document analysis in the software engineering process.
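The idea of frequency profiling can be illustrated with a minimal sketch. The abstract does not name the statistic used to rank words, so the log-likelihood ratio (G2) below is an assumption, chosen because it is a common measure for this kind of two-corpus comparison. The function names `log_likelihood` and `key_words` are hypothetical, and the whitespace-tokenised word lists stand in for real corpora.

```python
import math
from collections import Counter


def log_likelihood(a, b, total_a, total_b):
    """Log-likelihood ratio (G2) for one word.

    a, b: observed counts of the word in corpus A and corpus B.
    total_a, total_b: total token counts of the two corpora.
    Assumption: G2 is used as the keyness measure.
    """
    # Expected counts if the word were equally frequent in both corpora.
    expected_a = total_a * (a + b) / (total_a + total_b)
    expected_b = total_b * (a + b) / (total_a + total_b)
    g2 = 0.0
    for observed, expected in ((a, expected_a), (b, expected_b)):
        if observed > 0:  # a zero count contributes nothing to the sum
            g2 += observed * math.log(observed / expected)
    return 2.0 * g2


def key_words(tokens_a, tokens_b):
    """Rank every word by how strongly its frequency differs between corpora."""
    freq_a, freq_b = Counter(tokens_a), Counter(tokens_b)
    total_a, total_b = len(tokens_a), len(tokens_b)
    scored = []
    for word in set(freq_a) | set(freq_b):
        score = log_likelihood(freq_a[word], freq_b[word], total_a, total_b)
        scored.append((word, freq_a[word], freq_b[word], score))
    # Highest score first: these words differentiate the corpora most.
    return sorted(scored, key=lambda row: row[3], reverse=True)


if __name__ == "__main__":
    corpus_a = ["tea"] * 8 + ["the"] * 12
    corpus_b = ["coffee"] * 2 + ["the"] * 18
    for word, count_a, count_b, score in key_words(corpus_a, corpus_b):
        print(f"{word:8s} {count_a:3d} {count_b:3d} {score:6.2f}")
```

With annotated corpora, the same ranking could be applied to part-of-speech or semantic tags in place of word forms, by passing lists of tags rather than lists of words.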
Comparing corpora using frequency profiling. In CompareCorpora '00: Proceedings of the Workshop on Comparing Corpora.