Comparing corpora using frequency profiling

Authors:
Paul Rayson, Lancaster University, Lancaster, UK
Roger Garside, Lancaster University, Lancaster, UK

WCC '00: Proceedings of the workshop on Comparing corpora - Volume 9, October 2000, Pages 1–6. https://doi.org/10.3115/1117729.1117730

Published: 07 October 2000

ABSTRACT

This paper describes a method of comparing corpora which uses frequency profiling. The method can be used to discover key words that differentiate one corpus from another. Applied to annotated corpora, it can also discover key grammatical or word-sense categories. It offers a quick way in to the differences between the corpora, and is shown to have applications in the study of social differentiation in the use of English vocabulary, the profiling of learner English, and document analysis in the software engineering process.
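
As a rough illustration of frequency profiling, the sketch below builds a word-frequency list for each of two corpora and ranks words by how strongly their relative frequencies differ. The abstract does not name a scoring statistic, so the log-likelihood ratio (G2) used here is an assumption, as are the function names `log_likelihood` and `key_words`, the whitespace tokenisation, and the toy corpora.

```python
import math
from collections import Counter


def log_likelihood(a, b, n1, n2):
    """G2 score for a word seen a times in corpus 1 (n1 tokens)
    and b times in corpus 2 (n2 tokens). Assumed statistic, not
    necessarily the one used in the paper."""
    # Expected counts if the word had the same relative frequency in both corpora.
    e1 = n1 * (a + b) / (n1 + n2)
    e2 = n2 * (a + b) / (n1 + n2)
    ll = 0.0
    # A zero observed count contributes nothing (limit of x * log x as x -> 0).
    if a > 0:
        ll += a * math.log(a / e1)
    if b > 0:
        ll += b * math.log(b / e2)
    return 2.0 * ll


def key_words(tokens1, tokens2):
    """Return (word, freq1, freq2, score) tuples, highest score first."""
    f1, f2 = Counter(tokens1), Counter(tokens2)
    n1, n2 = len(tokens1), len(tokens2)
    scored = [
        (w, f1[w], f2[w], log_likelihood(f1[w], f2[w], n1, n2))
        for w in set(f1) | set(f2)
    ]
    # Sort by descending score; break ties alphabetically for a stable order.
    return sorted(scored, key=lambda row: (-row[3], row[0]))


if __name__ == "__main__":
    # Hypothetical toy corpora, tokenised on whitespace.
    corpus_a = "the cat sat on the mat the cat".split()
    corpus_b = "the dog sat on the log".split()
    for word, fa, fb, score in key_words(corpus_a, corpus_b):
        print(f"{word}\t{fa}\t{fb}\t{score:.3f}")
```

The same comparison could in principle be run over grammatical or word-sense tags instead of words by passing tag sequences as the token lists, which is one way to read the abstract's remark about annotated corpora.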

Published in
WCC '00: Proceedings of the workshop on Comparing corpora - Volume 9
October 2000
49 pages
Conference Chairs: Adam Kilgarriff (ITRI, University of Brighton), Tony Berber Sardinha (Catholic University of Sao Paulo, Brazil)
Publisher: Association for Computational Linguistics, United States