On the relation between automated essay scoring and modern views of the writing construct
Highlights
► Examines the construct measured by automated essay scoring (AES) systems.
► Establishes and explicates common criticisms of AES.
► Establishes limits of AES in assessing argumentation and rhetorical effectiveness.
► Explains the relationship between text production and cognitive resources.
► Establishes directions for future research through a socio-cognitive approach to AES.
Introduction
Automated essay scoring (AES) has been the subject of significant controversy: strong forces have encouraged its adoption in large-scale testing despite major reservations and persistent criticisms.
On the one hand, there is significant support for AES, as documented in Shermis and Burstein (2003) and Shermis and Burstein (in press). Arguments for the use of AES include reliability and ease of scoring, with AES scores correlating highly with human-scored holistic ratings (Burstein and Chodorow, 2003, Burstein and Chodorow, 2010, Chodorow and Burstein, 2004, Powers et al., 2001). AES systems have been adopted as a second or check scorer in high-volume, high-stakes programs such as the Graduate Record Examination (GRE®), the Test of English as a Foreign Language (TOEFL®, cf. Haberman, 2011), and the Graduate Management Admissions Test (GMAT®), among others; AES systems are the primary essay scoring engine in various assessment and instructional products, including Accuplacer®, the Criterion® Online Writing Evaluation Service, and the Pearson Test of English®. AES techniques have also been applied in languages other than English. This includes multilingual applications of the IntelliMetric system (Elliot, 2003) and other efforts to develop AES systems in various languages, including Chinese (Chang et al., 2006, Chang et al., 2007), French (Lemaire & Dessus, 2001), German (Wild, Stahl, Stermsek, Penya, & Neumann, 2005), Hebrew (Ben-Simon and Cohen, 2011, Cohen et al., 2003), and Spanish (Castro-Castro et al., 2008).
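The reliability claims above are typically reported as agreement statistics between machine and human scores, such as Pearson correlation, exact and adjacent agreement, and quadratic weighted kappa. The sketch below shows how such statistics can be computed; the score vectors are hypothetical and are not drawn from the studies cited.

```python
import numpy as np

def quadratic_weighted_kappa(a, b, min_score=1, max_score=6):
    """Quadratic weighted kappa between two integer score vectors."""
    a = np.asarray(a) - min_score
    b = np.asarray(b) - min_score
    n = max_score - min_score + 1
    # Observed and chance-expected score-pair frequency matrices (each sums to 1)
    observed = np.zeros((n, n))
    for i, j in zip(a, b):
        observed[i, j] += 1
    observed /= len(a)
    expected = np.outer(np.bincount(a, minlength=n),
                        np.bincount(b, minlength=n)) / len(a) ** 2
    # Quadratic disagreement weights: larger penalty for larger score differences
    idx = np.arange(n)
    weights = (idx[:, None] - idx[None, :]) ** 2 / (n - 1) ** 2
    return 1 - (weights * observed).sum() / (weights * expected).sum()

# Hypothetical human and machine scores on a 1-6 holistic scale (illustration only)
human   = np.array([4, 3, 5, 2, 4, 6, 3, 4, 5, 2])
machine = np.array([4, 3, 4, 2, 5, 6, 3, 4, 5, 3])

r = np.corrcoef(human, machine)[0, 1]            # Pearson correlation
exact = np.mean(human == machine)                # exact agreement
adjacent = np.mean(np.abs(human - machine) <= 1) # agreement within one score point
print(f"r = {r:.2f}, exact = {exact:.2f}, adjacent = {adjacent:.2f}, "
      f"QWK = {quadratic_weighted_kappa(human, machine):.2f}")
```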
On the other hand, there has been and continues to be significant and vocal opposition to AES, particularly when and where it might replace human scoring, a position elaborated by a variety of authors, including Ericsson and Haswell (2006), Herrington and Stanley (2012), and Perelman (2012). A major organization for writing professionals, the Conference on College Composition and Communication (CCCC), explicitly opposes the use of AES. Its position statement on Teaching, Learning, and Assessing Writing in Digital Environments (2004) states:
Because all writing is social, all writing should have human readers, regardless of the purpose of the writing. Assessment of writing that is scored by human readers can take time; machine-reading of placement writing gives quick, almost-instantaneous scoring and thus helps provide the kind of quick assessment that helps facilitate college orientation and registration procedures as well as exit assessments.
The speed of machine-scoring is offset by a number of disadvantages. Writing-to-a-machine violates the essentially social nature of writing: we write to others for social purposes. If a student's first writing-experience at an institution is writing to a machine, for instance, this sends a message: writing at this institution is not valued as human communication—and this in turn reduces the validity of the assessment. Further, since we cannot know the criteria by which the computer scores the writing, we cannot know whether particular kinds of bias may have been built into the scoring. And finally, if high schools see themselves as preparing students for college writing, and if college writing becomes to any degree machine-scored, high schools will begin to prepare their students to write for machines.
We understand that machine-scoring programs are under consideration not just for the scoring of placement tests, but for responding to student writing in writing centers and as exit tests. We oppose the use of machine-scored writing in the assessment of writing.
The articles in this special issue are contributions to a dialog about the role of automated writing analysis, a dialog whose extreme poles may be framed, on the one hand, by a position advocating unrestricted use of AES to replace human scoring and, on the other, by a position advocating complete avoidance of automated methods. However, it would be a mistake to focus only on the polar positions. Writing is a complex skill, assessed for various audiences and purposes. AES is one instantiation of a larger universe of methods for automatic writing analysis. A more nuanced view may be necessary to assess accurately how automated essay scoring—and other forms of automated writing evaluation—fit into education and assessment.
This article explores this middle space. In particular, the article considers how technological possibilities defined by automated scoring systems interact with well-articulated views of writing skill that incorporate modern research findings, whether from social or cognitive perspectives. Critical to this exploration—and thus to the argument to be explored—is the complexity of writing skill and the variety of automated methods that can be applied. Automated analysis can more appropriately be applied to some aspects of writing skill than to others. Some of the tensions inherent in the current situation may be resolved as we explicate this complexity and consider how these issues play out in particular scenarios.
It is important to consider the affordances created by current deployments of automated essay evaluation technologies and to analyze how these affordances interact with user requirements and the needs of assessment; it is equally important to recognize that AES technology is not static. We must also consider how the underlying technologies, appropriately configured, could support other use cases where there is less reason to postulate a conflict between writing viewed as a social, communicative practice and writing viewed as an automatized (and automatically scorable) skill.
Established practices: defining writing as a construct to be assessed
Historically, direct writing assessment in the North American context was plagued by an inability of raters to agree on common, consistently applied standards. Elliot (2005), citing, among others, Brigham (1933) and Noyes, Sale, and Stalnaker (1945), outlines some of these difficulties. Widespread adoption of direct writing assessment (White, 1985) depended upon the establishment of reliable holistic scoring methods (Godshalk, Swineford, and Coffman, 1966), which built in turn upon earlier
An example of the current state of the art: the e-rater scoring engine
The e-rater scoring engine is well documented (Attali and Burstein, 2006, Burstein et al., 2003a). Technical details differ among engines, so caution should be exercised in generalizing from it to other commercial AES systems, but based upon available overviews such as Shermis and Burstein (2003), it seems reasonable to assume that roughly similar sets of features and roughly comparable methods for calculating scores are implemented in the major AES systems.
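To make the general architecture concrete — extract a modest set of interpretable features from an essay and combine them through a regression model trained to reproduce human holistic scores — the following is a deliberately simplified sketch. The features, weights, and the tiny training set are illustrative placeholders, not e-rater's actual feature set, coefficients, or data.

```python
import re
import numpy as np

def extract_features(essay: str) -> np.ndarray:
    """Toy surface features loosely analogous to AES feature families
    (fluency/length, lexical complexity, sentence-level development).
    These are illustrative stand-ins, not e-rater's actual features."""
    words = re.findall(r"[A-Za-z']+", essay.lower())
    sentences = [s for s in re.split(r"[.!?]+", essay) if s.strip()]
    n_words = len(words) or 1
    return np.array([
        np.log(n_words),                          # essay length (log word count)
        sum(len(w) for w in words) / n_words,     # average word length
        len(set(words)) / n_words,                # type-token ratio (vocabulary diversity)
        n_words / max(len(sentences), 1),         # mean sentence length
    ])

# Hypothetical training set: essays paired with human holistic scores (1-6 scale)
essays = [
    "Short essay text with little development.",
    "A longer, more developed essay that elaborates its argument across sentences.",
    "A mid-length essay with some supporting detail.",
]
human_scores = np.array([2.0, 4.0, 3.0])

# Fit a linear model (least squares) mapping features to human scores,
# mirroring the regression-based score models used by engines of this kind
X = np.vstack([extract_features(e) for e in essays])
X = np.column_stack([np.ones(len(X)), X])         # add intercept term
weights, *_ = np.linalg.lstsq(X, human_scores, rcond=None)

new_essay = "An unseen essay to be scored by the trained model."
x = np.concatenate([[1.0], extract_features(new_essay)])
print(f"Predicted holistic score: {x @ weights:.2f}")
```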
Fig. 3, following the representation first
Construct representation and common criticisms of AES
As noted in the introduction, the use of AES in writing assessment has been the subject of controversy (Cheville, 2004, Ericsson, 2006, Haswell, 2006, Herrington and Moran, 2006, Jones, 2006, McGee, 2006). Three major kinds of issues have been raised:
- Fundamental objections focusing on how AES affects the testing situation, since the knowledge that essays will be machine-scored may lead to changes in behavior that undermine the intended construct.
- Fundamental objections to the construct measured or
A way forward: toward a socio-cognitive approach to automated writing analysis
One way to conceptualize the issues we have been exploring is that they involve a tension between ease of measurement and construct coverage. On the one hand, the kinds of features currently measured by AES systems focus on a restricted construct that emphasizes text production and applies it primarily to replicating human scoring in a high-stakes context. On the other hand, a richer conception of the writing construct is available from a variety of sources, one that takes into account both the
Note from the Editors of the Special Issue
This article is part of a special issue of Assessing Writing on the automated assessment of writing. The invited articles are intended to provide a comprehensive overview of the design, development, use, applications, and consequences of this technology. Please find the full contents list available online at: http://www.sciencedirect.com/science/journal/10752935.
Paul Deane received his Ph.D. in theoretical linguistics from the University of Chicago (1986). He has published on a variety of subjects including lexical semantics, language and cognition, computational linguistics, and writing assessment and pedagogy. His research focuses on vocabulary assessment, automated essay scoring, and innovative approaches to writing assessment.
References (102)
- et al. Genres of high stakes writing assessments and the construct of writing competence. Assessing Writing (2007).
- et al. Automated essay scoring: Writing assessment and instruction.
- Assessing the portfolio: Hamp-Lyons and Condon [Rev. of the book Assessing the portfolio: Principles for practice, theory, and research, by L. Hamp-Lyons & W. Condon]. Assessing Writing (2002).
- Closed systems and standardized writing tests. College Composition and Communication (2008).
- et al. Performance of a generic approach in automated essay scoring. Journal of Technology, Learning & Assessment (2010).
- et al. Automated essay scoring with e-rater v.2. Journal of Technology, Learning & Assessment (2006).
- Attali, Y., & Powers, D. (2008). A developmental writing scale (ETS Research Report RR-08-19). Princeton, NJ:...
- A validity-based approach to quality control and assurance of automated scoring. Assessment in Education: Principles, Policy & Practice (2011).
- et al. Portfolios: Process and product (1991).
- Moving the field forward: Some thoughts on validity and automated scoring.
- Automated scoring of constructed-response literacy and mathematics items.
- Toward more substantively meaningful automated essay scoring. Journal of Technology, Learning & Assessment.
- The psychology of written composition.
- New directions in portfolio assessment.
- Scoring constructed responses using expert systems. Journal of Educational Measurement.
- Automated essay scoring in the classroom: Finding common ground.
- Comparison of human and machine scoring of essays: Differences by gender, ethnicity and country. Applied Measurement in Education.
- The reading of the comprehensive examination in English: An analysis of the procedures followed during the five reading periods from 1929 to 1933.
- Organic writing assessment: Dynamic criteria mapping in action.
- Directions in automated essay scoring.
- Progress and new directions in technology for automated essay evaluation.
- Criterion: Online essay evaluation: An application for automated evaluation of student essays.
- Automated evaluation of discourse structure in student essays.
- Towards automatic classification of discourse elements in essays.
- Finding the WRITE stuff: Automatic identification of discourse structure in student essays. IEEE Intelligent Systems: Special Issue on Advances in Natural Language Processing.
- Using entity-based features to model coherence in student essays.
- A multilingual application for automated essay scoring. Advances in artificial intelligence. Lecture Notes in Computer Science.
- On issues of feature extraction in Chinese automatic essay scoring system.
- Enhancing automatic Chinese essay scoring system from figures-of-speech.
- Fluency in writing: Generating text in L1 and L2. Written Communication.
- The utility of article and preposition error correction systems for English Language Learners: Feedback and assessment. Language Testing.
- Automated scoring technologies and the rising influence of error. English Journal.
- Techniques for detecting syntactic errors in text. Technical Report of the Institute of Electronics, Information, and Communication Engineers.
- The future of portfolio-based writing assessment: A cautionary tale.
- Common European framework of reference for languages: Learning, teaching, assessment.
- Test validation.
- What automated analyses of corpora can tell us about students' writing skills. Journal of Writing Research.
- Factors in judgments of writing ability (ETS research bulletin RB-61-15).
- On a scale: A social history of writing assessment in America.
- IntelliMetric: from here to validity.
- The meaning of meaning: Is a paragraph more than an equation?