On the relation between automated essay scoring and modern views of the writing construct
Highlights
► Examines the construct measured by automated essay scoring (AES) systems.
► Establishes and explicates common criticisms of AES.
► Establishes limits of AES in assessing argumentation and rhetorical effectiveness.
► Explains the relationship between text production and cognitive resources.
► Establishes directions for future research through a socio-cognitive approach to AES.
Introduction
Automated essay scoring (AES) has been the subject of significant controversy: strong forces have encouraged its adoption in large-scale testing despite major reservations and persistent criticisms.
On the one hand, there is significant support for AES, as documented in Shermis and Burstein (2003) and Shermis and Burstein (in press). Arguments for the use of AES include reliability and ease of scoring, with AES scores correlating highly with human-scored holistic ratings (Burstein and Chodorow, 2003, Burstein and Chodorow, 2010, Chodorow and Burstein, 2004, Powers et al., 2001). AES systems have been adopted as a second or check scorer in high-volume, high-stakes programs such as the Graduate Record Examination (GRE®), the Test of English as a Foreign Language (TOEFL®, cf. Haberman, 2011), and the Graduate Management Admissions Test (GMAT®), among others; AES systems are the primary essay scoring engine in various assessment and instructional products, including Accuplacer®, the Criterion® Online Writing Evaluation Service, and the Pearson Test of English®. AES techniques have also been applied in languages other than English. This includes multilingual applications of the IntelliMetric system (Elliot, 2003) and other efforts to develop AES systems in various languages, including Chinese (Chang et al., 2006, Chang et al., 2007), French (Lemaire & Dessus, 2001), German (Wild, Stahl, Stermsek, Penya, & Neumann, 2005), Hebrew (Ben-Simon and Cohen, 2011, Cohen et al., 2003), and Spanish (Castro-Castro et al., 2008).
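The reliability claims above are typically reported as agreement statistics between machine and human scores, such as Pearson correlation, exact and adjacent agreement, and quadratic weighted kappa. The sketch below shows how such statistics can be computed; the score vectors are hypothetical and are not drawn from the studies cited.

```python
import numpy as np

def quadratic_weighted_kappa(a, b, min_score=1, max_score=6):
    """Quadratic weighted kappa between two integer score vectors."""
    a = np.asarray(a) - min_score
    b = np.asarray(b) - min_score
    n = max_score - min_score + 1
    # Observed and chance-expected score-pair frequency matrices (each sums to 1)
    observed = np.zeros((n, n))
    for i, j in zip(a, b):
        observed[i, j] += 1
    observed /= len(a)
    expected = np.outer(np.bincount(a, minlength=n),
                        np.bincount(b, minlength=n)) / len(a) ** 2
    # Quadratic disagreement weights: larger penalty for larger score differences
    idx = np.arange(n)
    weights = (idx[:, None] - idx[None, :]) ** 2 / (n - 1) ** 2
    return 1 - (weights * observed).sum() / (weights * expected).sum()

# Hypothetical human and machine scores on a 1-6 holistic scale (illustration only)
human   = np.array([4, 3, 5, 2, 4, 6, 3, 4, 5, 2])
machine = np.array([4, 3, 4, 2, 5, 6, 3, 4, 5, 3])

r = np.corrcoef(human, machine)[0, 1]            # Pearson correlation
exact = np.mean(human == machine)                # exact agreement
adjacent = np.mean(np.abs(human - machine) <= 1) # agreement within one score point
print(f"r = {r:.2f}, exact = {exact:.2f}, adjacent = {adjacent:.2f}, "
      f"QWK = {quadratic_weighted_kappa(human, machine):.2f}")
```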
On the other hand, there has been and continues to be significant and vocal opposition to AES, particularly when and where it might replace human scoring, a position elaborated by a variety of authors, including Ericsson and Haswell (2006), Herrington and Stanley (2012), and Perelman (2012). A major organization for writing professionals, the Conference on College Composition and Communication (CCCC), explicitly opposes the use of AES. Its position statement on Teaching, Learning, and Assessing Writing in Digital Environments (2004) states:
Because all writing is social, all writing should have human readers, regardless of the purpose of the writing. Assessment of writing that is scored by human readers can take time; machine-reading of placement writing gives quick, almost-instantaneous scoring and thus helps provide the kind of quick assessment that helps facilitate college orientation and registration procedures as well as exit assessments.
The speed of machine-scoring is offset by a number of disadvantages. Writing-to-a-machine violates the essentially social nature of writing: we write to others for social purposes. If a student's first writing-experience at an institution is writing to a machine, for instance, this sends a message: writing at this institution is not valued as human communication—and this in turn reduces the validity of the assessment. Further, since we cannot know the criteria by which the computer scores the writing, we cannot know whether particular kinds of bias may have been built into the scoring. And finally, if high schools see themselves as preparing students for college writing, and if college writing becomes to any degree machine-scored, high schools will begin to prepare their students to write for machines.
We understand that machine-scoring programs are under consideration not just for the scoring of placement tests, but for responding to student writing in writing centers and as exit tests. We oppose the use of machine-scored writing in the assessment of writing.
The articles in this special issue are contributions to a dialog about the role of automated writing analysis, a dialog whose extreme poles may be framed, on the one hand, by a position advocating unrestricted use of AES to replace human scoring and, on the other, by a position advocating complete avoidance of automated methods. However, it would be a mistake to focus only on the polar positions. Writing is a complex skill, assessed for various audiences and purposes. AES is one instantiation of a larger universe of methods for automatic writing analysis. A more nuanced view may be necessary to assess accurately how automated essay scoring—and other forms of automated writing evaluation—fit into education and assessment.
This article explores this middle space. In particular, the article considers how technological possibilities defined by automated scoring systems interact with well-articulated views of writing skill that incorporate modern research findings, whether from social or cognitive perspectives. Critical to this exploration—and thus to the argument to be explored—is the complexity of writing skill and the variety of automated methods that can be applied. Automated analysis can more appropriately be applied to some aspects of writing skill than to others. Some of the tensions inherent in the current situation may be resolved as we explicate this complexity and consider how these issues play out in particular scenarios.
It is important to consider the affordances created by current deployments of automated essay evaluation technologies and to analyze how these affordances interact with user requirements and the needs of assessment; it is equally important to recognize that AES technology is not static. We must also consider how the underlying technologies, appropriately configured, could support other use cases where there is less reason to postulate a conflict between writing viewed as a social, communicative practice and writing viewed as an automatized (and automatically scorable) skill.
Established practices: defining writing as a construct to be assessed
Historically, direct writing assessment in the North American context was plagued by an inability of raters to agree on common, consistently applied standards. Elliot (2005), citing, among others, Brigham (1933) and Noyes, Sale, and Stalnaker (1945), outlines some of these difficulties. Widespread adoption of direct writing assessment (White, 1985) depended upon the establishment of reliable holistic scoring methods (Godshalk, Swineford, and Coffman, 1966), which built in turn upon earlier
An example of the current state of the art: the e-rater scoring engine
The e-rater scoring engine is well documented (Attali and Burstein, 2006, Burstein et al., 2003a). Technical details differ among engines, so caution should be exercised in generalizing from it to other commercial AES systems, but based upon available overviews such as Shermis and Burstein (2003), it seems reasonable to assume that roughly similar sets of features and roughly comparable methods for calculating scores are implemented in the major AES systems.
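To make the general architecture concrete — extract a modest set of interpretable features from an essay and combine them through a regression model trained to reproduce human holistic scores — the following is a deliberately simplified sketch. The features, weights, and the tiny training set are illustrative placeholders, not e-rater's actual feature set, coefficients, or data.

```python
import re
import numpy as np

def extract_features(essay: str) -> np.ndarray:
    """Toy surface features loosely analogous to AES feature families
    (fluency/length, lexical complexity, sentence-level development).
    These are illustrative stand-ins, not e-rater's actual features."""
    words = re.findall(r"[A-Za-z']+", essay.lower())
    sentences = [s for s in re.split(r"[.!?]+", essay) if s.strip()]
    n_words = len(words) or 1
    return np.array([
        np.log(n_words),                          # essay length (log word count)
        sum(len(w) for w in words) / n_words,     # average word length
        len(set(words)) / n_words,                # type-token ratio (vocabulary diversity)
        n_words / max(len(sentences), 1),         # mean sentence length
    ])

# Hypothetical training set: essays paired with human holistic scores (1-6 scale)
essays = [
    "Short essay text with little development.",
    "A longer, more developed essay that elaborates its argument across sentences.",
    "A mid-length essay with some supporting detail.",
]
human_scores = np.array([2.0, 4.0, 3.0])

# Fit a linear model (least squares) mapping features to human scores,
# mirroring the regression-based score models used by engines of this kind
X = np.vstack([extract_features(e) for e in essays])
X = np.column_stack([np.ones(len(X)), X])         # add intercept term
weights, *_ = np.linalg.lstsq(X, human_scores, rcond=None)

new_essay = "An unseen essay to be scored by the trained model."
x = np.concatenate([[1.0], extract_features(new_essay)])
print(f"Predicted holistic score: {x @ weights:.2f}")
```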
Fig. 3, following the representation first
Construct representation and common criticisms of AES
As noted in the introduction, the use of AES in writing assessment has been the subject of controversy (Cheville, 2004, Ericsson, 2006, Haswell, 2006, Herrington and Moran, 2006, Jones, 2006, McGee, 2006). Three major kinds of issues have been raised:
- Fundamental objections focusing on how AES affects the testing situation, since the knowledge that essays will be machine-scored may lead to changes in behavior that undermine the intended construct.
- Fundamental objections to the construct measured or
A way forward: toward a socio-cognitive approach to automated writing analysis
One way to conceptualize the issues we have been exploring is that they involve a tension between ease of measurement and construct coverage. On the one hand, the kinds of features currently measured by AES systems focus on a restricted construct that emphasizes text production and applies it primarily to replicating human scoring in a high-stakes context. On the other hand, a richer conception of the writing construct is available from a variety of sources, one that takes into account both the
Note from the Editors of the Special Issue
This article is part of a special issue of Assessing Writing on the automated assessment of writing. The invited articles are intended to provide a comprehensive overview of the design, development, use, applications, and consequences of this technology. Please find the full contents list available online at: http://www.sciencedirect.com/science/journal/10752935.
Paul Deane received his Ph.D. in theoretical linguistics from the University of Chicago (1986). He has published on a variety of subjects including lexical semantics, language and cognition, computational linguistics, and writing assessment and pedagogy. His research focuses on vocabulary assessment, automated essay scoring, and innovative approaches to writing assessment.
References (102)
- et al. Genres of high stakes writing assessments and the construct of writing competence. Assessing Writing (2007).
- et al. Automated essay scoring: Writing assessment and instruction.
- Assessing the portfolio: Hamp-Lyons and Condon [Rev. of the book Assessing the portfolio: Principles for practice, theory, and research, by L. Hamp-Lyons & W. Condon]. Assessing Writing (2002).
- Closed systems and standardized writing tests. College Composition and Communication (2008).
- et al. Performance of a generic approach in automated essay scoring. Journal of Technology, Learning & Assessment (2010).
- et al. Automated essay scoring with e-rater v.2. Journal of Technology, Learning & Assessment (2006).
- Attali, Y., & Powers, D. (2008). A developmental writing scale (ETS Research Report RR-08-19). Princeton, NJ:...
- A validity-based approach to quality control and assurance of automated scoring. Assessment in Education: Principles, Policy & Practice (2011).
- et al. Portfolios: Process and product (1991).
- Moving the field forward: Some thoughts on validity and automated scoring.
- Automated scoring of constructed-response literacy and mathematics items.
- Toward more substantively meaningful automated essay scoring. Journal of Technology, Learning & Assessment.
- The psychology of written composition.
- New directions in portfolio assessment.
- Scoring constructed responses using expert systems. Journal of Educational Measurement.
- Automated essay scoring in the classroom: Finding common ground.
- Comparison of human and machine scoring of essays: Differences by gender, ethnicity and country. Applied Measurement in Education.
- The reading of the comprehensive examination in English: An analysis of the procedures followed during the five reading periods from 1929 to 1933.
- Organic writing assessment: Dynamic criteria mapping in action.
- Directions in automated essay scoring.
- Progress and new directions in technology for automated essay evaluation.
- Criterion: Online essay evaluation: An application for automated evaluation of student essays.
- Automated evaluation of discourse structure in student essays.
- Towards automatic classification of discourse elements in essays.
- Finding the WRITE stuff: Automatic identification of discourse structure in student essays. IEEE Intelligent Systems: Special Issue on Advances in Natural Language Processing.
- Using entity-based features to model coherence in student essays.
- A multilingual application for automated essay scoring. Advances in artificial intelligence. Lecture Notes in Computer Science.
- On issues of feature extraction in Chinese automatic essay scoring system.
- Enhancing automatic Chinese essay scoring system from figures-of-speech.
- Fluency in writing: Generating text in L1 and L2. Written Communication.
- The utility of article and preposition error correction systems for English Language Learners: Feedback and assessment. Language Testing.
- Automated scoring technologies and the rising influence of error. English Journal.
- Techniques for detecting syntactic errors in text. Technical Report of the Institute of Electronics, Information, and Communication Engineers.
- The future of portfolio-based writing assessment: A cautionary tale.
- Common European framework of reference for languages: Learning, teaching, assessment.
- Test validation.
- What automated analyses of corpora can tell us about students' writing skills. Journal of Writing Research.
- Factors in judgments of writing ability (ETS research bulletin RB-61-15).
- On a scale: A social history of writing assessment in America.
- IntelliMetric: from here to validity.
- The meaning of meaning: Is a paragraph more than an equation?