Introduction

As part of instruction in many subject-matter areas, students are often asked to demonstrate their understanding by responding to open-ended questions. In science, students may be asked to learn about the causes of phenomena such as volcanic eruptions, ice ages, El Niño, skin cancer, coral bleaching, or global warming, so that they might construct mental models of how or why these things happen. From a Socratic perspective, one ideal educational context for this learning would be a 1:1 teacher-to-student ratio, where each student could articulate their understanding to an instructor in a face-to-face setting, and the instructor could give the student feedback on their mental models, help them repair or remediate their misconceptions, and prompt them to be more coherent, complete, or focused in their responses. Yet the realities of instruction are far from this ideal. Our public educational system does not have the resources to provide 1:1 human tutoring for all students in all subjects all of the time. As a much more feasible alternative, student understanding is often assessed by closed-ended tests that only require recognition or verification of ideas on the part of the student and can be easily scored. Another alternative for assessment is asking students to demonstrate their understanding in writing, by composing responses to open-ended questions, including explanations of how or why things happen. Because of the importance of developing student skills in written communication and explanation, and because prompting students to articulate an explanation can provide a sensitive measure of student understanding, explanation essays are a valuable way of assessing student learning. However, these explanations still need to be evaluated for their quality. Developing automated coding systems that can recognize the quality of student understanding in written responses following reading assignments is one possible way to close the teacher-to-student-ratio gap. New technologies offer the promise of better individualized assessment, which may allow for tailored feedback and support during the learning process and, in turn, may ultimately support better student performance and understanding.

A substantial body of work has explored hand coding and automated coding for the quality of student writing in response to composition prompts (Crossley et al. 2015; Crossley and McNamara 2010, 2011; McNamara et al. 2015). In this research, students are asked to write expositions on themes or persuasive essays on a topic. The main goal of this research has been finding reliable predictors for the quality of writing as assessed by independent expert raters (Huot 1996). Students are not given texts to read or specific content to learn, but rather are asked to expound upon a topic based on their prior knowledge and opinion. This closely mimics what students experience in the classroom as well as outside of the classroom as part of placement and exit assessments for writing skills. This type of skills assessment is quite a different enterprise than using students’ written responses to evaluate the quality of their understanding of a topic from a learning activity or from a particular set of readings. Findings from prior studies that have been concerned with predicting perceived writing quality in student compositions may or may not be relevant for predicting student understanding from written responses. At present, there is much less work that has explored the features of student writing that are predictive of their understanding of a topic. Correspondingly, there has been a recent push to consider the disciplinary context and the goals of the written product as part of the assessment process (Ferris 2007; Huot 1996; Sommers 2008).

In the present work, we describe the results from a variety of approaches that were used to evaluate the quality of explanations that were written as part of a multiple-document inquiry unit on global warming, where students were tasked with understanding how and why recent patterns of average global temperature differ from those seen in the past. In overview, the main goal was to develop and compare various approaches to assessing the quality of the mental models that students had constructed from the reading activity, by coding responses to an open-ended explanation essay prompt, and using test scores on a closed-ended comprehension test as the criterion measure of their understanding of the material. The writing activity did not involve a general assessment of writing quality. The coding attempts ranged from identification of specific sets of attributes present in the explanations (e.g., concepts from a causal model), to more global or holistic evaluations of explanation quality (e.g., causal language), and from hand-coded scores to attempts to automate the scoring process using both existing technologies (LSA and Coh-Metrix) as well as a tailored machine learning approach specific to this inquiry task. The main question of interest is which approaches to coding the explanations would best capture the quality of each student’s understanding of the subject matter.

We first present results of prior studies that have used various methods of hand-scoring of explanation essays to provide background for the coding and analyses that are employed in the current study. Next, we review prior work using existing out-of-the-box technologies (LSA and Coh-Metrix) to outline how those systems may be used for automatic detection of the hand-coded features. Before presenting the results of both hand coding and LSA/Coh-Metrix measures in terms of simple correlations between features of the essays and student understanding as measured by the criterion test of understanding (the closed-ended comprehension test), we provide relevant details about the sample and methods. This sets the stage for the main analyses, in which regressions are used to examine which coded features of the written responses best predict student understanding. Finally, we describe a machine-learning approach that was developed to capture information about the arguments that students wrote for this specific inquiry activity and document set, and the extent to which it and other automated approaches can be used in combination to best predict explanation quality and student understanding.

Background

Hand Scoring of Explanation Quality and Student Understanding

Prior work done specifically on learning from multiple-document reading and writing activities has examined the extent to which students transform the original text when they are asked to write a response to an inquiry question and whether they attempt to develop integrated causal models as part of understanding the readings (Britt and Aglinskas 2002; Voss and Wiley 1997, 2000; Wiley 2001; Wiley and Voss 1996, 1999). Several general aspects of students’ responses have been considered in prior work: the organization or structure of their answers, the selection of the information that is included in the answers, and the integration or transformation of that information. Specific analyses of students’ written responses have included the following features: (a) the length of the response; (b) references to sources; (c) the organization or macrostructure of the response in relation to the prompt (i.e., listing of ideas versus analytical essay, use of evidence to support a claim); (d) the completeness of the account (i.e., the extent to which idea units mentioned in the document set are included in students’ essays, or key concepts from the causal model); and (e) the integration and transformation of information within the account (i.e., number of causal connections or connectives present in the essays; proportion of sentences taken directly or paraphrased from sources, versus transformed or completely novel information). These features are the attributes that are focused on in the present work.

One frequently analyzed aspect of written responses is their length. Essay length is generally operationalized as the number of words or number of sentences, and may positively predict essay quality as this feature can signify more complete understanding (Page 1994). However, length is not always an indicator of better understanding, especially when students are asked to summarize rather than just recall or report what they have read (Wade-Stein and Kintsch 2004). We would expect that students who write very short explanations will be unlikely to provide coverage of the causal model in their essay, but it is unclear whether student understanding will always positively correlate with essay length.

Other work has been concerned with whether students explicitly cite sources in the essays (Britt and Aglinskas 2002; Rouet et al. 1996, 1997), include information from many documents (Britt et al. 2004), or use information from multiple texts to support their claims on a controversy (Rouet et al. 1996). The presence of citations when writing from multiple documents in history is usually related to better quality essays (e.g., Britt and Aglinskas 2002). We included this feature in the hand coding to examine the extent to which citations would predict understanding from this science unit.

Another aspect of written essays that has been explored in prior work is the organization or top-level structure (cf. Wiley and Voss 1999). Using Meyer’s (1985) taxonomy, essays can be classified as either having a collective structure (that is, the essay consists of a listing of ideas with minimal focus) or a more analytic or causal structure (having a main claim, thesis, or conclusion, with information organized in relation to that main claim). Studies have found that students who demonstrate better understanding of the material on comprehension tests write essays that are more likely to have an analytic or causal macrostructure (Voss and Wiley 1997, 2000; Wiley 2001; Wiley and Voss 1999). Because in the present study students were explicitly prompted to write an explanation about how and why recent patterns of temperature differ from the past, examining the macrostructure provides a measure of whether students attempted to write an essay that directly answered the question.

To code for coverage, researchers may engage in a discourse analysis of the original reading material to identify a finite set of idea units that are present (Perfetti et al. 1995; Rouet et al. 1996; Wiley and Voss 1999). Alternatively, researchers may identify a set of core causal concepts or a subset of idea units that are most important or critical for developing an appropriate mental model of the phenomenon (Griffin et al. 2012; Jaeger and Wiley 2015; Sanchez and Wiley 2006, 2009, 2010, 2014; Wiley et al. 2009, 2011). Sometimes the concepts from the a priori causal model are further differentiated into proximal versus distal causes (Wiley et al. 2014) and often codes are created to document the number of misconceptions or erroneous causes included in the essay (Wiley et al. 2009, 2011, 2014). Other coverage codes can identify non-central content including discussion of background information or non-essential details (irrelevant elaborations) as part of the essay (Perfetti et al. 1995; Wiley et al. 2014). Coverage (number of overall idea units) generally does not predict learning, but significant correlations are typically observed between comprehension test scores and coverage of the key ideas identified as part of an a priori causal model (Wiley et al. 2009, 2011). Negative correlations can be seen when essays include misconceptions (Hemmerich and Wiley 2002). In the present study, the inclusion of key ideas from the causal model can represent an index of the quality of a student’s mental model, and we would expect it to predict performance on the comprehension test.

The final dimension of essay quality considered in this study builds on the work of Scardamalia and Bereiter (1987) and Spivey (1990), who made a distinction between knowledge-telling and knowledge-transforming when students compose essays to demonstrate their understanding of a topic. Telling is regarded as a passive transfer of information from text to paper, whereas transformation is regarded as a more active and constructive process in which the writer relates the contents of sources in new ways by making novel connections within source material, as well as connections to the writer’s knowledge. Knowledge-telling involves a relatively superficial interaction with the text, whereas knowledge-transforming involves more active construction of a mental model from the text contents. Several measures have been developed with the goal of assessing the extent to which students attempt to integrate and transform information as they write. One measure has been the incidence of connections and connectives included in each essay (Britt and Aglinskas 2002; Voss and Wiley 1997; Wiley and Voss 1999). This serves as an index of the extent to which students attempt to connect or integrate ideas, rather than just reporting what they read. Students who demonstrate better understanding of the material on comprehension tests tend to write essays that have more connected ideas, and more causal connections (Voss and Wiley 1997, 2000; Wiley 2001; Wiley and Voss 1999).

Another measure of integration and transformation (based on Greene 1994) considers the origin of information included in each sentence of an essay. In this approach, each sentence is first scored as to whether it contains a connection between idea units that were presented in the reading materials. This measure represents the extent to which students recognize possible relations among factors. The connections that are generally included in this analysis are attributions, correlations, temporal links, simple conjunctions, and causal links. Ideas that co-occur in the same sentence, even without a connective term, can also be coded as connected. These sentences show that the reader has connected and integrated information within a sentence. This is similar to coding for the incidence of connections, but done on a per-sentence basis. The content of each sentence is then classified into one of three categories: transformed, added, or borrowed (Wiley and Voss 1996, 1999). Sentences that combine some presented information with a new claim or fact, or that integrate two bits of presented information that were not previously connected, are classified as transformed. A sentence is coded as added when it contains only novel information. Sentences that are taken directly from, or are paraphrased from, the original material are classified as borrowed. Students who demonstrate better understanding of the material on comprehension tests write essays that contain a lower proportion of borrowed or copied sentences (Voss and Wiley 1997, 2000; Wiley 2001; Wiley and Voss 1999). Thus, in the present study we would expect the number of connections that students include in their essays to positively predict understanding, while borrowing or copying of information might be a negative predictor.

Automated Scoring of Explanation Quality and Student Understanding from Existing Technologies

Given the kinds of features of student explanations that have been explored using hand scoring, an obvious question is whether there might be existing technologies that can provide automated metrics for each of them. One simple approach used in many automatic scoring approaches has been to use the length of the essay as a measure. Length is easily obtained from automated systems, as well as from basic text editors and word processing programs. Another measure that can be easily automated using simple pattern matching approaches is computing the frequencies of citations or references to documents (Britt et al. 2004; Foltz et al. 1996).

The more difficult features to automatically generate are those that attempt to capture the quality of student explanations, especially in terms of their macrostructure or causal structure. Much of the previous work that has attempted to detect student understanding of subject matter from written responses has been done within Intelligent Tutoring Systems (ITS) such as AutoTutor and MetaTutor. In these cases, students provide written responses as part of a tutoring dialogue, and the goal of assessment is determining which feedback or instructional scaffolds should be given to the tutee by the ITS (e.g. Graesser et al. 2000, 2005; Lintean et al. 2011). In this work, a common assessment method has been to assess the similarity of each student response to a set of idealized target responses using Latent Semantic Analysis (LSA, Landauer et al. 1998). This type of approach generally does well at identifying the content material present in written responses. In the right type of discourse context, feedback based on this automated assessment of similarity to idealized responses can be very effective for helping students learn (VanLehn et al. 2007). It must be noted, however, that work done in these contexts generally requires students to only write very short responses (a word, phrase or sentence), so the difficulties of identifying larger elements of structure from the responses do not apply.

There have been some attempts to use LSA and ITS methods with essays and longer texts. For instance, LSA has been used to analyze the quality of student understanding by comparing student essays to expert essays (Foltz et al. 1996), or to sentences judged important by experts (Foltz et al. 1996), or to idealized peer essays (Ventura et al. 2004). A similar approach is attempted here using an idealized peer essay. This essay is referred to as idealized to emphasize that the text we use for automatic similarity scoring in LSA is not an actual essay written by an individual student, but a compilation essay made by combining several peer responses that provide full coverage of an a priori model. In this way, LSA can be used to provide an index for the quality of a student’s mental model, and this index should positively predict student understanding as assessed by performance on the comprehension test.

Similarly, LSA can be used to assess the amount of transformation present in student essays by directly comparing student essays to the source documents that they read (Britt et al. 2004; Foltz et al. 1996). In these studies, student sentences that had an LSA cosine with a source sentence above an empirically determined threshold were identified as borrowed, copied or plagiarized unless proper citation was detected (using pattern matching). This approach also enabled calculation of a “coverage” score: the extent to which the student essay said something like what was in each of the source documents, using a lower cosine threshold (Hastings et al. 2012).

Coh-Metrix (McNamara et al. 2014) is another system that has been previously employed in attempts to evaluate student writing, primarily with the goal of assessing the composition quality of persuasive essays written in response to SAT-style prompts such as “Do images and impressions have a positive or negative effect on people?”. In a study attempting to identify which Coh-Metrix indices best predict expert evaluations of student writing, Crossley and McNamara (2011) reported that both essay length and lexical diversity were positive predictors. Students who wrote longer essays and used more diverse vocabulary were given better scores by human raters. Further, these authors have also found that similarity among adjacent sentences and the presence of causal language can serve as negative predictors of expert ratings for composition quality (Crossley and McNamara 2010). However, these Coh-Metrix indices (lexical diversity, similarity among sentences, and presence of causal language) may predict student understanding differently than they predict expert ratings of writing quality.

As noted above, prior work predicts that positive relations may be found between these indices (similarity and causal language) and comprehension test performance, because the extent to which students integrate ideas across sentences and use causal connectives has already been shown to relate to better student understanding (e.g. Wiley and Voss 1999). Similarly, although using diverse vocabulary may relate to higher grades on writing assignments similar to those given in an English composition class, the use of too many different terms in response to an inquiry activity for learning in science may mean that a student is not focusing on creating a coherent explanation from the sources. This suggests that lexical diversity might not have a strong positive relationship with understanding in this context. One could imagine that too little lexical diversity could be a sign of poorly developed explanations, but that too much diversity could be a sign of a lack of coherence or lack of focus on the inquiry goal of explaining a particular phenomenon or outcome using a particular set of source documents. Lexical diversity has been suggested to reflect the coherence of a writer’s mental representation about an event or topic (Pennebaker 1993; Wade-Stein and Kintsch 2004). That is, a person writing coherently about a single topic will use more of the same words than someone whose writing is more scattered. Thus, in the present study, to the extent that these indices relate to integration and transformation of information, and represent a focus on the development of a causal mental model of the topic, we would expect lexical diversity to negatively predict student understanding, and similarity and causal language to positively predict understanding.

Although LSA and Coh-Metrix may both provide some useful indices of writing quality, in general it has been suggested that generic cohesion-based approaches without grounding in content have not fared as well as more specific content-based approaches (Graesser and McNamara 2012; Magliano and Graesser 2012). Because of this, one might assume that out-of-the-box analyses using LSA or Coh-Metrix will be unable to capture students’ mental models very well, and that a specially tailored machine-learning approach is likely to be needed to robustly predict student understanding from this inquiry activity. However, rather than making this assumption, the current study first tested the extent to which these available technologies might be able to detect student understanding from the written responses, before proceeding to develop and test a machine-learning approach.

Methods

Learning Context and Learning Outcome Measures

The dataset consisted of 178 explanation essays generated by middle school and high school students who learned about the causes of global warming as part of a multiple-document inquiry task, and who also completed the learning outcome measure following reading and writing. Students were asked to write an essay “explaining how and why recent patterns in global temperature are different from what has been observed in the past.” All participants were given a set of 7 documents containing information related to the causes of global temperature change. Five text-based documents covered several main topics including Ice Ages, the Carbon Cycle, the Greenhouse Effect, Solar Radiation, and Energy from Fossil Fuels. The document set also included a graph of CO2 concentrations over the last 400,000 years, presented as its own document. In addition, students were provided with a seventh document, titled “Changes in Global Temperatures”, which provided textual background on the methods used to assess global temperatures. This document also included a graph of average global temperatures over the last 400,000 years, and a second graph showing the increases in average global temperatures from 1870 to 2010. The texts were excerpted from several online sources from the United States Geological Survey, the Public Broadcasting Service, the NASA Earth Observatory, and the Environmental Protection Agency, as well as an extension module from an earth science textbook series (Bennington 2009). On average, the text-based documents were 326 words long (range: 208–475), with an average Flesch Reading Ease of 62.36 and an average Flesch-Kincaid grade level of 7.9.

The document set was designed to include all information necessary to construct a coherent representation of the topic, based on an a priori causal model, but the text set also required integration of ideas across documents in order to achieve an understanding and answer the question of “how and why recent patterns in global temperature are different from what has been observed in the past”. Figure 1 gives a graphical representation of the concepts that were available to explain recent changes in average global temperatures (the target outcome is represented by the parallelogram). Each of the documents contained information that contributed to the creation of this causal model of global warming. No single document contained all the necessary information. Further, none of the documents directly addressed the inquiry question. Each addressed a different specific issue (e.g., CO2 trapping heat; or human contributions to CO2) that can be made relevant to the inquiry question only when repurposed and combined with information from other documents by the reader.

Fig. 1 Concepts available in the text related to causes of recent changes in global temperatures

After reading the documents and writing their essays (with documents present), the documents were collected and students completed an inference verification test (IVT). The IVT was intended to assess the mental model of the causes of global warming that students constructed while engaging in the multiple-document inquiry activity. This learning outcome assessment was based on techniques developed by Royer and his colleagues (Royer et al. 1996). The test contained 18 statements that represented potential conclusions, connections, or inferences that could (or could not) be made based on the information in the document set and were consistent with the a priori model shown in Fig. 1. In this test, students needed to verify whether propositions followed (or did not follow) from the information contained in the documents. Some example items are “In the past 100 years, both fossil fuel use and CO2 levels have increased” and “Increases in fossil fuel use increase the amount of heat that escapes into space.” The first sentence is an example of a conclusion that is supported by the documents, but was not explicitly stated and requires the reader to make connections across documents to verify. The second is an example of a conclusion that was the opposite of a relation that could be inferred based on the documents. An overall proportion correct score was computed for the task, and higher levels of performance indicated better understanding of the conclusions that could be drawn by integrating ideas across the documents. Previous work has shown that performance on inference verification tasks reliably correlates with other measures of understanding including the quality of students’ written explanations (Griffin et al. 2012; Jaeger and Wiley 2015; Sanchez and Wiley 2006; Wiley and Voss 1999; Wiley et al. 2009).

Hand Scoring of Explanation Essays

The explanation essays derived from this inquiry activity were hand-coded using two different systems. The first system coded the explanation essays from the perspective of the a priori causal model in Fig. 1 and gave students credit whenever they made causal connections between the nodes in Fig. 1. (For similar work using this approach see Griffin et al. 2012). The second system was focused upon the argument structure present within each student’s essay and scored the essays for the presence of causal chains of ideas that culminated in recent changes in global temperatures. (For similar work using this approach see Hastings et al. 2014). Since both of these approaches are based on the a priori causal model created by the document set, only ideas and concepts present in the documents received credit for each approach. Both approaches used two independent raters to code the explanation essays. Raters first scored a small subset of the essays on their own (typically about 12–15 examples) and compared their responses. Following discussion, the remainder of the essays was independently scored. Interrater reliabilities computed with Cohen’s Kappa were above .80 for all coded measures. Skewness and kurtosis for all measures reported below are less than 1.

Using the a priori causal model, humans evaluated the explanations to identify which causal concepts were present (nodes in Fig. 1) and which were explicitly linked to each other.

Of the core concepts included in the a priori causal model used to create the document set (MODELCONC), students generally mentioned fewer than 5 in their explanations (M = 4.37, SD = 2.48). A subscore (TARGCONC) was computed based on the presence of just the 5 critical target concepts that most directly related to recent changes in global temperature as highlighted in Fig. 1. Of these 5 target concepts, students mentioned only an average of 1.35 (SD = 1.21) in their explanations. The number of explicit connections that students made among the concepts was also coded (MODELCONN). On average, students made 1.70 (SD = 1.71) explicit connections between concepts.

The second coding scheme coded the written explanations for the argument structure present in each essay by exploring the number and length of causal chains that students developed against the idealized causal chain shown in Fig. 2. The target outcome is represented by a parallelogram and there are several paths that can be connected to explain this outcome. Students’ explanation essays were scored for the length and number of explanatory chains. We first scored the overall number of propositions or elements (PROPS) that were included in each explanation essay (rectangles in Fig. 2). On average, about one-third of these elements were mentioned (M = 6.58, SD = 2.91). Then we counted the number of connections along causal chains (LINKS) that were connected to the target outcome (M = 2.61, SD = 2.38) and the number of chains that included intervening connections between initiating factors and the target outcome (CHAINS) (M = 0.70, SD = 0.71). The one exception to chain coding was that a chain that was explicitly described in the text (the fossil fuel chain represented by 0-3-50) was coded separately (EASYCHAIN).

Fig. 2 Example argument structure in student essay linked to outcome

Finally, a holistic code was developed to categorize the explanations into five hierarchical levels of quality (EXPQUAL): (1) No core content (did not include any elements) (N = 8), (2) No causal chains (included core elements but did not connect to the outcome) (N = 32), (3) No intervening factors (connected at least one element to the outcome directly, but had no chains of connections to the outcome) (N = 38), (4) Simple intervening factor (EASYCHAIN: the only argument chain was the one that was stated explicitly in the text: from increased factories/vehicles/technology to increased use of fossil fuels) (N = 14), (5) Advanced intervening factor (at least one causal chain with at least one intervening element other than the “easy” chain mentioned above) (N = 86). The No core content responses failed to identify any important information that could be connected to the outcome. The No causal chain responses included elements that could be part of the explanation, but did not make it clear that the element was leading to the outcome by articulating an explicit connection. Almost 23% of the students did not create a minimally connected explanation. No intervening factors responses were also very common. In these explanations, at least one element was directly connected to the outcome. Often 2 or more distinct chains were present, but these chains were not connected to each other. A common example of the argument structure present in this type of response was the student asserting that increased fossil fuel use leads to increased temperatures, that increased CO2 in the atmosphere leads to increased temperatures, and that increased temperatures are due to more heat being trapped, but these three relations were stated separately rather than linked together by the student. The final two levels represented explanations that included at least one chain with an initiating cause connected to an intervening factor that was then connected to the outcome. One chain (from increasing vehicles, factories, and technology to the increase in fossil fuel use to global warming) was treated separately because the links between these were included in the documents, and therefore the student did not need to construct this argument. Because we were interested in identifying transformation, we separated out those explanations that did not require much text-based transformation because they included only this chain, from those that involved more active transformation of the sources. A similar hierarchical approach has been used in a study on another scientific topic, coral bleaching (Hastings et al. 2016). In that study, the researchers found that the four quality categories used were associated with learning, using scores on a multiple-choice test as the comparison measure. The lowest quality group had significantly lower learning (32%) than the middle two quality categories (47%, 52%), which were each lower than the highest quality group (63%). These results show that the quality categorization scoring has utility as a measure of learning.
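To make the level assignments concrete, the sketch below shows one way the EXPQUAL hierarchy could be computed from the chain-level codes. It is a minimal illustration under our own assumptions about tie-breaking; the argument names mirror the hand-coded measures, but the function is not the authors' scoring script.

```python
def explanation_quality(props: int, links: int, chains: int, easychain: bool) -> int:
    """Assign a holistic EXPQUAL level (1-5) from chain-level codes.

    A sketch of the rule hierarchy described in the text; props, links,
    chains, and easychain correspond to PROPS, LINKS, CHAINS, and EASYCHAIN,
    but the exact boundary conditions are assumptions.
    """
    if props == 0:
        return 1  # No core content: no causal elements mentioned at all
    if links == 0:
        return 2  # No causal chains: elements present but never tied to the outcome
    if chains == 0 and not easychain:
        return 3  # No intervening factors: direct links to the outcome only
    if easychain and chains == 0:
        return 4  # Simple intervening factor: only the text-stated fossil fuel chain
    return 5      # Advanced intervening factor: at least one constructed chain
```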

Results

Basic Descriptive Measures

On average, the explanations that students wrote were around four paragraphs (M = 4.12, SD = 1.75), or 21.12 (SD = 8.32) sentences and 313.33 (SD = 108.44) words long. The three measures of response length (words, sentences, and paragraphs) were all related (rs > .50, ps < .01). To avoid multicollinearity issues, only the smallest grain size for length (words) is included in tables and analyses (LENGTH). Only a third of the students included at least one reference to the documents to cite the source of their evidence or ideas in their responses (N = 60/178, SOURCE). A simple macrostructure code categorized whether the student answer directly responded to the prompt (ANSWERQ). This code indicated whether students attempted to write an explanation about how and why recent patterns in global temperature are different from what has been observed in the past (1), or whether they did something else (0), such as write their opinion about global warming or simply list ideas from the text without using those ideas to try to answer the question directly. The majority of students were coded as attempting to address the essay prompt in their explanations (N = 142/178).

Predicting Explanation Quality from Hand-Scoring Approaches

One main purpose of this study was to test which of the various approaches to coding the explanations might best predict explanation quality and performance on tests of student understanding. The simple correlations among the basic descriptive and hand-scored metrics and their ability to predict the holistic explanation quality scores (EXPQUAL) and performance on the test of understanding (the IVT) are shown in Table 1. Table 1 shows that the number of words (LENGTH) in each explanation was a significant predictor of both explanation quality (EXPQUAL) and comprehension test scores (IVT).

Table 1 Correlations among descriptive measures, hand-scored measures of explanation quality, and learning outcomes

The presence of references to the documents or citations of the sources of the documents as part of the essay (SOURCE) was not associated with explanation essay quality or test scores. If anything, readers who were more likely to refer to the documents when writing about this science topic were less likely to focus on the most important information (TARGCONC). While the presence of citations when writing from multiple documents in history is usually related to better quality essays (e.g., Britt and Aglinskas 2002), this may be due to the important role that sources play in evaluating historical documents (e.g., perspective, bias, time, culture; see Rouet et al. 1996), or to cases where a document set includes opposing theories or discrepancies (Bråten et al. 2009; Rouet et al. 1996), neither of which was the case in this particular activity.

The overall macrostructure of the response (whether it attempted to answer the question by providing an explanation; ANSWERQ) was also a significant predictor of both explanation quality and test scores.

In the simple correlations shown in Table 1, all three measures derived from the coding based in the a priori causal model (MODELCONC, TARGCONC, MODELCONN) predicted both explanation quality and comprehension test scores. When measures listed in Table 2 were entered into a simultaneous regression to test for unique predictors of explanation quality, the overall model provided a good fit, F(4, 173) = 39.05, MSE = 0.96, p < .001, and accounted for 47% of the variance in explanation quality. Attempting to answer the question (ANSWERQ), including target concepts (TARGCONC), and making connections (MODELCONN), all accounted for unique variance. (MODELCONC was not in the model to avoid multicollinearity due to its high correlation with TARGCONC). Essay length (LENGTH) was not a significant unique predictor of explanation quality when included in a model with these other measures.
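For readers who wish to reproduce analyses of this kind, the simultaneous (forced-entry) regressions reported here can be run with standard OLS software. The sketch below uses statsmodels; the file name and column labels are hypothetical stand-ins for the hand-coded measures, not the authors' actual data files.

```python
import pandas as pd
import statsmodels.api as sm

# essays.csv is a hypothetical file with one row per essay and columns
# named after the hand-coded measures described in the text.
df = pd.read_csv("essays.csv")

# Simultaneous regression predicting holistic explanation quality from the
# a priori causal-model codes plus ANSWERQ and LENGTH. TARGCONC is used
# instead of MODELCONC to avoid multicollinearity.
predictors = df[["ANSWERQ", "TARGCONC", "MODELCONN", "LENGTH"]]
X = sm.add_constant(predictors)          # add intercept term
model = sm.OLS(df["EXPQUAL"], X).fit()   # ordinary least squares fit
print(model.summary())                   # F, R-squared, and coefficient tests
```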

Table 2 Holistic explanation quality scores as predicted by codes for a priori causal model

As shown in Table 1, significant correlations were also seen between holistic explanation quality scores and the measures derived from the analysis of the arguments present in the student explanation essays (PROPS, LINKS, EASYCHAIN, CHAINS). When the measures in Table 3 were entered into a simultaneous regression, the overall model provided a good fit of the data, F(4, 173) = 97.68, MSE = .56, p < .001, and accounted for 69% of the variance in explanation quality scores. All measures were unique predictors of explanation quality. The strong prediction of these codes for explanation quality would be expected because the holistic explanation quality score was computed based on these measures.

Table 3 Holistic explanation quality scores as predicted by codes for essay argument structure

Predicting Student Understanding from Hand-Scoring Approaches

The next question was to what extent the two sets of hand-coded measures would uniquely predict student understanding. When the measures in Table 2 were entered into a simultaneous regression to test for unique predictors of comprehension scores (IVT), the best fitting model included only the number of target concepts (TARGCONC) and number of connections between concepts (MODELCONN). This model, shown in Table 4, provided a good fit for the data, F(2, 179) = 43.19, MSE = .01, p < .001, and accounted for 33% of the variance in test scores.

Table 4 Comprehension test scores (IVT) as predicted by codes for a priori causal model

When the measures derived from the analysis of the arguments present in the student explanation essays (Table 3) were entered into a simultaneous regression to test for unique predictors of comprehension test scores (IVT), the best fitting model included the number of propositions and number of links. The model shown in Table 5 was a good fit of the data, F(3, 174) = 14.70, MSE = .02, p < .001, and accounted for 20% of the variance in the test scores. The number of links and the presence of the chain that was present in the text were significant unique predictors of IVT scores, while the number of propositions in the written argument was marginal.

Table 5 Comprehension test scores (IVT) as predicted by codes for essay argument structure

These analyses suggest that measures of coverage of ideas in essays, and the extent to which ideas are connected or integrated, are critical features for predicting understanding.

Automated Scoring of Explanation Features Using LSA/Coh-Metrix

The second phase of analyses attempted to automatically capture these critical features of the student explanations that emerged from hand coding. In a first pass at automated scoring, we attempted to leverage existing technologies including using LSA to assess the similarity of the explanations with an idealized peer explanation to generate a coverage score; LSA to assess similarity of the explanations to the source material to create a plagiarism or copying score; and Coh-Metrix to assess cohesion, causality, and lexical diversity of the explanations.

Idealized Peer Explanation Similarity Scores

One LSA approach compared student explanations to an idealized peer explanation (i.e., an explanation constructed by the researchers from the best student responses). The idealized student explanation is included in the Appendix. The approach of using an idealized student explanation essay rather than explanation essays written by experts for comparison was based on Ventura et al. (2004), who reported better prediction from peer-based examples. The idealized explanation was assembled from the two best student essays such that it would score highly on all of the hand-coded features that predicted learning outcomes. We verified that the LSA similarity scores with the idealized essay correlated well with the hand-coded measures (.39 with TARGCONC, .45 with MODELCONN, .67 with PROPS, .41 with LINKS). LSA was used to compare all student explanations for similarity to the idealized peer explanation essay using the whole-essay-to-whole-essay comparison tool at lsa.colorado.edu. As shown in Table 6, similarity to the idealized explanation essay (IDEAL) predicted both holistic explanation quality scores (EXPQUAL) and the learning outcome measure (IVT).
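The published analyses relied on the whole-essay comparison tool at lsa.colorado.edu and its TASA semantic space. Purely as an illustration of the underlying computation, the sketch below builds an approximate LSA space with scikit-learn and scores each essay against the idealized peer essay; the background corpus, dimensionality, and resulting cosines are stand-ins and will not match the reported values.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

def ideal_similarity(essays, ideal_essay, background_corpus, dims=300):
    """Approximate LSA similarity of each essay to an idealized peer essay.

    The original analyses used lsa.colorado.edu's TASA space; here a
    TF-IDF + truncated SVD space is built from a background corpus as a
    stand-in, so exact values will differ.
    """
    vectorizer = TfidfVectorizer()
    X = vectorizer.fit_transform(background_corpus)               # term-document matrix
    svd = TruncatedSVD(n_components=min(dims, X.shape[1] - 1)).fit(X)  # latent space

    essay_vecs = svd.transform(vectorizer.transform(essays))
    ideal_vec = svd.transform(vectorizer.transform([ideal_essay]))
    return cosine_similarity(essay_vecs, ideal_vec).ravel()       # one IDEAL score per essay
```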

Table 6 Correlations among LSA, Coh-Metrix, and machine-learning measures, with learning outcomes and explanation quality

Plagiarism Scores

Based on previous work (Britt et al. 2004), LSA similarity scores were computed between student explanations and the original sources to estimate how much of each student’s essay was copied directly from the source documents (and therefore not transformed). Using lsa.colorado.edu’s TASA “General_Reading_up_to_12th_Grade (300 factors)” document space comparison, we computed the cosine between each student sentence and each of the source document sentences. When cosines were above .75, we considered that sentence as copied from the original source (Britt et al. 2004). On average, 32% (SD = .20) of sentences in student explanation essays appeared to be copied from the sources. Table 6 shows the simple relations between the LSA-based COPY score, explanation quality and student learning, including a significant negative correlation with explanation essay quality. This suggests that argument quality suffers as students fail to transform information as they write. However, in this case, no overall negative effect of copying (COPY) was seen on the learning outcome measure (IVT).
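Under the same caveat, a minimal sketch of the sentence-level copy score is shown below. It assumes LSA sentence vectors (for example, produced by a space like the one sketched above, rather than the TASA space actually used) and applies the .75 cosine threshold from Britt et al. (2004).

```python
from sklearn.metrics.pairwise import cosine_similarity

COPY_THRESHOLD = 0.75  # cosine above which a sentence is treated as copied (Britt et al. 2004)

def copy_score(essay_sentence_vecs, source_sentence_vecs):
    """Proportion of essay sentences whose best match to any source sentence
    exceeds the copy threshold.

    Both arguments are assumed to be matrices of LSA sentence vectors
    (one row per sentence); the original study computed cosines with
    lsa.colorado.edu's TASA document space instead.
    """
    sims = cosine_similarity(essay_sentence_vecs, source_sentence_vecs)
    best_match = sims.max(axis=1)                  # best cosine per essay sentence
    return float((best_match > COPY_THRESHOLD).mean())
```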

As shown in Table 7, entering the similarity to the idealized explanation essay score (IDEAL) along with the plagiarism score (COPY) predicted 41% of the variance in the explanation quality scores (EXPQUAL), F(2, 175) = 61.20, MSE = 1.06, p < .001. When entered together in this simultaneous regression, similarity to the idealized explanation essay (IDEAL) positively predicted explanation quality (EXPQUAL), while similarity to source documents (COPY) negatively predicted explanation quality. For the learning outcome measure, adding similarity to source documents (COPY) to a model with the IDEAL scores did not improve the fit. IDEAL scores predicted learning at r = .40 as shown in Table 6, meaning that they accounted for 16% of the variance in comprehension test scores.

Table 7 Holistic explanation quality scores as predicted by similarity to idealized essay and source documents

Interestingly, even though the plagiarism scores (COPY) and the similarity to the idealized explanation essay scores (IDEAL) predicted explanation quality in opposite directions, the two were found to be positively related to each other (Table 6). This positive relation suggests that students who were copying individual sentences were generally selecting relevant content to transcribe into their explanations, which may explain why copying did not have a negative relation with learning. Although actively transforming information may be the best strategy for understanding, selecting and copying relevant information seems likely to be better than writing irrelevant information or failing to engage with the text at all. Also, since none of the documents specifically provided an answer for the essay question, even copying isolated sentences entailed some level of repurposing of information.

Coh-Metrix Indices for Causality, Cohesion and Lexical Diversity

The explanations were also submitted to Coh-Metrix as an automated approach to scoring the extent to which the essays integrated and transformed information into a coherent essay. In particular, we used SMCAUSvp (the incidence of causal verbs and causal particles) as a measure of causality (reported as CAUSAL in the table). This was motivated by earlier work that found that students who demonstrate better understanding of the material on comprehension tests tend to write essays that have more connected ideas, and more causal connections (Britt and Aglinskas 2002; Voss and Wiley 1997, 2000; Wiley 2001; Wiley and Voss 1999). For this measure, a higher score means a higher incidence of causal terms in the essays. (Results are similar for SMCAUSv and SMCAUSp. We selected SMCAUSvp because hand scoring used both verbs and particles. The other SMCAUS measures did not predict learning outcomes.) As a measure of cohesion, we used standardized scores for LSAPP1 (LSA similarity among adjacent paragraphs, reported as COH in the table). This was supplemented by standardized scores for LSASS1 for the 21 essays that were only one paragraph long (instead of using 0 scores for LSAPP1 as assigned by Coh-Metrix). For this LSA measure, a higher score means more similarity across parts of the response, representing more cohesion. Finally, we also explored the lexical diversity of all words using LDTTRa (LEXDIV in the table), as another potential measure of cohesion or focus. A higher LEXDIV score means that the response contained a broader range of vocabulary, while a lower LEXDIV score means that a more restricted range of words was used. The CAUSAL scores for student responses ranged from 19.87 to 138.89 (M = 52.71, SD = 16.18). The LSAPP1 scores that were used to compute COH ranged from .09 to .70 (M = .33, SD = .12). LEXDIV scores ranged from .30 to .75 (M = .49, SD = .07). The relations among the measures are shown in Table 6. Consistent with the notion that too much lexical diversity can be a sign of a lack of focus or coherence in explanation essays, there was a significant negative relation between the cohesion index (COH) and lexical diversity (LEXDIV).
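As one concrete illustration of that substitution, the sketch below standardizes LSAPP1 and swaps in standardized LSASS1 for one-paragraph essays. The DataFrame column names are our own labels for Coh-Metrix output, and standardizing each index over the full sample is an assumption rather than the authors' documented procedure.

```python
import pandas as pd

def build_coh(cm: pd.DataFrame) -> pd.Series:
    """Combine Coh-Metrix indices into the COH measure described in the text.

    cm is assumed to hold one row per essay with columns LSAPP1 (adjacent-
    paragraph LSA similarity), LSASS1 (adjacent-sentence LSA similarity), and
    PARAGRAPHS (paragraph count); these column labels are illustrative.
    """
    z = lambda s: (s - s.mean()) / s.std()          # z-standardize over the sample
    coh = z(cm["LSAPP1"])
    one_par = cm["PARAGRAPHS"] == 1                 # Coh-Metrix assigns LSAPP1 = 0 here
    coh.loc[one_par] = z(cm["LSASS1"])[one_par]     # substitute sentence-level cohesion
    return coh
```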

The correlations in Table 6 also showed that scores of explanation quality (EXPQUAL) and comprehension test scores (IVT) both increased with cohesion (COH). Comprehension test scores (IVT) increased with the number of causal expressions (CAUSAL), while lexical diversity (LEXDIV) was found to be negatively related to the holistic scores of explanation quality (EXPQUAL). Although the significant relation between causality (CAUSAL) and learning (IVT) replicates prior work using hand coding for causal expressions (Britt and Aglinskas 2002; Voss and Wiley 1997, 2000; Wiley 2001; Wiley and Voss 1999), the magnitudes are modest compared to the relations seen with hand coding for connections in Table 1 (cf. MODELCONN and LINKS). One possible reason for this difference is that the set of verbs and particles that Coh-Metrix uses to code for causality might be limited to a set of generic causal terms (change, cause, enable, and make), while the terms used to code causal connections during hand coding (MODELCONN and LINKS) included more context-specific causal terms, for example “increases”, “helps”, “intensifies”, “traps”, “heats” and “melts”. Also, Coh-Metrix counts causal connectives that are completely redundant with others, as well as causal terms that are used in expressions unrelated to expressing causal relations (e.g., “We need to change our way of life.”, “This makes sense.”), which could weaken the relation. The predictions from cohesion scores (COH) and lexical diversity scores (LEXDIV) were also modest in magnitude (rs between −.22 and .26).

When the Coh-Metrix measures were submitted to a simultaneous regression predicting explanation quality, only lexical diversity was found to be a unique predictor as shown in Table 8. This model was significant, F(3, 173) = 4.67, MSE = 1.68, p < .01, but predicted only 8% of the variance in explanation quality. On the other hand, causality and cohesion were significant predictors of comprehension test scores as shown in Table 9. Again this model was significant, F(3, 177) = 6.44, MSE = .02, p < .001, but predicted only 10% of the variance in test scores.

Table 8 Holistic explanation quality scores as predicted by Coh-Metrix indices
Table 9 Comprehension test scores (IVT) as predicted by Coh-Metrix indices

Best Fitting Model from Out-of-the-Box Approaches

The simple correlations among the indices derived from the three approaches are shown in Table 6. The IDEAL scores positively related to the cohesion measure derived from Coh-Metrix and negatively related to the lexical diversity measure. Submitting the idealized peer explanation to Coh-Metrix showed that it had above average cohesion, COH = .57, and below average lexical diversity, LEXDIV = .38, compared to the corpus. Both of these features could reflect the focus of the idealized essay on explaining a particular topic, which results in restricted vocabulary usage and overlap among sentences. The idealized explanation essay also had a CAUSAL score of 59 which was slightly above average. This is consistent with prior work showing that causal connections in explanation essays are generally a positive predictor of student understanding.

When the LSA measures (COPY and IDEAL) and the three Coh-Metrix measures were included in a simultaneous regression to predict explanation quality scores, none of the Coh-Metrix measures captured any unique variance. The best fitting model was the model shown in Table 7 which predicted 41% of the variance in the explanation quality scores. In contrast, when the measures derived from both LSA and Coh-Metrix were included in a simultaneous regression to predict student understanding, all measures except the COPY score were found to be significant unique predictors, as shown in Table 10. The model provided a good fit for the data, F(5, 171) = 10.25, MSE = .02, p < .001, and predicted 23% of the variance in comprehension test scores.

Table 10 Comprehension test scores (IVT) as predicted by LSA/Coh-Metrix indices

As in the previous analyses, similarity to the idealized peer explanation (IDEAL), incidence of causal terms (CAUSAL), and cohesion (COH) were all positive predictors of test scores. However, in this combined analysis, lexical diversity was now also a positive predictor of test scores. Follow-up analyses indicated this was due to the addition of the similarity to the idealized explanation scores (IDEAL) to the model. In the presence of this measure (along with measures of cohesion and causality), the use of a broader range of vocabulary emerged as a positive predictor.

In sum, attempts to use out-of-the-box tools were to some extent successful, as the models based on metrics derived from automated LSA and Coh-Metrix scores predicted a significant amount of the variance for each outcome measure. However, the fit and amount of variance explained for each outcome were clearly inferior in magnitude to the best fit from the models based on measures derived from human coding in the previous section.

Automated Scoring Using Machine Learning

Although LSA and Coh-Metrix did provide some useful indices of writing quality, our next step was to explore a more specific content-based automated scoring approach (Graesser and McNamara 2012; Magliano and Graesser 2012). To provide an example of such an approach, prior work on the dialog-based ITS mentioned above, MetaTutor, has also included a task in which students wrote a paragraph describing their existing knowledge on a topic (Lintean et al. 2011). The researchers compared three methods for identifying students’ mental models of a topic from the paragraphs: content-based measures derived from LSA, cohesion-based measures from Coh-Metrix, and word-weighting features specially derived from their corpus of paragraphs. They found that the word-weighting features outperformed the other approaches. To provide another example, researchers have found that neither general indicators of reading strategies nor indicators of textual complexity were effective at predicting 3rd–5th graders’ comprehension of stories, but a machine-learning approach using a combination of some of these features was effective (Dascalu et al. 2015).

While many Automated Essay Scoring (AES) systems have been developed to provide a more efficient evaluation of student writing (e.g. Larkey and Croft 2003; Shermis and Hamner 2012), these systems use a wide variety of generic lexical, syntactic, and semantic features (e.g. Deane 2013; Roscoe et al. 2014) to provide holistic, summative evaluations of student essays. Yet they have been criticized for failing to accurately judge the relevance and appropriateness of student responses (Dikli 2006) and for their lack of construct validity (Condon 2013; Roscoe et al. 2014). Some recent research has addressed the construct validity issue for students’ persuasive essays by detecting statements of opinion to provide a holistic measure of persuasive essay quality (Farra et al. 2015). Other recent research has focused on providing formative assessment, using logs of keyboard activity while students were writing (Zhang and Deane 2015). However, to our knowledge, no one else has tried to use a machine learning algorithm to assess the causal structure of student arguments or explanations in order to serve as an assessment of student understanding.

Developing a system that can identify the causal structure of any text is very difficult. Working with newspaper texts, Rink et al. (2010) tried to develop a system that could identify causal relations between events using a wide range of linguistic resources and techniques, including part-of-speech tagging, syntactic parsing, WordNet (Miller 1995), VerbOcean for semantic links between verbs (Chklovski and Pantel 2004), dependency parsing, word sense disambiguation (Mihalcea and Csomai 2005), and a semantic parser for identifying the semantic frame (Bejan and Hathaway 2007). With all of these techniques combined, their system achieved Precision = 0.33, Recall = 0.61, and F1 = 0.43. By including manual annotation of the temporal relations between events in the text, they increased performance to F1 = 0.58. Thus, creating a system that may be able to detect the causal structure present in student essays in order to measure student understanding represents a central challenge and goal for the present work.

In this section we explore the utility of using a machine-learning (ML) approach for assessing the quality of student explanation essays, and of using metrics produced by the ML approach to predict student understanding. The process involved three main steps. First, we trained ML models on the annotated explanation essays (annotated with the codes in Fig. 2) to identify each individual concept code. Second, the identified concepts were used to train models that identified the existence of components of causal connections, and of the specific causal connections between pairs of concepts. Finally, we used the same rule-based process that the human coders did to calculate each essay’s holistic explanation quality score. The subsections below describe the first two steps.

Concept Detection

For concept detection, we treated the problem similarly to a tagging problem like part-of-speech tagging. We preprocessed the explanation essays by doing spelling correction and stemming. We did not remove stop words. We did replace unique words (words which appeared only once in the entire corpus, most often because they were badly misspelled) with a special UNKNOWN token. Because these words occurred only once, the system could not learn useful information from them anyway. The UNKNOWN token occurs frequently enough that it does not carry strong semantic content for the system, similar to words like “a” or “the”. Then we applied our machine learning approach with a 7-word sliding window across the text to identify concepts within that window (Hughes et al. 2015). The fixed-size sliding window approach allows us to avoid the difficulties for machine learning from variable-length input, but the size of the window ensures that the words of a concept will almost always fall entirely within one of the windows (Hughes et al. 2015). For each of the concept codes (the nodes in Fig. 2), we trained a logistic regression classifier in which the features were the words and the bigrams within the window, as well as their relative positions within the window.

For example, consider the student sentence, “Factories began to burn large amounts of fossil fuels to create energy.” The word “factories” was coded as Concept 0 by the human annotators. For the classifier that predicts Concept 0, the first sliding window across this sentence would include 13 features. The first 7 would be word-based features signifying that the stem of the target word was “factory”, that the following word stems were “begin”, “to”, and “burn”, and that the three positions before the target word fell before the start of the sentence (and so were filled with padding tokens). There would also be 6 bigram features, including “START-factory -1”, “factory-begin +0”, “begin-to +1”, and so on, where the numbers represent the bigram’s relative position within the window. This set of features would be a positive example of the target class (Concept 0). The next window would have “began” as its central target word, and would be a negative example of Concept 0.
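As a rough illustration of this feature scheme, the sketch below builds position-indexed word and bigram features for each 7-word window and trains one binary logistic regression classifier per concept code. It assumes scikit-learn; the helper names, feature-string format, and padding token are our own illustrative choices, not the authors' implementation.

```python
# Minimal sketch of the window-based concept tagger described above.
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

WINDOW = 7            # 3 words of context on each side of the target word
HALF = WINDOW // 2
PAD = "START"         # padding token for positions outside the sentence


def window_features(stems, center):
    """Word and bigram features, each keyed by its position in the window."""
    feats = {}
    window = [stems[center + i] if 0 <= center + i < len(stems) else PAD
              for i in range(-HALF, HALF + 1)]
    # Word features: one per window position, relative to the target word.
    for pos, stem in zip(range(-HALF, HALF + 1), window):
        feats[f"w[{pos}]={stem}"] = 1
    # Bigram features: adjacent pairs within the window, also position-keyed.
    for pos in range(-HALF, HALF):
        feats[f"b[{pos}]={window[pos + HALF]}-{window[pos + HALF + 1]}"] = 1
    return feats


def train_concept_classifier(sentences, labels):
    """One binary classifier for one concept code.
    sentences: list of stemmed-token lists; labels: parallel 0/1 lists
    marking whether each target word carries that code."""
    X, y = [], []
    for stems, tags in zip(sentences, labels):
        for i, tag in enumerate(tags):
            X.append(window_features(stems, i))
            y.append(tag)
    vec = DictVectorizer()
    clf = LogisticRegression(max_iter=1000)
    clf.fit(vec.fit_transform(X), y)
    return vec, clf


# Tiny usage example with toy data; 1 marks target words carrying the
# (hypothetical) Concept 0 code.
sents = [["factori", "begin", "to", "burn", "fossil", "fuel"],
         ["burn", "fossil", "fuel", "releas", "co2"]]
tags = [[1, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0]]
vec, clf = train_concept_classifier(sents, tags)
print(clf.predict(vec.transform([window_features(sents[0], 0)])))
```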

We trained and evaluated the classifiers for the word-level tagging task using 5-fold cross-validation, with 80% of the explanation essays used for training and the remaining 20% as the test set (repeated for each of the 5 test sets). This produced a classification for each concept code, and the results showed that the classification was quite reliable. For the entire set of explanation essays, the macro-averaged Precision was 0.77, Recall was 0.71, and F1 was 0.74.
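A minimal sketch of this kind of evaluation, assuming scikit-learn, is shown below. Because the splits are made by essay rather than by window, a grouped 5-fold split is used; the toy data and variable names are purely illustrative, not the study's data or code.

```python
# Sketch of grouped 5-fold cross-validation with macro-averaged metrics.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GroupKFold, cross_validate

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 20))             # stand-in for vectorized windows
y = (X[:, 0] + rng.normal(size=500) > 0)   # stand-in for one concept code
essay_id = rng.integers(0, 50, size=500)   # which essay each window came from

scores = cross_validate(
    LogisticRegression(max_iter=1000), X, y,
    groups=essay_id, cv=GroupKFold(n_splits=5),
    scoring=("precision_macro", "recall_macro", "f1_macro"))
for name in ("test_precision_macro", "test_recall_macro", "test_f1_macro"):
    print(name, round(scores[name].mean(), 3))
```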

Detecting Connections

In contrast with Hastings et al. (2014), which determined concepts directly from hand-coded essays, in this study we used the output of the automatic concept detection described above as the input for the automated detection of causal connections.

Because causal connections between concept codes generally span a wider stretch of text, we cannot use the same type of sliding window method to identify them automatically. Instead, we trained a higher-level classifier that used as inputs the results of the concept tagging along with three other tags learned by the window-based tagger: one for connectors (e.g. “because of”, “as a result”) that the coders had annotated in the text, and two for concept codes that had been marked as causers and as results. The second-level classifiers also used additional features derived from the results of the first-level classifiers: the minimum and maximum probabilities for each predicted label (code, causer tag, result tag, cause-effect tag), the binary yes/no prediction for each label, and a binary combination prediction for each pair of codes identified in the sentence. These features were used to train a logistic regression classifier for each causal connection between two concepts that occurred in the training essays.
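The sketch below illustrates how per-sentence features of this kind might be assembled from the first-level outputs; the dictionary layout, the 0.5 decision threshold, and the label names are our assumptions, not the authors' implementation. The resulting feature dictionaries would then be vectorized and fed to one logistic regression classifier per causal connection, as in the concept-level sketch above.

```python
# Minimal sketch of the second-level feature construction described above.
from itertools import combinations


def sentence_features(label_probs, threshold=0.5):
    """label_probs maps each first-level label (concept codes such as "C0",
    plus causer/result/connector tags) to its window-level probabilities
    within one sentence."""
    feats = {}
    present = []
    for label, probs in label_probs.items():
        feats[f"min({label})"] = min(probs)
        feats[f"max({label})"] = max(probs)
        predicted = max(probs) >= threshold
        feats[f"pred({label})"] = int(predicted)
        if predicted and label.startswith("C"):   # concept codes only
            present.append(label)
    # Binary feature for each pair of concept codes predicted in the sentence.
    for a, b in combinations(sorted(present), 2):
        feats[f"pair({a},{b})"] = 1
    return feats


# Example: hypothetical first-level outputs for one three-window sentence.
example = {
    "C0": [0.91, 0.12, 0.05],
    "C3": [0.08, 0.77, 0.30],
    "causer": [0.66, 0.10, 0.02],
    "result": [0.04, 0.09, 0.58],
    "connector": [0.02, 0.71, 0.05],
}
print(sentence_features(example))
```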

Again we assessed this method using 5-fold cross-validation. On this task, the classifiers achieved macro-averaged Precision = 0.64, Recall = 0.40, and F1 = 0.49. Although this level of prediction accuracy is not as strong as that for the concepts alone, this is unsurprising because the connection classifiers rely on the outputs of the first-level concept predictions. The task is also considerably more complex: instead of “just” trying to predict the 19 codes from Fig. 2, we potentially had to distinguish between 19 × 19 connections (although only 34 combinations actually appeared in the responses).

Using the machine-learned concept coding (MLCODES), connection coding (MLCONN), detection of the easy chain (MLEASY), and detection of other chains to the outcome (MLCHAINS), we then computed predictions for the overall quality of the explanations (MLQUAL). To assign explanation essays to the appropriate categories, we applied the same criteria used to compute the holistic explanation quality scores in the hand coding.
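Purely as an illustration of what such a rule-based scorer looks like, the sketch below assigns a holistic category from chain and connection counts. The category labels and thresholds are hypothetical placeholders; the actual rubric is the hand-coding rubric described earlier in the paper.

```python
# Illustrative (hypothetical) rule-based holistic scorer.
def holistic_quality(has_easy_chain, n_other_chains, n_connections):
    if has_easy_chain and n_other_chains >= 1:
        return 3   # e.g. fully integrated explanation
    if has_easy_chain or n_connections >= 2:
        return 2   # e.g. partially connected explanation
    if n_connections >= 1:
        return 1   # isolated causal link(s)
    return 0       # no causal structure detected


# MLEASY, MLCHAINS, and MLCONN from the machine-learned detectors would be
# plugged in here in place of the hand-coded values.
print(holistic_quality(has_easy_chain=True, n_other_chains=2, n_connections=5))
```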

Predicting Explanation Quality and Test Scores with Machine Learning

The ML approach used as its input explanation essays that had been annotated with structure codes. The predictions derived from the ML approach correlated very highly with hand scoring for the number of propositions or elements in the arguments (r = .71, p < .001), and moderately for the number of links (r = .35, p < .001), the easy chain (r = .33), and other chains (r = .25).

Table 11 shows the results of a simultaneous regression predicting explanation quality (EXPQUAL) using metrics derived from all automated approaches (ML, LSA and Coh-Metrix). This model predicted 49% of the variance in the hand-coded holistic explanation quality scores (EXPQUAL), F(9, 167) = 18.11, MSE = .95, p < .001. Only the number of codes that were detected by the ML approach (MLCODES), the two LSA scores (COPY and IDEAL), and the Coh-Metrix cohesion score (COH) were significant unique predictors.
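For readers unfamiliar with this kind of analysis, the sketch below shows how a simultaneous (ordinary least squares) regression of this form can be fit, assuming statsmodels and pandas. The data frame is synthetic, only a subset of the nine predictors is included, and only the column names echo the metrics discussed in the text.

```python
# Sketch of a simultaneous regression like the one reported in Table 11.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 177  # consistent with the reported degrees of freedom, F(9, 167)
df = pd.DataFrame({
    "MLCODES": rng.poisson(8, n),
    "MLCONN": rng.poisson(3, n),
    "COPY": rng.uniform(0, 1, n),
    "IDEAL": rng.uniform(0, 1, n),
    "COH": rng.normal(0, 1, n),
})
# Synthetic outcome, purely so the example runs end to end.
df["EXPQUAL"] = (0.2 * df.MLCODES - 1.0 * df.COPY + 2.0 * df.IDEAL
                 + 0.5 * df.COH + rng.normal(0, 1, n))

model = smf.ols("EXPQUAL ~ MLCODES + MLCONN + COPY + IDEAL + COH", df).fit()
print(model.summary())   # R^2, F statistic, and per-predictor coefficients
```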

Table 11 Hand-coded holistic explanation quality scores as predicted by all automated approaches

The Coh-Metrix cohesion score (COH) seems to have captured connections between sentences better than the connection measure derived from the ML approach (MLCONN), which failed to predict explanation quality. This lack of prediction suggests there is still more work to be done to automatically detect and identify relations within student arguments. In many cases, students used vague anaphoric references across sentences, as well as explicit marking of rhetorical structure in earlier sentences. The human coders were able to use both of these features to determine structural relationships in the hand coding, but the ML approach is not currently able to use either source of information to classify the structure and connections that may be present in student arguments. This also limited its ability to detect chains in the explanation essays.

Table 12 shows the results of a simultaneous regression predicting student understanding using metrics derived from all automated approaches (ML, LSA and Coh-Metrix). The model shown in Table 12 was a good fit for the data, F(8, 168) = 6.80, MSE = .02, p < .001, and it predicted 25% of the variance in test scores. Of the new ML measures, the estimated number of propositions (MLCODES) was found to contribute unique variance.

Table 12 Comprehension test scores (IVT) as predicted by all automated approaches

Table 13 provides a summary of all the various approaches used to predict holistic essay quality scores (EXPQUAL) and student understanding as assessed by comprehension test scores (IVT). Although the measures derived from hand coding of the students’ arguments are the best predictors of the explanation quality score, and the measures derived from hand coding of the a priori causal model are the best predictors of comprehension test scores, a combination of several automated scores provides relatively good prediction for both dimensions of student performance. The addition of ML metrics explains additional variance beyond LSA and Coh-Metrix measures for both outcomes.

Table 13 Summary of variance explained (R2) in explanation quality scores (EXPQUAL) and comprehension test scores (IVT) as predicted by different methods

Discussion

The results show promise in combining multiple automated methods as part of an attempt to approximate the success of hand-coding approaches in assessing the quality of student understanding from written explanations. This study focused on understanding of a single topic, using a single explanation essay to serve as an assessment of each student’s understanding at a single time point, and using a comprehension test as the criterion measure. Using this approach, both hand-coding approaches achieved high interrater reliability and yielded scores on multiple dimensions that were predictive of performance on the comprehension test. The hand-coding method based on the a priori causal model of the phenomenon of global warming was the best single predictor of comprehension test performance. This makes sense given that the comprehension test itself depended directly on whether students had constructed the mental models needed to verify potential causal relations, while being less relevant to whether students could, or would be inclined to, write essays that explicate how every causal relation ultimately impacts the target phenomenon. The coding approach based on the a priori model may be more sensitive to variation in understanding of causal relationships within various parts of the model, but it is also less sensitive to how well students can explicate in writing how all those relationships fit together and articulate causal chains that ultimately lead to the outcome to be explained.

In addition, several measures were less important than expected, or than they appeared from simple correlations. Essay length did not predict essay quality or comprehension scores once more direct measures of coverage and connectedness were taken into account. Similarly, responsiveness to the essay prompt no longer predicted performance after coverage and connectedness were included in the regression models. These results suggest that both length and responsiveness to a prompt may in some cases serve as proxies for coverage or thematic focus within an explanation, but that directly scoring the content and structure is a more powerful approach. That is, longer essays will not always be better explanations, just as longer summaries may be less focused and can contain irrelevant details (Wade-Stein and Kintsch 2004).

Interestingly, the presence of citations or references to particular documents in student explanations was, if anything, a negative feature. Readers who were more likely to refer to the documents when writing about this science topic were less likely to focus on the most important information. This suggests that these students may have been engaging in a knowledge-telling approach of simply relating information from each source, as opposed to a knowledge-transforming approach in which they selected the most important information in an attempt to integrate it. Although the presence of citations when writing from multiple documents in history is usually related to better quality essays (e.g., Britt and Aglinskas 2002), this may be due to the important role of sourcing in evaluating historical documents, which may be particularly needed when a document set includes opposing theories or discrepancies (Bråten et al. 2009; Rouet et al. 1996); that was not the case in this particular activity. Under other circumstances, such as when contradictions between sources are present and need to be reconciled, sourcing may emerge as a more positive feature.

In sum, the results from the hand-coding approaches demonstrated that measures representing the coverage of key ideas in the essays, and the extent to which those ideas are connected or integrated, were critical features for predicting understanding. As existing technologies, LSA and Coh-Metrix were worth exploring next, to determine how well they could capture the quality of student understanding before substantial effort was invested in machine-learning approaches.

Our attempts to use out-of-the-box tools were to some extent successful, and metrics derived from LSA and Coh-Metrix were found to predict a significant amount of the variance for each outcome measure. As found in previous work (Ventura et al. 2004), computing similarity to an idealized peer essay with LSA provided a useful metric that predicted both the hand coding and student understanding. The LSA-based plagiarism score was also useful. Copying scores were negatively correlated with overall explanation quality, and this relation became even stronger when similarity to the idealized peer essay was also included in the regression model. The idealized peer explanation itself had a modest copying score, and the similarity-to-idealized-essay scores were positively correlated with copying scores in simple correlations. The fact that the relation between copying scores and explanation quality became more strongly negative once similarity-to-idealized-essay scores were added to the regression model suggests that the simple correlation might partly reflect the tendency for students to copy the most task-relevant sentences from the texts. Thus, the stronger negative relation for copying scores in the regression that already included similarity-to-idealized-essay scores represents copying of less relevant sentences from the original sources. Although actively transforming information may be the best strategy for understanding, selecting and copying relevant information may be better than writing irrelevant information or failing to engage with the text at all. Also, since none of the documents specifically provided an answer to the essay question, even copying isolated sentences entailed some level of repurposing of information. In addition to these reasons, the low proportion of copied sentences in these essays (the average was only around 30%) may explain why the plagiarism scores did not have a negative relation with understanding in this study.

In contrast, the metrics derived from Coh-Metrix were the poorest predictors of comprehension test performance, and were only weakly related to the measures derived from hand scoring. The causal metric also showed no relationship to how similar each essay was to the idealized peer explanation essay. It is notable that the idealized peer explanation was only about average in its Coh-Metrix causal score, which might be because that metric gives credit for the redundant or task-irrelevant causal terms that appeared in many essays. In addition, the standard generic causal terms used by Coh-Metrix may be unable to recognize topic-specific expressions of causal relations in this particular context (e.g., “CO2 traps heat”). These issues may have obscured relations that might otherwise have been seen with a more topic-specific measure of causality.

Yet, even though the relations between the metrics derived from Coh-Metrix and student understanding were modest, they were still interesting insofar as they provided a contrast to prior work that has used Coh-Metrix to explore the compositional quality of persuasive student essays. While prior work has found negative relations between cohesion and causal expressions, on the one hand, and expert ratings of composition quality, on the other, in the present work these features were positive predictors of student understanding. Similarly, while prior work has suggested that lexical diversity may be a positive predictor of expert ratings of composition quality in persuasive essays, it was generally a negative predictor of explanation quality in this study; only once the coverage and connectedness of student explanations were taken into account in the regression models did lexical diversity emerge as a positive predictor. One interpretation of these results is that lexical diversity may sometimes stand in for a third variable (such as student ability or verbal intelligence), and it may predict expert ratings of essay quality because more-able students generally produce better essays for a wide variety of reasons. In the context of an explanation essay, however, using a more diverse set of words to describe a particular phenomenon may be a sign of a lack of focus on developing an integrated causal model of that phenomenon.

Finally, the best prediction of explanation quality from the automated measures came from a model that included the machine-learning scores in addition to the LSA and Coh-Metrix indices. The new machine-learning approach did a reasonable job of learning and applying the coding rules employed in the more structure-sensitive hand-coding system, and the machine-learning scores added 8% to the total variance explained over and above the contributions of LSA and Coh-Metrix. The machine-learning scores also improved the prediction of comprehension test performance over the other automatic methods for detecting structure. These results support the key conclusions of this study: they point to the utility of using this machine-learning approach in combination with LSA, for detecting the similarity of a response to an idealized response and to the original sources, and Coh-Metrix, for detecting similarity, focus, and causality within a response. They suggest promise for hybrid methods that combine measures that are good at detecting content (similarity to sources, similarity to an idealized essay, ML concepts) with measures that are good at detecting structure (similarity to the idealized essay, cohesion, causality). It is possible that these methods may eventually be applied within AES systems as well.

One reason why the benefits from the machine-learning approach were so modest in the present study may be the complexity of this particular global warming document set. In other studies, we have begun using simpler document sets for inquiry tasks on coral bleaching (“explain how and why coral bleaching rates vary at different times”) and skin cancer (“explain how and why rates of skin cancer differ around the globe”). The inquiry prompts for these activities still require inferences across multiple documents, but both document sets are less complex than the global warming set: they have fewer and shorter documents, fewer initiating causes, and fewer and simpler elements. The causal model for the global warming text set could be viewed as very complex on all of these dimensions, while the two newer text sets are only moderately complex, with only 2 initiating causes and 10 key elements across 5 documents. For both topics, human and machine-learning scoring of explanation quality were found to be highly correlated (Hughes et al. 2015), and the machine-learning approach was better able to predict student understanding of the coral bleaching and skin cancer units from the student essays.

The current study focused on detecting student understanding of a single topic, using a single explanation essay to serve as an assessment of each student’s understanding at a single time point. However, in most cases, developing a coherent understanding of a topic will require working through ideas, building and revising explanatory models, and constructing understanding iteratively over time. Such a process requires revision, and providing real-time, tailored feedback to students can facilitate and enable this process. The long-term goal for this work is to enable near-instantaneous calculation of what is included in student explanatory essays and what is missing, which would provide the basis for an intelligent tutor that could help students improve the quality of their written explanations as well as their understanding of the subject matter. Because achieving this goal requires the ability to provide detailed feedback about the quality of the reasoning present in explanations (such as whether they include explicit connections to the target outcome or to other initiating causes), an assessment of the structure of students’ explanations is needed, which is what the present machine-learning approach attempted to capture by sorting essays into quality categories.

A recent study used a similar set of quality categories to give college students feedback on initial drafts of explanation essays written as part of the simpler coral bleaching inquiry unit (Kopp et al. 2016). After writing initial explanations, students were randomly assigned to either receive targeted feedback (in relation to the completeness or coherence of their essays as indicated by the quality categories) or no feedback about their drafts (students were simply asked to revise). The targeted feedback prompted students to create longer chains and to give more complete answers, and was intended to benefit those students who failed to include intervening elements or multiple initial causes. Overall, students included significantly more connected concepts in their explanations after revision. However, the targeted feedback condition particularly helped those whose initial essays were of poor quality. Receiving appropriate feedback helped them significantly improve their explanations and learn more from the activity.

These results are promising, and such a multidisciplinary approach to providing feedback may eventually have utility in a classroom setting. With current calls in science education for students to learn about explanation and argumentation in science classes, and the increasing appreciation of writing-to-learn activities, an intelligent tutoring system that can give immediate feedback based on student understanding would be helpful to teachers in the classroom. Other areas of research have shown the importance of feedback and revision for student progress. For example, a meta-analysis on the effectiveness of feedback on the quality of student compositions has shown moderate effect sizes (e.g., 0.77, Graham and Perin 2007). Similarly, revision is essential to improving the quality of written compositions (Flower and Hayes 1981). At present, however, much more work needs to be done to extend these findings from demonstrating the effectiveness of feedback and revision in a learning-to-write context to demonstrating their effectiveness in a writing-to-learn context (i.e., as part of subject-matter learning).

Finally, even without the explicit metacognitive emphases of iSTART (McNamara et al. 2007a, b) and MetaTutor (Lintean et al. 2011), it is hoped that a system providing explicit feedback on specific weaknesses in student explanations will lead to more complete reasoning and better learning from multiple-document inquiry tasks, which in turn might transfer to and support better performance in other writing-to-learn tasks (as in Britt et al. 2004). There are many other types of writing activities that may be employed besides causal explanations or arguments (Braaten and Windschitl 2011), and we believe our approach can be extended to detect student understanding from other types of open-ended responses, such as problem-solution or compare-and-contrast essays. Given the differences that have been observed between the features predicting the quality of persuasive and explanatory essays, exploring the detection of student understanding from different essay types will be an important step for future work.