Published in: Social Network Analysis and Mining 1/2024

Open Access 01.12.2024 | Original Article

Big data meets storytelling: using machine learning to predict popular fanfiction

Authors: Duy Nguyen, Stephen Zigmond, Samuel Glassco, Bach Tran, Philippe J. Giabbanelli


Abstract

Fanfictions are a popular literature genre in which writers reuse a universe, for example to transform heteronormative relationships with queer characters or to bring romance into shows focused on horror and adventure. Fanfictions have been the subject of numerous studies in text mining and network analysis, which used Natural Language Processing (NLP) techniques to compare fanfictions with the original scripts or to make various predictions. In this paper, we use NLP to predict the popularity of a story and examine which features contribute to popularity. This endeavor is important given the rising use of AI assistants and the ongoing interest in generating text with desirable characteristics. We used the two main websites to collect fan stories (Fanfiction.net and Archive of Our Own) on Supernatural, which has been the subject of numerous scholarly works. We extracted high-level features such as the main character and sentiments from 79,288 of these stories and used the features in a binary classification supported by tree-based methods, ensemble methods (random forest), neural networks, and Support Vector Machines. Our optimized classifiers correctly identified popular stories in four out of five cases. By relating features to classification outcomes using SHAP values, we found that fans prefer longer stories with a wider vocabulary, which can inform the prompts of AI chatbots to continue generating such successful stories. However, we also observed that fans wanted stories unlike the original material (e.g., favoring romance and disliking when characters are hurt), hence AI-powered stories may be less popular if they strictly follow the original material of a show.
Notes

Supplementary Information

The online version contains supplementary material available at https://doi.org/10.1007/s13278-024-01224-x.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

1 Introduction

Writers have long followed in the footsteps of their predecessors. Clark Ashton Smith was openly influenced by the style of Edgar Allan Poe (Carter 1976) [p. 110], and Lovecraft made no mystery that he was inspired by the stories of both authors while crafting his own universe. Reusing plot devices or story-stratagems is common practice in genres as diverse as Sword & Sorcery or poetry, as the three-time Pulitzer winner MacLeish once put it: “a real writer learns from earlier writers the way a boy learns from an apple orchard–by stealing what he has a taste for, and can carry off” (Carter 1973) [p. 158]. In the case of fanfiction, authors reuse the official material (i.e., the ‘canon’ including original characters and settings) of one or multiple authors, while allowing significant departure in style (e.g., a horror fiction may become a romance). Fanfictions are unlike stories written by professional writers with the agreement of right-holders and a focus on sticking to the canon by developing the viewpoints of minor characters (e.g., the monster squatting the trash compactor in Star Wars (Okorafor 2017)). Fanfictions are a popular literature and transformative work in which possibly amateur writers reuse a universe without seeking approval from right-holders. Motivations include closing gaps in the original story (Koltochikhina and Tsepkova 2020) or recasting content as exemplified by the ‘world-queering’ practice of transforming heteronormative relationships with queer characters (Floegel 2020; Llewellyn 2022). Scholars have used fanfiction to advance writing skills (Sauro and Sundmark 2019; Leigh 2020) owing to its blurry situation between creative writing and literary criticism (Petersen-Reed 2019).
Fanfictions are not a new phenomenon. Goethe’s 18th-century novel The Sorrows of Young Werther was followed by hundreds of stories that reused the characters, known as Wertheriaden (Birkhold 2019). Fanfictions related to television shows are also well established, with seminal studies such as Jenkins in the early 1990s (Jenkins 1992). However, the internet has enabled fanfictions at a new scale, resulting in a ‘web literature’ (Koltochikhina and Tsepkova 2020). The internet enables the production of fanfiction, as authors do not need to fear censorship, legal implications, or other professional consequences (Walls-Thumma 2019). The medium also facilitates the consumption of fanfiction as stories are usually free to read, unlike officially sanctioned works that are typically bought as physical copies (Datlow 2017). Most importantly for this paper, the growing availability of online fanfiction together with the rising sophistication of Natural Language Processing (NLP) techniques has powered new interdisciplinary lines of research in social network analysis and text mining.
Previous analyses of fanfictions with NLP have served to answer a variety of questions. In this paper, we focus on research that is fully automated and performed over a large scale. For example, the recent analysis by McCloskey et al. is out of scope since its corpus was in the scale of hundreds and because NLP methods were used alongside human inspection (McCloskey et al. 2022). Several large-scale, fully automated studies from social computing researchers have investigated fanfiction platforms with respect to community engagement, for example in terms of guiding writers (e.g., mentorship, critical feedback) or fostering solidarity via creative collaborations (e.g., for marginalized groups). For example, researchers found that the reviews provided by peers (known as ‘distributed mentoring’) had a statistically significant impact on a writer’s vocabulary, as measured by lexical diversity (Frens et al. 2018). Reviews have also been examined for sentiment analysis by using a trained classification model (also known as ‘classifier’) to associate emotional responses to text. Certain characters within fanfiction stories can elicit different responses based on their actions and portrayal. Researchers found that the words surrounding mentions of characters had a statistically significant impact on readers’ emotional responses (Milli and Bamman 2016).
Classifiers have also been trained on a large number of stories to make predictions, for example by inferring the next spell cast in Harry Potter fanfictions (Vilares et al. 2019). Using data up to March 2016 from the Hugo Award-winning fanfiction hosting site Archive of Our Own (AO3), Jing et al. performed the related task of regression to predict the popularity of a story as a function of the novelty of its word patterns, while controlling for metadata (e.g., age rating, warnings for sensitive elements) (Jing et al. 2019). A more recent study also examined the popularity of stories by collecting a few thousand samples from five domains (e.g., Harry Potter, Twilight), extracting characters and sentiments, and also performing a regression on popularity. This study exemplifies the challenge of predicting popular stories, as the authors reported having “found none of the mentioned variables relevant to the popularity of fanfictions”, with adjusted regression scores ranging from 0.14 to 0.28 (Sourati Hassan Zadeh 2022). Our study revisits the challenge of predicting popular stories by focusing on one universe to extract a large corpus and leverage modern methods that go beyond the metadata examined in prior works.
We investigate fanfictions about the TV show Supernatural, which has the largest volume of stories related to a TV show, amounting to over 253,000 stories on AO3 and more than 126,000 stories on Fanfiction.net, as of February 2023. The longevity of the show (15 seasons) is attributed in part to its passionate fanbase (calling itself the ‘SPNFamily’), which was noted as having the largest amount of engagement on other social media platforms (Myrick 2019). The show creator has been repeatedly ‘flattered’ by fanfiction (Damore 2019) and supportive of the “inclusive community that’s formed around watching and interacting with the show” (Frith 2015). Although scholars have argued that the relation between the series' creators and the audiences was not always productive (Guirola 2023), this relationship still resulted in incorporating some of the fandom into the show (Zubernis 2021). This contrasts with the tumultuous relations between fanfiction authors and writers in other shows (Michaud Wild 2020), or even the framing of fanfiction as a subversive act (Wang 2019). Due to the massive number of stories, engagement across platforms and interplay with the show creators, many scholarly works have been devoted to both the Supernatural show (Gonçalves 2015) and its fanfictions (Åström 2010; Flegel and Roth 2010; Tosenberger 2008; Herbig and Herrmann 2016), as well as edited volumes (Taylor and Nylander 2019; Macklem et al. 2020). Focusing on Supernatural thus allows us to contribute to an existing body of literature while leveraging a sufficient volume of data to use modern text mining techniques.
Our main contribution is to demonstrate that machine learning techniques can efficiently use high-level descriptions to correctly infer whether a fanfiction is popular four out of five times. Our demonstration rests on three consecutive steps: scraping stories and their metadata (as in previous studies), performing feature engineering to add 24 features (e.g., number of characters, main characters, tone analysis for the two main protagonists) via Watson NLP and Google’s Natural Language API, and thoroughly optimizing a variety of classifiers (e.g., support vector machines with four types of kernels). In Sect. 2, we provide a succinct background on NLP techniques for fanfictions. The details of our three steps are provided in Sect. 3, culminating in the results shown in Sect. 4. Finally, the implications both for NLP research and fanfiction scholarship are discussed in Sect. 5.

2 Background: natural language processing and fanfiction

If the volume of data only consists of a few dozen fanfictions, then experts can manually perform an accurate thematic analysis (Table 1, bottom three rows). However, as the volume rises to thousands and even millions of stories, researchers have to accept a loss in accuracy in exchange for the ability to perform an analysis at scale. This is the classic trade-off between accuracy and volume encountered for Natural Language Processing across application fields (Galgoczy et al. 2022). Network analyses, sentiment analyses, and thematic analysis are common tools of the trade in NLP research as they serve to link entities (e.g., individuals, places, events), assess the tone of the text, and track the subject of a text; all three analyses can be performed in a single study (Sandhu et al. 2019). In the case of fanfiction, these tools have served to address five questions, which were evoked in the introduction and are elaborated upon here.
Fanfiction websites are not solely about the stories created by individuals. They are also about a community, where readers and authors provide feedback to improve an author’s writings (Frens et al. 2018; Stenger 2021). Since fans interact because of a shared interest, fanfiction communities are “prototypical examples of online affinity spaces and networks” (Cheng and Frens 2022). Fanfiction websites are thus analyzed with respect to both text and community interactions (Frens et al. 2018; Kleindienst and Schmidt 2020). To automate this analysis, researchers have used the Measure of Textual Lexical Diversity (MTLD) at the level of chapters within each story and found that the MTLD increased with the number of reviews received (Frens et al. 2018). The idea that writers help each other as a community received further evidence in a follow-up study (Froelich et al. 2021). The authors trained the BERT classifier to recognize reviews that provided specific rather than generic feedback, and they found a moderate (albeit statistically significant) correlation between giving and receiving constructive feedback. The structure of the network inferred by relationships between fanfiction readers and authors was also analyzed in a dedicated study (Davis 2021) that leveraged 16 years of data from Fanfiction.net (28 million chapters, 177 million reviews, 10 million people). Collectively, these studies demonstrate how the topic of reviewing fanfiction combines tools of the trade such as network analysis and classifiers.
Table 1
Overview of studies on fanfiction, starting with NLP techniques and contrasting them with manual techniques in the bottom three rows
| Ref | Corpus: #stories | Corpus: universe | Corpus: year | Methods | Research topic |
|---|---|---|---|---|---|
| Frens et al. (2018) | 6,828,943 | Fanfiction.net | 2018 | Lexical diversity scores; thematic analysis | Effect of mentoring |
| Wolska et al. (2022) | 7,866,512 | AO3 | 2022 | BERT, Support Vector Machines | Predict violence trigger warnings |
| Schmidt et al. (2022) | 82,050 | AO3 crossover fiction | 2022 | Social network analysis | Frequency of appearances of original characters in mixed universes |
| Vilares et al. (2019) | 82,836 | Fanfiction.net Harry Potter | 2019 | Neural networks (LSTM) | Predicting future spells |
| Yin et al. (2017) | 6,807,100 | Fanfiction stories | 2017 | Aggregating metadata | Creating a standardized dataset |
| Jing et al. (2019) | 609,812 | AO3 | 2019 | TF-IDF; thematic analysis (LDA) | Relationship between novelty and popularity |
| Milli and Bamman (2016) | 5,983,038 | Fanfiction.net | 2016 | Relations/actions per character via BookNLP and thematic analysis (LDA) of reviews | Compared characters in fanfiction vs. canon. Also predicted readers’ reactions (80.5% accuracy) |
| Fedotova et al. (2023) | 6,569 | ficbook.net for Harry Potter, Marvel, Sherlock, Naruto, and Star Wars | 2022 | Support Vector Machine and Genetic Algorithms | Authorship attribution for texts in Russian |
| Kleindienst and Schmidt (2020) | 7,000 | Supernatural fanfiction on AO3 and original scripts | 2020 | Unspecified | Compared plot of original show vs. fanfictions |
| Rowe et al. (2021) | 7,000 | Harry Potter fanfiction on AO3 and original books | 450 | Word2Vec | Compared traits of Hermione in books vs. fanfictions |
| Barker (2002) | 100 | (Unspecified) websites on homo-erotic fanfictions about Buffy | 2002 | E-mailed questionnaires to authors then manual thematic analysis of responses/fanfiction | Prevalence of same-sex relationships compared to other franchises |
| Black (2006); Santilli (2010) | Unspecified | Anime section of fanfiction.net | 2006 | Manual thematic analysis of specific authors over time | Development of literacy skills among young English Language Learners |
| Black et al. (2019) | 35 | Harry Potter stories that portray autism | 2019 | Qualitative thematic analysis | How experiences with autism shapes its portrayal in fanfictions |

Characters are central in stories, but automatically tracking them can be arduous because they can be designated through multiple words (e.g., ‘Cynthia’, ‘She’, ‘The sorceress’). A co-reference resolution system identifies characters across multiple words, thus enabling greater textual and character analysis. Several tools have been proposed for co-reference resolution, such as Yang’s use of neural networks (LSTM) trained on Jane Austen’s Sense and Sensibility (Yang 2022), or FantasyCoref, trained on Grimm’s Fairy Tales, Alice’s Adventures in Wonderland, and two stories from the Arabian Nights (Han et al. 2021). In the case of fanfiction, the specialized tool is FanfictionNLP (Yoder et al. 2021). The tool can be used to extract and attribute quotes to the right characters, thus enabling studies on how specific characters express themselves throughout a story. It can also serve to create a character network, where nodes represent the characters and edges denote relationships between characters (Schmidt et al. 2022; Labatut and Bost 2019). Such a network can show the most frequent characters and types of relationships (e.g., male–male, female–male) (Schmidt et al. 2022). In another instance, the authors used the network’s signature (e.g., eigenvectors) to infer the genre of the story (Agarwal et al. 2021).
In addition to supporting the aforementioned lines of inquiry, classifiers have been used to address many other questions in fanfiction. Classifiers support sentiment analyses, which can be performed either on the story (Kim and Klinger 2019) or on the reviews (Milli and Bamman 2016) to analyze the readers’ emotional response. Trigger warnings (e.g., sexual content, physical or verbal violence) have also been automatically assigned to stories by training BERT and a Support Vector Machine (Wolska et al. 2022). Researchers have also created classifiers to predict future spells in Harry Potter fanfictions (Vilares et al. 2019), which potentially enables action models for other fandoms or action types.

3 Methods

Our workflow is detailed in the next subsections, following the order summarized in Fig. 1.
For transparency, our scripts, curated dataset, and complete results are accessible without registration on a permanent storage in a third-party repository at https://osf.io/g3p7a/.

3.1 Data collection

Our focus is to collect fanfiction about Supernatural in English. Given this language constraint, we cannot tap into the vast Internet literature in other languages such as Chinese, where the largest platform (Cloudary Corporation) officially reports 10 million registered users per day (Lu 2016). Fanfictions in English can be found on several websites (Table 2). While Wattpad has been the subject of prior studies on fanfictions, these studies are either qualitative (Budiarto et al. 2021) or use small sample sizes of a few thousand stories (Pianzola et al. 2020). The two main sources by volume are AO3 and Fanfiction.net. This result is aligned with prior works showing that AO3 is by far the main source and has been rising (McCullough 2023), while Fanfiction.net is a secondary source with a declining market share (Fiesler and Dym 2020). Prior works have used either of these two sources (Table 1), as their terms of service are compatible with the use of computer programs to automatically download content (i.e., data scraping).
Table 2
Volume of fanfiction stories on Supernatural (in thousands) per website. We used the two main sources (AO3 and Fanfiction.net) and a sample of 79,288 stories

| Website | # of Supernatural fanfictions | Date created |
|---|---|---|
| Archive of Our Own (AO3) | 255k | November 2009 |
| Fanfiction.net | 126k | October 1998 |
| AsianFanfics | 9k | 2009 |
| Wattpad | 1k | December 2006 |

In October 2022, we collected Supernatural stories and metadata from both AO3 and Fanfiction.net to achieve a diverse corpus. We used the Python scraper for AO3 created by Jingyi Li and Sarah Sterman (Li et al. 2017). The scraper places at most one request every five seconds. Fanfiction.net uses Cloudflare, which tends to detect web scrapers as hostile bots. We thus used a webdriver (Selenium) to automate the requests. To avoid creating an undue load on either website and remain within the terms of a fair use,1 we scraped 100,310 stories from Fanfiction.net and 72,300 from AO3. The scrapers collected metadata alongside each story. Seven features were obtained for both websites: a unique story ID, the title, URL, rating,2 language, publication date, and word count. On AO3, we also obtained the number of ‘kudos’3 (for popularity), number of views, and number of chapters. On Fanfiction.net, we obtained the number of ‘favs’ (for popularity), the number of followers and reviews, the author, and the genre. As exemplified in log-log plots (Fig. 2), there is significant variance in the content of the stories that we collected with respect to the attention that they receive and the size of the story. In Fig. 3, we also note a small correlation between the attention that a story receives (measured by number of reviews) and the extent to which readers endorse it (as measured by ‘likes’ or ‘favs’). In the context of online shopping, popular products are defined as “products with many reviews” (Heck et al. 2020), hence we also expect a correlation between measures of popularity in our context.

3.2 Feature engineering

When using traditional classification methods (detailed in the next subsection), metadata is useful but not sufficient to accurately characterize why a story is popular. We thus need to perform feature engineering to extract additional (potentially) informative features. As emphasized by Minaee and colleagues, this “reliance on the hand-crafted features requires tedious feature engineering and analysis to obtain good performance” (Minaee et al. 2021). Indeed, Sect. 2 showed that many options are available, from sentiment to character networks. It is thus common to start by becoming acquainted with the corpus, which informs the choice and configuration of tools for the ensuing automatic analysis. Four readers independently examined three stories each, with a minimum of 1,000 words, and then collectively synthesized characteristics of stories that readers appeared to like. These characteristics were related to whether the story had a summary, a disclaimer, or author notes; the number of locations, characters, and dialogs; the main character, overall emotion of the story4, lexical diversity (i.e., number of unique words) and average word length, and sentiments associated with the two key protagonists of the TV show (Sam and Dean Winchester).
We captured sentiments through seven dimensions for each protagonist (excitement, satisfaction, politeness and lack thereof, sympathetic, frustration, sadness). The distributions of sentiments across stories were similar for the two protagonists and centered on four positive dimensions (excited, satisfied, polite, sympathetic), while the remaining three were much less prevalent (Fig. 4). Note that these seven dimensions are not perfectly orthogonal, as evidenced by the high correlations in Fig. 5; the distributions underlying each correlation are provided online in Supplementary Figures 1 and 2. Overall, our approach added 24 engineered features to the three obtained from web scraping (Table 3). We computed the engineered features using Python libraries including Watson NLP from IBM and Natural Language API from Google. These services start to incur a significant cost at a large scale, hence we created engineered features for a sample of 79,288 stories given a target budget of \(\$5,300\).
Two of the features contain categorical data: the genre provided by the author, and the main character that we detected via NLP. Machine learning algorithms commonly require categorical data to be turned into numerical data. We adopt a common machine learning approach that utilizes “a one-hot encoding technique to convert string labels to numerical labels” (Wanda and Jie 2021). If a given feature has N categorical values, then it is replaced by N binary features which indicate the absence (0) or presence (1) of each possible value.
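To make the encoding concrete, the following minimal sketch applies one-hot encoding to a toy table. It assumes pandas is available; the column names and rows are hypothetical and are not taken from the authors' dataset or scripts.

```python
import pandas as pd

# Toy rows: names and values are illustrative assumptions, not the real corpus.
stories = pd.DataFrame({
    "genre": ["Family", "Hurt", "Supernatural", "Family"],
    "main_character": ["Sam", "Dean", "Sam", "Jo"],
    "num_words": [1200, 5400, 800, 2300],
})

# A categorical column with N distinct values is replaced by N binary columns.
encoded = pd.get_dummies(stories, columns=["genre", "main_character"], dtype=int)

print(encoded.columns.tolist())
```

In this toy case, the two categorical columns each have three distinct values, so they are replaced by six binary columns while the numerical column is left untouched.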
Table 3
Four features were obtained from the metadata during web scraping. We engineered 24 additional features. We used one-hot encoding for the two categorical features (genre, main character)

| Feature | Provenance | Example values | Meaning |
|---|---|---|---|
| Genre | Web scraping (metadata) | Family, Hurt, Supernatural | The author indicates the main themes in their story |
| #Words | Web scraping (metadata) | Positive non-zero integer | Auto-populated to count the number of words in the story |
| #Reviews | Web scraping (metadata) | Positive integer | Auto-populated to count how many reviews were written by readers |
| Favs/Kudos | Web scraping (metadata) | | Auto-populated to count the popularity of the story (akin to ‘likes’) |
| #Locations | Engineered (NLP) | | Used NLP to detect and count places in the story |
| #Characters | Engineered (NLP) | | Used NLP to detect and count (unique) people in the story |
| Main_Character | Engineered (NLP) | Sam, Jo, Eileen | Used NLP to find the person with the largest number of appearances |
| Sam_excited, Sam_satisfied, Sam_polite, Sam_sympathetic, Sam_frustrated, Sam_sad, Sam_impolite | Engineered (NLP) | Float in the [0, 1] interval | Used NLP to report the fraction of a story in which Sam’s character expresses certain sentiments. The same process is performed for the other character (Dean) |
| Num_dialogs | Engineered (NLP) | Positive integer | Used NLP to count the total number of dialogs in the story |
| Emotion | Engineered (NLP) | Integer | Overall emotion of the story |
| AvgCharInWord | Engineered | Positive float | Use Python to count the average length of a word |
| #UniqueWords | Engineered | Positive integer | Number of unique words in the story, as a proxy to lexical diversity |
| hasSummary | Engineered (metadata) | Binary | Whether the author provided a summary |
| hasDisclaimer | Engineered (metadata) | | Whether the author wrote a disclaimer |
| hasAuthorsNotes | Engineered (metadata) | | Whether the author accompanied the story with notes |

We also created a new binary class attribute, whose value (whether a fanfiction is popular or not) is the target of the classification process detailed in the next subsection. ‘Success’ is a fuzzy construct with a subjective interpretation, just like being ‘rich’ or ‘tall’. We set the threshold for a popular story so that the dataset is about evenly split, thus avoiding effects of data imbalance that would be caused by other thresholds. As a result, a ‘successful’ story must be liked by at least ten persons (i.e., ten or more ‘kudos’ for AO3 or ‘favs’ for Fanfiction.net), whereas the other half of stories are ‘unsuccessful’ because they have fewer than 10 likes.
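The thresholding described above can be sketched in a few lines. The like counts below are made up for illustration; only the threshold of ten comes from the text, and the variable names are hypothetical.

```python
POPULARITY_THRESHOLD = 10  # from the text: ten or more kudos/favs means popular

# Hypothetical like counts for six stories (not real data).
likes = [0, 3, 9, 10, 11, 250]

# Binary class attribute: 1 = popular, 0 = not popular.
is_popular = [int(count >= POPULARITY_THRESHOLD) for count in likes]

# Share of the positive class, to check that the split is roughly even.
positive_share = sum(is_popular) / len(is_popular)

print(is_popular, positive_share)
```

Checking the positive share after thresholding is a quick way to confirm that a chosen cut-off does not introduce class imbalance.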

3.3 Classifiers and hyper-parameter optimization

A classifier is a function that predicts a class given certain features. In our case, we use the features described in the previous subsection to predict whether a fanfiction is popular, hence we perform a binary classification. A longstanding practice in text classification is to use different algorithms to train classifiers (Kadhim 2019), in order to identify the right type of function based on performance. For example, some algorithms are well-suited when the data is linearly separable while others are able to make nonlinear cuts (Fig. 6). In addition, certain algorithms specialize in massive amounts of data or in specific data types (e.g., images). Our data consists of 79,288 rows and 25 columns (24 features and 1 class outcome) structured in a tabular format. Since the complexity and the volume of the data are not aligned with deep learning, we focus on a classic approach and employ two of the most commonly used methods for text classification (Aggarwal and Zain 2012): decision trees and support vector machines. Together, these methods cover both linear and nonlinear function hypotheses (Crutzen and Giabbanelli 2014). Decision trees work well with linear decision boundaries (Kowsari et al. 2019) and they can be trained quickly. Support Vector Machines (SVMs) have been regularly employed for classification with text as they can handle non-linear cases using the ‘kernel trick’ (Fig. 6-right); they resemble a logistic regression when using a linear separation. SVMs are among the most resource-intensive models to train (Kadhim 2019), with the exception of deep learning models. In order to provide a comprehensive set of baseline algorithms for comparison, we also include random forests (i.e., sets of decision trees), logistic regression, and a neural network with 8 layers.
For the decision tree, we optimized two hyper-parameters that limit the cuts that can be made and hence force a simplification of the model. Setting a maximum depth to the tree prevents too many successive cuts, which can arise when the algorithm attempts to isolate a few points and hence causes an overfit. When a decision tree makes a cut, it intuitively divides the data into ‘left’ and ‘right’ sides, where different features can be selected for the next cuts. The total number of features used by a tree of depth d thus scales with \(2^d\). If we want each of our 24 features to be used potentially at least once, then we need \(2^d > 24\), hence we can pick \(d=5\). If we want to over-provision and potentially use each feature three times, then \(d=7\) would suffice since \(2^7 > 3 \times 24\). We thus considered three values of the maximum depth to force a simplification, use each feature once, or over-provision. Raising the minimum number of samples to split also avoids creating cuts in areas that lack data. The impact on the tree was discussed in Rosso and Giabbanelli (2018). For a support vector machine, choosing the right kernel is a notoriously difficult problem (Kowsari et al. 2019), hence we considered four types of kernels. The linear kernel is most useful when data can be linearly separated, which particularly applies to text (Pillutla et al. 2020). We employed the Gaussian Radial Basis Function (RBF) kernel as it is a common alternative to a linear kernel, and the most widely used form of kernels relying on an exponential (alternatives include the Laplace RBF kernel of the Gaussian kernel). We chose the sigmoid kernel as a proxy to small neural networks (i.e., a two-layer perceptron), where other options include the hyperbolic tangent kernel. Although polynomial kernels have known limitations (Steinwart 2001), we included them since they have been used in prior works on text classification (Kalcheva et al. 2020).
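The choice of depths can be checked with a short calculation. The sketch below assumes a full binary tree in which a tree of depth d has \(2^d - 1\) split nodes and each split uses one feature; the helper `min_depth` is a hypothetical name introduced for illustration, not part of the authors' scripts.

```python
def min_depth(num_features: int, uses_per_feature: int = 1) -> int:
    """Smallest depth d whose full binary tree has enough split nodes.

    Assumption: a full binary tree of depth d has 2**d - 1 split nodes,
    and each split node tests exactly one feature.
    """
    needed = num_features * uses_per_feature
    depth = 0
    while 2 ** depth - 1 < needed:
        depth += 1
    return depth


depth_once = min_depth(24)        # each of the 24 features potentially used once
depth_thrice = min_depth(24, 3)   # over-provisioning: three uses per feature

print(depth_once, depth_thrice)
```

Under this assumption, the calculation recovers the depths of 5 and 7 used in the text, with a depth of 3 serving as the deliberately simplified option.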
For each of these four kernels, we optimized multiple kernel-dependent hyper-parameters. Since training an SVM is computationally and memory expensive, a comprehensive optimization process could quickly become prohibitive and force us to use heuristics (Dudzik et al. 2021). We thus focused on binary parameters and limited the number of levels for numerical parameters.
The optimization process for the decision tree, random forest, and SVM used a grid search5 to consider all combinations of values listed in Table 4. Since we need to both obtain robust performance estimates and perform a grid search, we divide the data into training, testing, and validation sets. This division is conducted through a 10×10 nested cross-validation, also known as a double cross-validation. That is, the data is first split into 10 parts (known as outer folds), nine of which are used for model building and one for testing, until all parts have been involved. For each of the ten instances of model building, this portion of the data is further divided into 10 parts (known as inner folds), nine of which serve to train the model and include a grid search, while the remaining one serves to measure the initial validation. We used the classic Adam optimizer and gradient descent for the neural network, with Binary Cross Entropy as the loss function. Our optimization process provides three metrics (accuracy, precision, recall) for each of the ten outer folds, which allows us to compute a confidence interval and thus estimate the robustness of the results vis-à-vis the data.
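A nested cross-validation of this kind can be sketched with scikit-learn as follows. This is an illustration under stated assumptions rather than the authors' script: the data is synthetic, the folds are stratified, the inner search selects on accuracy, and the interval is a Student-t 95% confidence interval, none of which is specified in the text. Only the decision-tree grid comes from Table 4.

```python
import numpy as np
from scipy import stats
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_validate
from sklearn.tree import DecisionTreeClassifier

# Decision-tree grid from Table 4.
PARAM_GRID = {"max_depth": [3, 5, 7], "min_samples_split": [2, 10, 25, 50]}
METRICS = ("accuracy", "precision", "recall")


def nested_cv(X, y, n_outer=10, n_inner=10, seed=0):
    """Return {metric: (mean, 95% CI half-width)} across the outer folds."""
    inner = StratifiedKFold(n_splits=n_inner, shuffle=True, random_state=seed)
    outer = StratifiedKFold(n_splits=n_outer, shuffle=True, random_state=seed + 1)
    # Inner loop: the grid search only sees the model-building part of each outer split.
    search = GridSearchCV(DecisionTreeClassifier(random_state=seed), PARAM_GRID,
                          cv=inner, scoring="accuracy")
    # Outer loop: every held-out fold scores a model tuned on the remaining folds.
    scores = cross_validate(search, X, y, cv=outer, scoring=list(METRICS))
    summary = {}
    for metric in METRICS:
        fold_scores = scores[f"test_{metric}"]
        t_value = stats.t.ppf(0.975, df=len(fold_scores) - 1)
        half_width = stats.sem(fold_scores) * t_value
        summary[metric] = (float(np.mean(fold_scores)), float(half_width))
    return summary


# Synthetic stand-in for the tabular data (24 features, binary class).
X, y = make_classification(n_samples=400, n_features=24, n_informative=8,
                           random_state=0)
# Fewer folds than the 10x10 design, only to keep this demonstration fast.
results = nested_cv(X, y, n_outer=5, n_inner=3)
for metric, (mean, half_width) in results.items():
    print(f"{metric}: {mean:.3f} ± {half_width:.3f}")
```

Wrapping the grid search inside the outer loop keeps the test fold out of hyper-parameter selection, which is what makes the outer scores usable as performance estimates.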
Table 4
Values of the hyper-parameters used in our optimization by grid search. C is the regularization parameter and it affects the margin of the hyperplane (a higher C leads to a smaller margin). Gamma allows data points further from the hyperplane to be taken into account (a higher gamma accounts for points closer to the hyperplane). The logistic regression has no parameter hence it is not subject to optimization, while the neural network uses a different optimization process
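| Machine learning approach | Kernel | Hyper-parameters | Values searched for optimization |
|---|---|---|---|
| Decision tree | | Maximum depth | 3, 5, 7 |
| Decision tree | | Minimum number of samples to split | 2, 10, 25, 50 |
| Support vector machine | Linear | C | 0.01, 0.05, 0.1, 0.5 |
| Support vector machine | Linear | Class weight | none, balanced |
| Support vector machine | Radial basis function | Gamma | auto, scale |
| Support vector machine | Radial basis function | Shrinking | True, False |
| Support vector machine | Sigmoid | Class weight | none, balanced |
| Support vector machine | Sigmoid | Coeff0 | 0.0, 0.05, 0.1 |
| Support vector machine | Polynomial | Degree | 2, 3, 4 |
| Support vector machine | Polynomial | Shrinking | True, False |
| Support vector machine | Polynomial | Class weight | none, balanced |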
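
As an illustration, a kernel-dependent grid like the one in Table 4 can be written in scikit-learn as a list of dictionaries, one per kernel. The mapping below is an assumption: it includes only the hyper-parameters listed explicitly under each kernel, and it uses scikit-learn's parameter names (`coef0`, `class_weight`), which may differ from the authors' scripts.

```python
from sklearn.model_selection import ParameterGrid
from sklearn.svm import SVC

# One dictionary per kernel, so each kernel only varies its own hyper-parameters.
SVM_GRID = [
    {"kernel": ["linear"], "C": [0.01, 0.05, 0.1, 0.5],
     "class_weight": [None, "balanced"]},
    {"kernel": ["rbf"], "gamma": ["auto", "scale"], "shrinking": [True, False]},
    {"kernel": ["sigmoid"], "class_weight": [None, "balanced"],
     "coef0": [0.0, 0.05, 0.1]},
    {"kernel": ["poly"], "degree": [2, 3, 4], "shrinking": [True, False],
     "class_weight": [None, "balanced"]},
]

# Number of configurations a grid search would have to train, per kernel and overall.
combos_per_kernel = {grid["kernel"][0]: len(ParameterGrid(grid)) for grid in SVM_GRID}
total_combos = len(ParameterGrid(SVM_GRID))

# Sanity check: every configuration is accepted by the SVC constructor.
models = [SVC(**params) for params in ParameterGrid(SVM_GRID)]

print(combos_per_kernel, total_combos)
```

Keeping the number of levels small matters because every configuration is retrained in each inner fold of the nested cross-validation.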

4 Results

The complete results, available in our shared online repository, show insufficient performance for Support Vector Machines with either a sigmoid kernel (precision and recall lower than 50%) or a polynomial kernel (recall at most 58.74% and accuracy at most 65.95%). While the neural network had sufficient accuracy (78.38%) and the best precision (80.55%), this came at the expense of a low recall (68.73%). A similar situation was encountered for the logistic regression, with decent accuracy (76%) and precision (78%) but insufficient recall (59%). The random forest performed as well as the decision tree, as detailed in the complete results on our repository. We thus focus on the satisfactory performances produced by decision trees and Support Vector Machines with linear or RBF kernels (Table 5). Each of these approaches had its highest score for recall, which intuitively means that the models are at their best when retrieving popular stories: few stories that are actually popular are missed. The SVM with RBF kernel had a commendable recall but underperformed the other two options by a wide margin on precision and accuracy. The best performances are obtained with decision trees, which score approximately 80% on all three metrics. That is, decision trees were right about four times out of five.
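As a reminder of what the three metrics measure, the sketch below computes them from the four cells of a confusion matrix, with "popular" as the positive class. The counts are hypothetical and chosen only to mirror the two patterns discussed above (balanced scores around 80%, and a high recall paired with a lower precision); they are not taken from the study.

```python
def classification_metrics(tp, fp, fn, tn):
    """Accuracy, precision, and recall for the positive ('popular') class."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)  # of the stories predicted popular, how many are
    recall = tp / (tp + fn)     # of the popular stories, how many are found
    return accuracy, precision, recall

# Hypothetical counts for 200 stories (100 popular, 100 unpopular).
balanced = classification_metrics(tp=80, fp=20, fn=20, tn=80)
high_recall = classification_metrics(tp=90, fp=50, fn=10, tn=50)
print(balanced)     # (0.8, 0.8, 0.8)
print(high_recall)  # accuracy 0.7, precision about 0.64, recall 0.9
```

The second case shows how a model can find nine out of ten popular stories while still labeling many unpopular stories as popular.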
While the highest score on each metric may be obtained with different hyper-parameter values, deployment requires a single model with a single set of hyper-parameter values. However, it can be arduous to find values that provide the best performance on all metrics of interest. For example, the hyper-parameter values yielding the four best performances for decision trees on recall also produced the four worst performances on precision. We recommend a decision tree with a maximum depth of 7 and a minimum number of samples to split of 2, as it yields the best accuracy (79.51 ± 0.4), the best precision (79.02 ± 1.1), and an average recall (80.44 ± 1.2, compared with a minimum of 77.36 and a maximum of 83.36). These hyper-parameter values favor a tree that is able to make more cuts to isolate samples, through both its large depth and its low threshold for making a cut.
Table 5
For the three best-performing machine learning approaches, we report the top 3 performances with regard to each metric. Note that the top two performances for the RBF kernel had the same scores and parameters apart from shrinking. Results were averaged across the 10 outer folds

| ML approach | Accuracy: score | Accuracy: params | Precision: score | Precision: params | Recall: score | Recall: params |
|---|---|---|---|---|---|---|
| Decision tree | 79.51 ± 0.4 | Depth 7, sample 2 | 79.02 ± 1.1 | Depth 7, sample 2 | 83.36 ± 1.2 | Depth 5, any # of samples |
| | 79.50 ± 0.4 | Depth 7, sample 10 | | Depth 7, sample 25 | | |
| | 79.49 ± 0.4 | Depth 7, sample 25 | 79.01 ± 1.1 | Depth 7, sample 10 | | |
| Linear SVM | 76.92 ± 0.6 | C 0.01, no weight | 75.48 ± 0.7 | C 0.01, no weight | 83.59 ± 1.4 | C 0.5, no weight |
| | 76.79 ± 0.7 | C 0.01, balanced | 75.22 ± 0.8 | C 0.01, balanced | 83.59 ± 1.7 | C 0.5, balanced |
| | 76.32 ± 0.8 | C 0.05, balanced | 73.41 ± 1.1 | C 0.05, balanced | 83.41 ± 1.0 | C 0.1, no weight |
| Nonlinear SVM (RBF) | 67.10 ± 0.7 | Balanced, scaled, (no) shrinking | 65.53 ± 0.6 | Balanced, scaled, (no) shrinking | 91.12 ± 0.7 | No weight/scale, (no) shrinking |
| | 67.09 ± 0.7 | No weight, scaled, shrinking | 65.50 ± 0.6 | No weight, scaled, shrinking | 91.11 ± 0.7 | No weight/scale, no shrinking |
We further investigated the results obtained by our best decision tree using SHAP (SHapley Additive exPlanations) to reveal how features were related to the prediction outcomes. SHAP is a widely used tool to explain machine learning models by deconstructing their predictions into the contributions of individual features, as exemplified by recent studies using SHAP on decision trees (Rodrigo et al. 2021), including boosted trees (Nohara et al. 2022) or ensembles (Campbell et al. 2022). SHAP supports local interpretability because it helps us understand individual predictions rather than how the model works as a whole (global interpretability). The feature importance plot in Fig. 7 shows that the number of reviews and the number of unique words are strong predictive variables, followed by stories centered on romance and comfort (as extracted from the stories’ metadata created by the authors). The states of the two main characters, as captured by their emotions, were not significant predictors. The direction of the effect (i.e., whether values helped to predict popular or unpopular stories) is shown in Fig. 8. Although several features have a clear direction of effect, it is important to be mindful of the magnitude of this effect. For instance, readers appear to enjoy seeing the character of Sam express frustration or sadness, but either phenomenon is relatively rare within the sample.
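The idea of deconstructing one prediction into per-feature contributions can be illustrated with the brute-force definition of Shapley values, which averages the marginal contribution of each feature over all orderings. This is a sketch of the underlying concept only: it is not the algorithm used by the SHAP library for trees, and the scoring rule, weights, and feature names below are made up for illustration.

```python
from itertools import permutations

def shapley_values(predict, instance, baseline):
    """Exact Shapley values: average marginal contribution over all orderings."""
    names = list(instance)
    totals = {name: 0.0 for name in names}
    orderings = list(permutations(names))
    for order in orderings:
        current = dict(baseline)
        previous = predict(current)
        for name in order:
            # Switch one feature from its baseline value to the instance value.
            current[name] = instance[name]
            value = predict(current)
            totals[name] += value - previous
            previous = value
    return {name: total / len(orderings) for name, total in totals.items()}

# Hypothetical scoring rule standing in for a trained model.
def toy_model(story):
    score = 0.5 * story["many_reviews"] + 0.2 * story["wide_vocabulary"]
    if story["many_reviews"] and story["romance"]:
        score += 0.2  # interaction between two features
    return score

story = {"many_reviews": 1, "wide_vocabulary": 1, "romance": 1}
baseline = {"many_reviews": 0, "wide_vocabulary": 0, "romance": 0}
phi = shapley_values(toy_model, story, baseline)
print({name: round(value, 3) for name, value in phi.items()})
# {'many_reviews': 0.6, 'wide_vocabulary': 0.2, 'romance': 0.1}
```

The contributions sum to the difference between the prediction for the story and the prediction for the baseline (0.9 here), and the interaction bonus is shared equally between the two features involved.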
As shown by the SHAP values (Figs. 7 and 8), the importance of the number of reviews tells us that we cannot focus exclusively on the content of a story to know whether it will be popular: we also need to know whether it attracts attention, as measured by the number of reviews. By removing this feature, we exclude popularity metrics and focus on the intrinsic information contained in a story. As summarized in Table 6, removing the number of reviews can noticeably impact performances, depending on the type of machine learning algorithm. Deep neural networks, logistic regressors, decision trees, and linear support vector machines experienced a double-digit performance loss on at least one metric (on two or more metrics for all but the linear support vector machine, whose recall was almost unchanged). The support vector machine with a nonlinear RBF kernel had a more moderate loss and continues to excel in terms of recall. Performances improve for models that initially had low performances (sigmoid or poly kernels), but they remain lower than the alternatives. This confirms that some aspects of a story are predictive of its popularity, and that knowing other popularity measures provides even greater predictive ability.
Table 6
We optimized the models after removing the number of reviews. This loss of a key predictive feature generally translates to a loss (\(\triangledown\)) in performance. For models whose performances were already low, the removal may have produced a gain (\(\triangle\)), but the performances are still low

| Model | Without the number of reviews: accuracy | Precision | Recall | Difference from the model with the number of reviews: accuracy | Precision | Recall |
|---|---|---|---|---|---|---|
| Deep neural network | 67.66 | 63.72 | 65.94 | \(\triangledown\)10.71 | \(\triangledown\)16.82 | \(\triangledown\)2.78 |
| Logistic regressor | 68 | 66 | 45 | \(\triangledown\)8 | \(\triangledown\)12 | \(\triangledown\)14 |
| Decision tree | 68.01 | 66.84 | 71.7 | \(\triangledown\)11.5 | \(\triangledown\)12.18 | \(\triangledown\)11.66 |
| Linear SVM | 68.61 | 64.37 | 83.53 | \(\triangledown\)8.31 | \(\triangledown\)11.11 | \(\triangledown\)0.06 |
| Nonlinear (RBF) SVM | 63.59 | 59.2 | 87.69 | \(\triangledown\)3.51 | \(\triangledown\)6.33 | \(\triangledown\)3.43 |
| Nonlinear (sigmoid) SVM | 58.79 | 58.83 | 58.82 | \(\triangle\)8.86 | \(\triangle\)8.87 | \(\triangle\)9 |
| Nonlinear (poly) SVM | 65.65 | 68.04 | 59.13 | \(\triangledown\)0.29 | \(\triangledown\)0.62 | \(\triangle\)0.39 |
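The arithmetic behind Table 6 can be checked with a few lines of Python. The sketch below transcribes the table, with losses written as negative changes and gains as positive ones, then recovers the scores of the models that had access to the number of reviews and counts the double-digit losses per model. The helper names are illustrative.

```python
# Per model: (accuracy, precision, recall) without the number of reviews, and
# the signed change relative to the model with it (negative = loss), from Table 6.
table6 = {
    "Deep neural network": ((67.66, 63.72, 65.94), (-10.71, -16.82, -2.78)),
    "Logistic regressor": ((68.0, 66.0, 45.0), (-8.0, -12.0, -14.0)),
    "Decision tree": ((68.01, 66.84, 71.7), (-11.5, -12.18, -11.66)),
    "Linear SVM": ((68.61, 64.37, 83.53), (-8.31, -11.11, -0.06)),
    "Nonlinear (RBF) SVM": ((63.59, 59.2, 87.69), (-3.51, -6.33, -3.43)),
    "Nonlinear (sigmoid) SVM": ((58.79, 58.83, 58.82), (8.86, 8.87, 9.0)),
    "Nonlinear (poly) SVM": ((65.65, 68.04, 59.13), (-0.29, -0.62, 0.39)),
}

def with_reviews(model):
    """Scores of the model that still had the number of reviews."""
    without, change = table6[model]
    return tuple(round(w - c, 2) for w, c in zip(without, change))

def double_digit_losses(model):
    """Number of metrics that dropped by 10 points or more."""
    _, change = table6[model]
    return sum(1 for c in change if c <= -10)

print(with_reviews("Decision tree"))  # (79.51, 79.02, 83.36)
print({model: double_digit_losses(model) for model in table6})
```

For the decision tree, the recovered values match the best scores per metric reported in Table 5.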

5 Discussion

The growing volume of online fanfiction has been the subject of numerous studies, either from the perspective of text mining by using Natural Language Processing or through a qualitative lens via a manual examination (Table 1). We contribute to these efforts by using classifiers to determine the popularity of fanfiction stories regarding the show Supernatural, chosen for the large available corpus size as well as extensive scholarship, ranging from articles such as Åström (2010), Flegel and Roth (2010), Tosenberger (2008), Herbig and Herrmann (2016), Zubernis (2021) to the thesis of Guirola (2023) and edited volumes by Taylor and Nylander (2019), Macklem et al. (2020), and Wilkinson (2013). We show that it is possible to correctly predict whether a story is popular four times out of five based on high-level features.
By using local interpretability techniques for Machine Learning (i.e., SHAP), we were able to relate specific features to the popularity of stories (Sect. 4). Our findings can be summarized in the following three takeaways. First, a large number of reviews is indicative of a story getting attention. This suggests similarities in taste among the readership, as popular stories amass many more reviews. Second, fans tend to like longer stories and also those that have a wider vocabulary. We posit that this is not merely about wanting ‘more’, but an indicator of overall writing quality and effort from the writer. Third, readers enjoy romantic stories and have mixed feelings when characters get hurt (which is a frequent occurrence on the TV show). This is noteworthy because the original show Supernatural may be categorized as action, adventure, drama, fantasy, horror, or mystery, but not as a romance. Other scholars have noted that “many fans shipped the main, male characters together” despite the initially heteronormative lens of the TV show, hence “showrunners supported that interpretation by incorporating seemingly romantic glances” that were occasionally perceived as queerbaiting (Church 2023). Our large-scale analysis thus confirms the use of fanfiction to complement the show by venturing into themes that it did not extensively cover.
There are several limitations to our study. First, despite a sizable corpus of tens of thousands of stories, we do not have an extensive sample of every writing style, hence we refrain from drawing conclusions on patterns that are visible but only appear in a few stories. For example, we observed that humor was always a success, as unpopular stories were low in humor whereas popular stories had a more marked amount of humor. The type of humor did not seem to have an impact since the distribution of effects is very one-sided. However, these effects were based on a small sample size, hence they cannot be broadly generalized to all fanfiction. Although our study was based on the most studied TV show in fanfiction, we also note that our findings do not automatically generalize to other TV shows or to fanfiction in general. Additionally, similar results may have been obtained at a lower computing cost, as our training process was intended to be thorough out of an abundance of caution. For instance, a nested cross-validation has been characterized as ‘overzealous’ at times, as its high computing cost may only provide a minor improvement in model estimates compared to the optimization procedures used by AutoML or Auto-Sklearn (Wainer and Cawley 2021). In a similar way, some of the Support Vector Machine kernels (particularly the polynomial) could have been omitted, and they were only included to be in line with prior works on text classification. Finally, we performed a binary classification in order to have a clear notion of popularity and balance the data, but popularity is a continuum rather than a dichotomy. An alternative would be to either perform a multiclass classification with multiple levels of popularity, or to treat popularity as a continuous attribute and opt for a regression.
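The difference between a dichotomy and multiple levels of popularity can be sketched as follows. This is a hypothetical illustration only: the median threshold, the tercile cut points, and the made-up scores are assumptions, not the labeling procedure or data of the study.

```python
import statistics

def binary_labels(scores):
    """1 = popular if strictly above the median, else 0 (assumed threshold)."""
    median = statistics.median(scores)
    return [1 if s > median else 0 for s in scores]

def multiclass_labels(scores, levels=3):
    """Assign each story a popularity level using quantile cut points."""
    cuts = statistics.quantiles(scores, n=levels)
    return [sum(s > c for c in cuts) for s in scores]

# Made-up popularity scores (e.g. counts of kudos or favs).
popularity = [0, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89, 144]
two_classes = binary_labels(popularity)
three_classes = multiclass_labels(popularity)
print(two_classes)
print(three_classes)
```

Both labelings keep the classes balanced on this toy input, while a regression would use the raw scores directly as the target.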
Several years ago, fanfiction scholars already anticipated that AI chatbots would be involved alongside humans in the writing process (Lamerichs 2018). GPT-4 and other pre-trained large-scale language models (LLMs) have made it a reality, resulting in a rapid acceleration of AI as (co)writers for fanfictions (Rosenberg 2023). Our findings are particularly informative in this new environment by showing which features lead to popular stories, which may facilitate the (semi)automatic generation of such stories. For example, knowing that fans prefer longer stories with a wider vocabulary is helpful when prompting GPT-4 or other AI assistants. We note that these assistants are best at replicating what they have already seen, but our analysis shows that fans prefer stories to touch on themes that were not necessarily prevalent in the show. This information may lead to engineering prompts that explicitly ask to incorporate features (e.g., romance) that would not otherwise be generated solely by relying on the training data.
It is possible that a classifier leveraging LLMs yields higher performance measures (e.g., accuracy, precision, recall). This would be a different approach, as the entire story would be encoded (e.g., with word2vec or TF-IDF) instead of extracting specific features with a known meaning. Our goal was to transparently relate features to the popularity of stories, rather than maximize performance measures. A complementary study using the latest deep learning approaches such as DeBERTaV3 (He et al. 2021) would thus be a useful follow-up. Such methods based on Deep Neural Networks provide the state-of-the-art when the objective is to maximize a performance measure for natural language processing (Suissa et al. 2022). Since pre-trained models have been exposed to different datasets, they can encode different knowledge models and hence perform differently in a given application. A study using deep learning would thus have to employ several techniques, in the same manner as we used and optimized different algorithms. In addition, deep learning models may need fine-tuning to ensure that they are adapted to the application context, which can require human-annotated data as shown in our prior applied study with BERT (Galgoczy et al. 2022).
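To illustrate what encoding an entire story looks like, as opposed to extracting a few interpretable features, here is a minimal TF-IDF sketch using only the standard library. It uses the plain tf × log(N / df) weighting with whitespace tokenization; other variants (smoothing, normalization) exist, and the three one-line "stories" are made up.

```python
import math
from collections import Counter

def tf_idf(documents):
    """Return one {term: weight} dict per document, using tf * log(N / df)."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors

# Made-up inputs; real input would be the full text of each story.
stories = [
    "sam hunts a ghost",
    "dean hunts a demon",
    "sam and dean share a quiet dinner",
]
vectors = tf_idf(stories)
print(vectors[0])
```

A term that appears in every story (here "a") receives a weight of zero, while a term unique to one story receives the highest weight. Each dimension is a word rather than a feature with a known meaning, which is the trade-off in interpretability noted above.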

Acknowledgements

We thank Adrian Bozdog and Johnathan Uptegraph for their assistance with data gathering and feature extraction. We are indebted to Dr. Arthur Carvalho for providing computing power to enable a large-scale feature extraction. We appreciate the continued support of Dr. Jens Mueller, as the Miami Redhawk cluster was instrumental in running our machine learning analysis.

Declarations

Conflict of interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Informed consent is not applicable as no human or animal subjects were involved in this study.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Footnotes
1
The Terms of Service (TOS) for AO3 state “Using bots or scraping is not against our Terms of Service unless it relates to our guidelines against spam or other activities” (Archive of Our Own 2023b). However, due to the growth of scraping from Generative AI bots, the website has “put in place certain technical measures to hinder large-scale data scraping on AO3, such as rate limiting” (Archive of Our Own 2023a). We thus follow the TOS by staying within these limitations. For Fanfiction.net, the TOS states (emphasis added): “You agree not to use or launch any automated system, including without limitation, ‘robots,’ ‘spiders,’ or ‘offline readers,’ that accesses the Service in a manner that sends more request messages to the FanFiction.Net servers in a given period of time than a human can reasonably produce in the same period by using a conventional on-line web browser” (FanFiction 2019). Using Selenium mimics the behavior of a human relying on a browser and jumping between stories. We never exceed the load associated with a human per period of time, hence our data collection was performed over several days.
 
2
The rating identifies the intended audience, in a manner similar to movies. For instance, www.fictionratings.com states that a M16+ rating indicates mature content that is “not suitable for children or teens below the age of 16 with non-explicit suggestive adult themes, references to some violence, or coarse language.” Fanfiction.net uses the FictionRatings system, whereas AO3 uses five categories: not rated, general audiences, teen and up audiences, explicit, mature.
 
3
A ‘kudo’ is the approach of AO3 to record that a user ‘liked’ a story. In the same manner, ‘favoriting’ (i.e., ‘favs’) a story on Fanfiction.net informs the author that the reader liked the story. These mechanisms are equivalent to clicking ‘like’ or a thumb up on other websites.
 
4
We used the TextBlob library and applied its sentiment polarity function on each sentence to determine if it leaned positive or negative.
 
5
A grid search is a standard process in classification to optimize the hyper-parameters. This ensures that the results obtained for each classification algorithm correspond to the best that can be achieved on the dataset. Consequently, results can be compared across algorithms. This is an optimization of machine learning algorithms so the name should not be interpreted to mean a search in the data directly. The term ‘grid search’ evokes an exhaustive search: if there were two parameters, we evaluate the classification algorithm for each pair of values, which visually form a grid (with one parameter as X-axis and the other as Y-axis). The name remains even if there are more than two parameters.
 
References
Zurück zum Zitat Agarwal D, Vijay D, et al. (2021) Genre classification using character networks. In: 2021 5th International Conference on Intelligent Computing and Control Systems (ICICCS). pp. 216–222. IEEE Agarwal D, Vijay D, et al. (2021) Genre classification using character networks. In: 2021 5th International Conference on Intelligent Computing and Control Systems (ICICCS). pp. 216–222. IEEE
Zurück zum Zitat Aggarwal CC, Zhai C (2012) A survey of text classification algorithms, pp. 163–222. Springer Aggarwal CC, Zhai C (2012) A survey of text classification algorithms, pp. 163–222. Springer
Zurück zum Zitat Åström B (2010) ’let’s get those winchesters pregnant’: Male pregnancy in supernatural fan fiction. Transformative works and cultures 4(1) Åström B (2010) ’let’s get those winchesters pregnant’: Male pregnancy in supernatural fan fiction. Transformative works and cultures 4(1)
Zurück zum Zitat Birkhold MH (2019) Characters Before Copyright: The Rise and Regulation of Fan Fiction in Eighteenth-Century Germany. Oxford University Press Birkhold MH (2019) Characters Before Copyright: The Rise and Regulation of Fan Fiction in Eighteenth-Century Germany. Oxford University Press
Zurück zum Zitat Black R, Alexander J, Chen V, Duarte J (2019) Representations of autism in online harry potter fanfiction. J Lit Res 51(1):30–51CrossRef Black R, Alexander J, Chen V, Duarte J (2019) Representations of autism in online harry potter fanfiction. J Lit Res 51(1):30–51CrossRef
Zurück zum Zitat Black RW (2006) Language, culture, and identity in online fanfiction. E-learn Digit Media 3(2):170–184ADSCrossRef Black RW (2006) Language, culture, and identity in online fanfiction. E-learn Digit Media 3(2):170–184ADSCrossRef
Zurück zum Zitat Budiarto A, Chairunissa R, Fitriani A (2021) Motivation behind writing fanfictions for digital authors on wattpad and twitter. Alphabet: A Biannual Acad J Lang Lit Cultural Stud 4(1): 48–53 Budiarto A, Chairunissa R, Fitriani A (2021) Motivation behind writing fanfictions for digital authors on wattpad and twitter. Alphabet: A Biannual Acad J Lang Lit Cultural Stud 4(1): 48–53
Zurück zum Zitat Campbell TW, Roder H, Georgantas RW III, Roder J (2022) Exact shapley values for local and model-true explanations of decision tree ensembles. Mach Learn Appl 9:100345 Campbell TW, Roder H, Georgantas RW III, Roder J (2022) Exact shapley values for local and model-true explanations of decision tree ensembles. Mach Learn Appl 9:100345
Zurück zum Zitat Carter L (1973) Imaginary Worlds. Ballantine Books, New York, USA Carter L (1973) Imaginary Worlds. Ballantine Books, New York, USA
Zurück zum Zitat Carter L (1976) Kingdoms of Sorcery: An Anthology of Adult Fantasy. Doubleday and Company, Garden City, New York, USA Carter L (1976) Kingdoms of Sorcery: An Anthology of Adult Fantasy. Doubleday and Company, Garden City, New York, USA
Zurück zum Zitat Cheng R, Frens J (2022) Feedback exchange and online affinity: A case study of online fanfiction writers. arXiv preprint arXiv:2209.12810 Cheng R, Frens J (2022) Feedback exchange and online affinity: A case study of online fanfiction writers. arXiv preprint arXiv:​2209.​12810
Zurück zum Zitat Church J (2023) # supercorp kissed.... or did they?: lesbian fandom and queerbaiting. J Lesbian Stud pp. 1–17 Church J (2023) # supercorp kissed.... or did they?: lesbian fandom and queerbaiting. J Lesbian Stud pp. 1–17
Zurück zum Zitat Crutzen R, Giabbanelli P (2014) Using classifiers to identify binge drinkers based on drinking motives. Substance Use Misuse 49(1–2):110–115CrossRefPubMed Crutzen R, Giabbanelli P (2014) Using classifiers to identify binge drinkers based on drinking motives. Substance Use Misuse 49(1–2):110–115CrossRefPubMed
Zurück zum Zitat Datlow E (ed) (2017) Mad Hatters and March Hares. Tor, New York, USA Datlow E (ed) (2017) Mad Hatters and March Hares. Tor, New York, USA
Zurück zum Zitat Davis R, Frens J, Sharma N, Muralikumar MD, Aragon C, Evans S (2021) Mentorship network structure: How relationships emerge online and what they mean for amateur creators. arXiv preprint arXiv:2106.14111 Davis R, Frens J, Sharma N, Muralikumar MD, Aragon C, Evans S (2021) Mentorship network structure: How relationships emerge online and what they mean for amateur creators. arXiv preprint arXiv:​2106.​14111
Zurück zum Zitat Dudzik W, Nalepa J, Kawulok M (2021) Evolving data-adaptive support vector machines for binary classification. Knowl Based Syst 227:107221CrossRef Dudzik W, Nalepa J, Kawulok M (2021) Evolving data-adaptive support vector machines for binary classification. Knowl Based Syst 227:107221CrossRef
Zurück zum Zitat Fedotova A, Romanov A, Kurtukova A, Shelupanov A (2023) Digital authorship attribution in Russian-language fanfiction and classical literature. Algorithms 16(1):13CrossRef Fedotova A, Romanov A, Kurtukova A, Shelupanov A (2023) Digital authorship attribution in Russian-language fanfiction and classical literature. Algorithms 16(1):13CrossRef
Zurück zum Zitat Fiesler C, Dym B (2020) Moving across lands: online platform migration in fandom communities. Proc ACM Human Comput Interact 4(CSCW1):1–25CrossRef Fiesler C, Dym B (2020) Moving across lands: online platform migration in fandom communities. Proc ACM Human Comput Interact 4(CSCW1):1–25CrossRef
Zurück zum Zitat Flegel, M., Roth, J.: Annihilating love and heterosexuality without women: Romance, generic difference, and queer politics in supernatural fan fiction. Transform Works Cult 4(0) (2010) Flegel, M., Roth, J.: Annihilating love and heterosexuality without women: Romance, generic difference, and queer politics in supernatural fan fiction. Transform Works Cult 4(0) (2010)
Zurück zum Zitat Floegel D (2020) Write the story you want to read”: world-queering through slash fanfiction creation. J Document Floegel D (2020) Write the story you want to read”: world-queering through slash fanfiction creation. J Document
Zurück zum Zitat Frens J, Davis R, Lee J, Zhang D, Aragon C (2018) Reviews matter: how distributed mentoring predicts lexical diversity on fanfiction. net. arXiv preprint arXiv:1809.10268 Frens J, Davis R, Lee J, Zhang D, Aragon C (2018) Reviews matter: how distributed mentoring predicts lexical diversity on fanfiction. net. arXiv preprint arXiv:​1809.​10268
Zurück zum Zitat Froelich N, Liu A, Shang R, Xiao Z, Neils T, Frens J, Aragon C (2021) Reciprocity in reviewing on fanfiction. net. In: HCI International 2021-Posters: 23rd HCI International Conference, HCII 2021, Virtual Event, July 24–29, 2021, Proceedings, Part III 23. pp. 39–44. Springer Froelich N, Liu A, Shang R, Xiao Z, Neils T, Frens J, Aragon C (2021) Reciprocity in reviewing on fanfiction. net. In: HCI International 2021-Posters: 23rd HCI International Conference, HCII 2021, Virtual Event, July 24–29, 2021, Proceedings, Part III 23. pp. 39–44. Springer
Zurück zum Zitat Galgoczy MC, Phatak A, Vinson D, Mago VK, Giabbanelli PJ (2022) (re) shaping online narratives: when bots promote the message of president trump during his first impeachment. PeerJ Comput Sci 8:e947CrossRefPubMedPubMedCentral Galgoczy MC, Phatak A, Vinson D, Mago VK, Giabbanelli PJ (2022) (re) shaping online narratives: when bots promote the message of president trump during his first impeachment. PeerJ Comput Sci 8:e947CrossRefPubMedPubMedCentral
Zurück zum Zitat Gonçalves D (2015) Popping (it) up: an exploration on popular culture and tv series supernatural. Diffractions 4:1–24 Gonçalves D (2015) Popping (it) up: an exploration on popular culture and tv series supernatural. Diffractions 4:1–24
Zurück zum Zitat Guirola CC (2023) “Fine, I’ll Write It Myself”: Rhetorical Practices of LGBTQIA+ Fandom Communities as Activism. Master’s thesis, California State University, Fresno Guirola CC (2023) “Fine, I’ll Write It Myself”: Rhetorical Practices of LGBTQIA+ Fandom Communities as Activism. Master’s thesis, California State University, Fresno
Zurück zum Zitat Han S, Seo S, Kang M, Kim J, Choi N, Song M, Choi JD (2021) Fantasycoref: Coreference resolution on fantasy literature through omniscient writer’s point of view. In: Proceedings of the Fourth Workshop on Computational Models of Reference, Anaphora and Coreference. pp. 24–35 Han S, Seo S, Kang M, Kim J, Choi N, Song M, Choi JD (2021) Fantasycoref: Coreference resolution on fantasy literature through omniscient writer’s point of view. In: Proceedings of the Fourth Workshop on Computational Models of Reference, Anaphora and Coreference. pp. 24–35
Zurück zum Zitat He P, Gao J, Chen W (2021) Debertav3: Improving deberta using electra-style pre-training with gradient-disentangled embedding sharing. arXiv preprint arXiv:2111.09543 He P, Gao J, Chen W (2021) Debertav3: Improving deberta using electra-style pre-training with gradient-disentangled embedding sharing. arXiv preprint arXiv:​2111.​09543
Zurück zum Zitat Heck DW, Seiling L, Bröder A (2020) The love of large numbers revisited: A coherence model of the popularity bias. Cognition 195:104069CrossRefPubMed Heck DW, Seiling L, Bröder A (2020) The love of large numbers revisited: A coherence model of the popularity bias. Cognition 195:104069CrossRefPubMed
Zurück zum Zitat Herbig A, Herrmann AF (2016) Polymediated narrative: the case of the supernatural episode" fan fiction". Int J Commun 10:18 Herbig A, Herrmann AF (2016) Polymediated narrative: the case of the supernatural episode" fan fiction". Int J Commun 10:18
Zurück zum Zitat Jenkins H (1992) Textual Poachers: Television Fans and Participatory Culture. Routledge Jenkins H (1992) Textual Poachers: Television Fans and Participatory Culture. Routledge
Zurück zum Zitat Jing E, DeDeo S, Ahn YY (2019) Sameness attracts, novelty disturbs, but outliers flourish in fanfiction online. arXiv preprint arXiv:1904.07741 Jing E, DeDeo S, Ahn YY (2019) Sameness attracts, novelty disturbs, but outliers flourish in fanfiction online. arXiv preprint arXiv:​1904.​07741
Zurück zum Zitat Kadhim AI (2019) Survey on supervised machine learning techniques for automatic text classification. Artif Intell Rev 52(1):273–292MathSciNetCrossRef Kadhim AI (2019) Survey on supervised machine learning techniques for automatic text classification. Artif Intell Rev 52(1):273–292MathSciNetCrossRef
Zurück zum Zitat Kalcheva N, Karova M, Penev I (2020) Comparison of the accuracy of svm kemel functions in text classification. In: 2020 International Conference on Biomedical Innovations and Applications (BIA). pp. 141–145. IEEE Kalcheva N, Karova M, Penev I (2020) Comparison of the accuracy of svm kemel functions in text classification. In: 2020 International Conference on Biomedical Innovations and Applications (BIA). pp. 141–145. IEEE
Zurück zum Zitat Kim E, Klinger R (2019) An analysis of emotion communication channels in fan fiction: towards emotional storytelling. arXiv preprint arXiv:1906.02402 Kim E, Klinger R (2019) An analysis of emotion communication channels in fan fiction: towards emotional storytelling. arXiv preprint arXiv:​1906.​02402
Zurück zum Zitat Kleindienst, N., Schmidt, T.: Investigating the transformation of original work by the online fan fiction community: A case study for supernatural. In: Digital Practices. Reading, Writing and Evaluation on the Web (November 2020), https://epub.uni-regensburg.de/50828/ Kleindienst, N., Schmidt, T.: Investigating the transformation of original work by the online fan fiction community: A case study for supernatural. In: Digital Practices. Reading, Writing and Evaluation on the Web (November 2020), https://​epub.​uni-regensburg.​de/​50828/​
Zurück zum Zitat Koltochikhina, E., Tsepkova, A.: The status and pecularities of fanfiction as a phenomenon of contemporary popular culture. Urgent Problems of Modern Society: Language, Culture and Technology in the Changing World 61 (2020) Koltochikhina, E., Tsepkova, A.: The status and pecularities of fanfiction as a phenomenon of contemporary popular culture. Urgent Problems of Modern Society: Language, Culture and Technology in the Changing World 61 (2020)
Zurück zum Zitat Kowsari K, Jafari Meimandi K, Heidarysafa M, Mendu S, Barnes L, Brown D (2019) Text classification algorithms: a survey. Information 10(4):150CrossRef Kowsari K, Jafari Meimandi K, Heidarysafa M, Mendu S, Barnes L, Brown D (2019) Text classification algorithms: a survey. Information 10(4):150CrossRef
Labatut V, Bost X (2019) Extraction and analysis of fictional character networks: a survey. ACM Comput Surv (CSUR) 52(5):1–40
Lamerichs N (2018) The next wave in participatory culture: Mixing human and nonhuman entities in creative practices and fandom. The Future of Fandom (28)
Leigh S (2020) Fan fiction as a valuable literacy practice. Transform Works Cult 34:1–4
Llewellyn A (2022) space where queer is normalized: The online world and fanfictions as heterotopias for wlw. J Homosexuality 69(13):2348–2369
Lu J (2016) Chinese historical fan fiction internet writers and internet literature. Pacific Coast Philol 51(2):159–176
Macklem L, Grace D (eds) (2020) Supernatural Out of the Box: Essays on the Metatextuality of the Series. McFarland & Company, Jefferson, North Carolina, USA
McCloskey K, Ramírez-Esparza N, Johnson BT (2022) Strange new worlds: social content in popular Star Trek fanfiction versus commercial novels. Psychol Popular Media 11(2):152
Michaud Wild N (2020) The active defense of fanfiction writing: Sherlock fans’ metatextual response. Eur J Cultural Stud 23(2):244–260
Milli S, Bamman D (2016) Beyond canonical texts: A computational analysis of fanfiction. In: Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pp. 2048–2053
Minaee S, Kalchbrenner N, Cambria E, Nikzad N, Chenaghlu M, Gao J (2021) Deep learning-based text classification: a comprehensive review. ACM Comput Surv (CSUR) 54(3):1–40
Nohara Y, Matsumoto K, Soejima H, Nakashima N (2022) Explanation of machine learning models using Shapley additive explanation and application for real data in hospital. Comput Meth Programs Biomed 214:106584
Petersen-Reed KA (2019) Fanfiction as performative criticism: Harry Potter racebending. J Creat Writ Stud 4(1):10
Pianzola F, Rebora S, Lauer G (2020) Wattpad as a resource for literary studies. Quantitative and qualitative examples of the importance of digital social reading and readers’ comments in the margins. PLoS ONE 15(1):e0226708
Pillutla VS, Tawfik AA, Giabbanelli PJ (2020) Detecting the depth and progression of learning in massive open online courses by mining discussion data. Technol Knowl Learn 25(4):881–898
Rodrigo H, Beukes EW, Andersson G, Manchaiah V (2021) Exploratory data mining techniques (decision tree models) for examining the impact of internet-based cognitive behavioral therapy for tinnitus: Machine learning approach. J Med Intern Res 23(11):e28999
Rosso N, Giabbanelli P et al (2018) Accurately inferring compliance to five major food guidelines through simplified surveys: applying data mining to the UK National Diet and Nutrition Survey. JMIR Public Health Surveillance 4(2):e9536
Rowe R, Henderson T, Wang T (2021) Text mining, Hermione Granger, and fan fiction: What’s in a name? Transformative Works and Cultures 36
Sandhu M, Vinson CD, Mago VK, Giabbanelli PJ (2019) From associations to sarcasm: mining the shift of opinions regarding the Supreme Court on Twitter. Online Social Netw Media 14:100054
Santilli N (2010) Online publishing: (anime) fan fiction and identity. J Digit Res Publish 3(1):40–47
Sauro S, Sundmark B (2019) Critically examining the use of blog-based fanfiction in the advanced language classroom. ReCALL 31(1):40–55
Schmidt T, Hoffmann J, Wolff C (2022) Analyzing character networks in crossover fan fictions of Archive of Our Own
Sourati Hassan Zadeh Z, Sabri N, Chamani H, Bahrak B (2022) Quantitative analysis of fanfictions’ popularity. Social Netw Anal Mining 12(1):42
Steinwart I (2001) On the influence of the kernel on the consistency of support vector machines. J Mach Learn Res 2(Nov):67–93
Stenger J (2021) The datafication of fandom, pp. 255–276. University of Iowa Press, Iowa City, Iowa, USA
Suissa O, Elmalech A, Zhitomirsky-Geffet M (2022) Text analysis using deep neural networks in digital humanities and information science. J Assoc Inf Sci Technol 73(2):268–287
Taylor A, Nylander S (eds) (2019) Death in Supernatural: Critical Essays. McFarland & Company, Jefferson, North Carolina, USA
Tosenberger C (2008) "The epic love story of Sam and Dean": Supernatural, queer readings, and the romance of incestuous fan fiction. Transform Works Cultures 1
Vilares D, Gómez-Rodríguez C (2019) Harry Potter and the action prediction challenge from natural language. arXiv preprint arXiv:1905.11037
Wainer J, Cawley G (2021) Nested cross-validation when selecting classifiers is overzealous for most practical applications. Expert Syst Appl 182:115222
Walls-Thumma DM (2019) Affirmational and transformational values and practices in the Tolkien fanfiction community. J Tolkien Res 8(1):6
Wanda P, Jie H (2021) Deepfriend: finding abnormal nodes in online social networks using dynamic deep learning. Soc Netw Anal Mining 11(34)
Wang CY (2019) Officially sanctioned adaptation and affective fan resistance: The transmedia convergence of the online drama Guardian in China. Series Int J TV Serial Narrat 5(2):45–58
Wilkinson J (2013) The epic love story of Supernatural and fanfic. In: Jamison A (ed.) Fic: Why Fanfiction Is Taking Over the World, pp. 309–315
Wolska M, Schröder C, Borchardt O, Stein B, Potthast M (2022) Trigger warnings: Bootstrapping a violence detector for fanfiction. arXiv preprint arXiv:2209.04409
Yang F (2022) An extraction and representation pipeline for literary characters. Proc AAAI Conf Artif Intell 36:13146–13147
Yin K, Aragon C, Evans S, Davis K (2017) Where no one has gone before: A meta-dataset of the world’s largest fanfiction repository. In: Proceedings of the 2017 CHI Conference on Human Factors in Computing Systems, pp. 6106–6110
Yoder MM, Khosla S, Shen Q, Naik A, Jin H, Muralidharan H, Rosé CP (2021) FanfictionNLP: A text processing pipeline for fanfiction. In: The 3rd Workshop on Narrative Understanding
Zubernis LS (2021) The spnfamily: Supernatural and the fandom like no other. MONSTRUM 3
Metadata
Title
Big data meets storytelling: using machine learning to predict popular fanfiction
Authors
Duy Nguyen
Stephen Zigmond
Samuel Glassco
Bach Tran
Philippe J. Giabbanelli
Publication date
1 December 2024
Publisher
Springer Vienna
Published in
Social Network Analysis and Mining / Issue 1/2024
Print ISSN: 1869-5450
Electronic ISSN: 1869-5469
DOI
https://doi.org/10.1007/s13278-024-01224-x
