1 Introduction
2 Related work
2.1 Functional data analysis and time series classification
2.2 Authorship attribution
Paper | Language | Text style | Average sample text length (words) | Number of samples | Number of classes
---|---|---|---|---|---
 | English | Federalist Papers | 900 to 3500 | 85 | 3
[33] | English | Newspaper articles | 89*** | 112 | 50
 | | | 714**** | 14 | 50
[28] | English | Incriminating digital documents | 290 | 69 | 10
[29] | Modern Greek | Digital messages | 1209 | 250 | 10
 | | Newspaper articles | 1007.5 | 400 | 20
[30] | Modern Greek | Greek Parliament | 1590 | 341 | 5
 | | Register | 2871 | 127 | 5
 | | | 1285 | 1005 | 5
[31] | German | Newspaper articles | 438 | 1200 | 2*
 | | | 480 | 550 | 2*
 | | | 357 | 3233 | 2*
[32] | English | Digital messages | 169 | 300 to 400 | 10
 | Chinese | | 807** | 300 to 400 | 10
[34] | English | Book chapters | N/A | 1960 to 2450 | 15
[35] | English | Variate types from the ad-hoc authorship attribution contest | Hundreds to thousands | 7 to 38 | 3 to 13
[36] | English | Works of Shakespeare and Fletcher | 1000 | 100 | 2
[37] | Belgian | Newspaper articles | 600 | 300 | 3
[38] | Modern Greek | Newspaper articles | 866.8 | 200 | 10
 | | | 1148.2 | 200 | 10
[39] | English | Novels written by the Brontë sisters | 1000 | 480 | 2
 | | | 500 | 942 | 2
 | | | 200 | 2232 | 2
[41] | English | Twitter, blog, review, novel, and essay | 127 to 7078 | 192 to 400 | 2*****
[42] | English | Works by Shakespeare, Christopher Marlowe, and Elizabeth Cary | N/A | 57 | 3
[43] | Persian | Books | N/A | 36 | 5
[44] | English | Books | N/A | 80 | 8
 | | | | 80 | 8
 | | | | 80 | 8
[45] | English | Books | N/A | 100 | 20
[46] | English | Books | 20,000 | 100 | 10
3 Functional language analysis
3.1 Problem statement
3.2 Feature extraction from language time series
3.3 Engineering functional language sequences
- How should the text be tokenized?
- How should the tokens be ordered?
- How should the tokens be quantified?
- Token length mapping assigns each token its length: $$\begin{aligned} Z_{\sharp}(\tau)= \vert \tau \vert \quad\text{(Sect. 4.2.1).} \end{aligned}$$ (16)
- Token frequency mapping \(Z_{\text{f}}(\tau)\), which was introduced in [60], considers the frequency of each token in the text; see Eq. (17).
- Token rank mapping was introduced in [23]. Using our notation, the mapping can be expressed as in Eq. (18).
- Token length distribution is ordered by token length \(\lambda=1,\ldots,N_{\sharp}\); see Eq. (19).
- Token rank distribution is ordered by token frequency rank \(\nu=1,\ldots,N_{\text{b}}\), with \(\tau_{[\nu]}\) being the νth most frequent token with respect to \(Z_{\text{f}}(\tau)\) applied to alphabet \(\mathcal{A}\); see Eq. (20).
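To make these mappings concrete, the following minimal Python sketch computes all five of them for a toy token sequence. All names are our own, and the frequency and rank mappings are built from the toy sequence itself, whereas in the paper they are derived from the training corpus; the sketch is purely illustrative.

```python
# Illustrative sketch of the mappings above (not the authors' code).
from collections import Counter

tokens = ["let", "me", "go", "he", "cried", "let", "me", "go"]  # toy token sequence
counts = Counter(tokens)                                         # here: counts from the toy sequence itself
ranks = {tok: r for r, (tok, _) in enumerate(counts.most_common(), start=1)}

token_length_sequence = [len(t) for t in tokens]        # Eq. (16): Z_sharp(tau) = |tau|
token_frequency_sequence = [counts[t] for t in tokens]  # token frequency mapping (cf. Eq. (17))
token_rank_sequence = [ranks[t] for t in tokens]        # token rank mapping (cf. Eq. (18))

# Fixed-length variants: one value per token length / per frequency rank
max_len = max(token_length_sequence)
token_length_distribution = [token_length_sequence.count(l) for l in range(1, max_len + 1)]  # cf. Eq. (19)
token_rank_distribution = [cnt for _, cnt in counts.most_common()]                           # cf. Eq. (20)

print(token_length_sequence)      # [3, 2, 2, 2, 5, 3, 2, 2]
print(token_length_distribution)  # [0, 5, 2, 0, 1]
```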
3.4 Mapping methods overview
Mapping method | Requires data set dependent key-value mappings | Functional language sequence has fixed length | Words need to be stemmed
---|---|---|---
Token Length Sequence | – | – | –
Token Frequency Sequence | ✓ | – | ✓
Token Rank Sequence | ✓ | – | ✓
Token Length Distribution | – | ✓ | –
Token Rank Distribution | ✓ | ✓ | ✓
4 Illustration and visualization of mapping methods
4.1 Case studies
4.1.1 The Spooky Books Data Set
The file `train.csv` has 19,579 samples and corresponding labels. The file `test.csv` contains 8392 samples but no labels. In this case study, we only used the training data set, because the class labels of the test data are unknown.

Author | Number of samples | Total number of tokens | Average sample length (tokens) | Standard deviation of sample length (tokens)
---|---|---|---|---
EAP | 7900 | 232,184 | 29.4 | 21.1
HPL | 5635 | 173,979 | 30.9 | 15.3
MWS | 6044 | 188,824 | 31.2 | 24.8
Overall | 19,579 | 594,987 | 30.4 | 20.9
4.1.2 The Federalist Papers Data Set
Author | Number of sentences | Total number of tokens | Average sample length (tokens) | Standard deviation of sample length (tokens)
---|---|---|---|---
Hamilton | 3567 | 126,059 | 35.3 | 22.6
Madison | 1195 | 43,449 | 36.4 | 23.9
Jay | 225 | 9378 | 41.7 | 21.4
Overall | 4987 | 178,886 | 35.9 | 22.9
4.2 Some examples of functional language sequences
Throughout this section, the following sentence serves as the sample text:

“‘Let me go,’ he cried; ‘monster Ugly wretch You wish to eat me and tear me to pieces.”

Note that this sentence is taken from the middle of a dialogue, which continues in the original text; the left [“] and right [”] quotation marks were only added for this quote and are not present in the following analysis.
4.2.1 Token length sequence (TLS)
The `word_tokenize` method was chosen for splitting the text samples into tokens [63]. For example, the sample text will be split into:
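As a minimal illustration (assuming NLTK is installed and its "punkt" tokenizer data has been downloaded), the token length sequence for the sample text can be obtained as follows; the exact token boundaries depend on the NLTK version.

```python
from nltk.tokenize import word_tokenize

sample = ("'Let me go,' he cried; 'monster Ugly wretch "
          "You wish to eat me and tear me to pieces.")
tokens = word_tokenize(sample)      # split the sample into tokens
tls = [len(tok) for tok in tokens]  # Token Length Sequence: one length per token
print(tls)
```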
4.2.2 Token frequency sequence (TFS)
We again used the `word_tokenize` method to split the sample, after which all tokens split from the sample were stemmed using `PorterStemmer`. We did not convert capital letters into lowercase. As an example, the sample text can be split into the following units:
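A rough sketch of this step is shown below. For brevity, the token frequencies are counted within the sample itself, whereas in the paper the frequency mapping is derived from the training data, so the numbers are purely illustrative.

```python
from collections import Counter
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

stemmer = PorterStemmer()
sample = ("'Let me go,' he cried; 'monster Ugly wretch "
          "You wish to eat me and tear me to pieces.")
# tokenize, then stem each token (note: depending on the NLTK version,
# stem() may lowercase its input internally)
stems = [stemmer.stem(tok) for tok in word_tokenize(sample)]
freq = Counter(stems)            # illustrative, sample-local frequencies
tfs = [freq[s] for s in stems]   # Token Frequency Sequence
print(tfs)
```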
4.2.3 Token rank sequence (TRS)
4.2.4 Token length distribution (TLD)
4.2.5 Token rank distribution (TRD)
We used sklearn's default word analyser (`CountVectorizer().build_analyzer()`) to split the texts, such that all words with two or more alphanumeric characters were selected from the texts; these words were then further stemmed by `PorterStemmer`. We also adjusted the `max_features` parameter of `CountVectorizer` to \(N_{\text{b}}=1000\) so that the top 1000 words with the highest number of occurrences were used as the x-axis of the functional language sequence. The vocabulary built by `CountVectorizer` from the full Spooky Books Data Set is too large to be shown here. Hence, the first 50 words (in alphabetical order) are shown below instead:
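A condensed sketch of this vectorization step is given below, assuming `train_texts` holds the training sentences; the column reordering at the end turns the alphabetically ordered counts into the rank-ordered distribution.

```python
import numpy as np
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import CountVectorizer

stemmer = PorterStemmer()
base_analyzer = CountVectorizer().build_analyzer()   # default analyser: words with >= 2 alphanumeric characters
stemming_analyzer = lambda doc: [stemmer.stem(w) for w in base_analyzer(doc)]

vectorizer = CountVectorizer(analyzer=stemming_analyzer, max_features=1000)
X = vectorizer.fit_transform(train_texts)            # counts of the 1000 most frequent stems (alphabetical columns)

order = np.argsort(-np.asarray(X.sum(axis=0)).ravel())  # sort columns by overall frequency
trd = X[:, order].toarray()                              # each row: a Token Rank Distribution (most frequent stem first)
```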
4.3 Discrimination maps
We standardized the extracted features with `StandardScaler` and used Principal Component Analysis (PCA) to reduce the dimension of the feature space to three. For each principal component, we discretized the values into 20 quantiles, such that their marginal distributions are uniform on the interval \([0,1]\) [65]. Combining the bins of the first principal component and the bins of the second principal component into a joint distribution, we computed for each of the 400 bins the percentage of samples from each author. This results in three \(20\times20\)-matrices (heat maps) of author-specific sample ratios, which were combined into a colour figure using the red layer for EAP, the green layer for MWS, and the blue layer for HPL (Fig. 4(a)). The same procedure was repeated for the first and third principal component (Fig. 4(b)), as well as the second and third principal component (Fig. 4(c)). The separation of the three primary colours demonstrates that the extracted features indeed capture differences between the three authors.

As an example, consider the following sample sentence:

“The rabble were in terror, for upon an evil tenement had fallen a red death beyond the foulest previous crime of the neighbourhood.”

The sample from HPL is located in bins that are typical for both EAP and HPL and is, therefore, coloured in shades of purple. However, the HPL example of Fig. 4(c) is located in a bin that is dominated by red, indicating that the respective sentence resembles texts from EAP. The samples from MWS have a strong resemblance to EAP and HPL and are located in reddish bins in Figs. 4(a), (c). However, a slightly stronger green shade is visible in Fig. 4(a), which identifies the sample as having an indistinguishable style.
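The construction of such a discrimination map can be sketched as follows (our reconstruction of the procedure described above, not the authors' code); `features` is the matrix of extracted time series features and `y` encodes the authors as 0 (EAP), 1 (MWS), 2 (HPL).

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import QuantileTransformer, StandardScaler

def discrimination_map(features, y, comp_a=0, comp_b=1, n_bins=20):
    # standardize and project onto the first three principal components
    pcs = PCA(n_components=3).fit_transform(StandardScaler().fit_transform(features))
    # quantile-transform each component so its marginal distribution is uniform on [0, 1]
    u = QuantileTransformer(n_quantiles=n_bins, output_distribution="uniform").fit_transform(pcs)
    bins = np.minimum((u * n_bins).astype(int), n_bins - 1)
    img = np.zeros((n_bins, n_bins, 3))            # RGB layers: red = EAP, green = MWS, blue = HPL
    for cls in (0, 1, 2):
        idx = (y == cls)
        hist, _, _ = np.histogram2d(bins[idx, comp_a], bins[idx, comp_b],
                                    bins=n_bins, range=[[0, n_bins], [0, n_bins]])
        img[:, :, cls] = hist / max(idx.sum(), 1)  # author-specific sample ratio per bin
    return img                                     # e.g. plt.imshow(discrimination_map(features, y))
```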
5 Evaluation: methodology
5.1 Evaluation procedure
The functional language sequences were arranged in a pandas `DataFrame` in a format that can be used in tsfresh's `extract_features` function. We used tsfresh's `extract_features` method to extract all comprehensive features from all functional language sequences. The `impute_function` parameter of the `extract_features` method was set to `impute`, such that missing values (`NaN`) were replaced by the median of the respective feature and infinite values (`inf`) were replaced by minimal or maximal values depending on the sign. The `default_fc_parameters` parameter was set to `ComprehensiveFCParameters`, such that 794 different time series features were extracted from each of the functional language sequences.
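A minimal sketch of this extraction step is shown below; the column names of the long-format `DataFrame` (`id`, `time`, `value`) are placeholders for whatever layout is used.

```python
from tsfresh import extract_features
from tsfresh.feature_extraction import ComprehensiveFCParameters
from tsfresh.utilities.dataframe_functions import impute

# df: long-format DataFrame with one row per (sample id, position, value) triple
X = extract_features(
    df,
    column_id="id", column_sort="time", column_value="value",
    default_fc_parameters=ComprehensiveFCParameters(),  # the full, comprehensive feature set
    impute_function=impute,                             # replace NaN by medians, +/-inf by extreme values
)
```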
5.2 Performance metric

We used sklearn's `StratifiedKFold` cross validator. The `shuffle` option of the validator was set to true, and the random state was fixed to guarantee reproducible results. For each fold, the transformers and classifiers were trained using 90% of the data set, and predictions and evaluations were done on the remaining 10% of the data.
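The corresponding scikit-learn setup might look like the following sketch, where the number of folds is assumed to be 10 (matching the 90%/10% split) and `X`, `y` are placeholders for the feature matrix and labels.

```python
from sklearn.model_selection import StratifiedKFold

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)  # shuffled folds, fixed seed
for train_idx, test_idx in cv.split(X, y):
    # train transformers and classifiers on the 90% split,
    # evaluate log loss on the remaining 10%
    pass
```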
5.3 The hybrid classifier

The `XGBClassifier` is configured via its Python API as follows:

- The objective parameter is set to `multi:softprob` in order to enable XGBoost to perform multiclass classification. Therefore, the classifier uses a softmax objective and returns predicted probabilities for all class labels.
- The random number seed is set to 0 in order to guarantee reproducibility of results.
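In code, this configuration can be sketched as follows (all other hyperparameters left at their defaults):

```python
from xgboost import XGBClassifier

clf = XGBClassifier(
    objective="multi:softprob",  # softmax objective, returns per-class probabilities
    random_state=0,              # fixed seed for reproducible results
)
```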
6 Sentence-wise authorship attribution for the Spooky Books Data Set
6.1 NLP models for the Spooky Books Data Set
6.1.1 Bag-of-words models
We used sklearn's `CountVectorizer` to transform the text samples into a matrix of token counts. Each text sample is first split into a list of words using the default word analyser of `CountVectorizer`; the stop-words (from NLTK's corpus) are excluded from the list, and each word in the list is stemmed using `PorterStemmer`. The word counts were calculated from these preprocessed words. Next, we used sklearn's `MultinomialNB` and `GridSearchCV` implementations to fine-tune the `alpha` parameter, with the `scoring` parameter being set to `neg_log_loss`. The evaluated values ranged from 0.1 to 1 with a step size of 0.1. The training of `CountVectorizer` and `MultinomialNB`, along with the parameter tuning of `MultinomialNB`, were all done on the training data set in each fold.

For the second bag-of-words model, instead of `CountVectorizer`, we used a composite weight of term frequency and inverse document frequency (TF-IDF), as implemented by `TfidfVectorizer`, for generating the feature matrix, because it improved the prediction performance significantly. We used sklearn's `SVC` implementation and wrapped it using `CalibratedClassifierCV`. Due to the high time cost of training the classifiers and making predictions, we did not perform parameter tuning. Apart from setting the random state, every other parameter was kept at its default.
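The two bag-of-words baselines can be sketched as follows; `train_texts` and `train_labels` are placeholders for the training sentences and author labels of the current fold, and the NLTK stopword and tokenizer data are assumed to be available.

```python
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from sklearn.calibration import CalibratedClassifierCV
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC

stemmer, stop = PorterStemmer(), set(stopwords.words("english"))
word_analyzer = CountVectorizer().build_analyzer()
analyzer = lambda doc: [stemmer.stem(w) for w in word_analyzer(doc) if w not in stop]

# Model 1: token counts + multinomial naive Bayes, alpha tuned via grid search on log loss
X_counts = CountVectorizer(analyzer=analyzer).fit_transform(train_texts)
mnb = GridSearchCV(MultinomialNB(),
                   param_grid={"alpha": [round(0.1 * k, 1) for k in range(1, 11)]},
                   scoring="neg_log_loss").fit(X_counts, train_labels)

# Model 2: TF-IDF weights + SVC, wrapped for probability calibration
X_tfidf = TfidfVectorizer(analyzer=analyzer).fit_transform(train_texts)
svc = CalibratedClassifierCV(SVC(random_state=0)).fit(X_tfidf, train_labels)
```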
6.1.2 N-grams models

Process | Module | Parameters | Values
---|---|---|---
Represent | Character n-grams | N-gram range | Start = (1 to 5)–End = 5
 | | Minimum term frequency | [0.05, 0.1, 0.5]
 | | Maximum term frequency | 1.0 (no limit)
Vectorize | TF-IDF vectorizer | TF | Normal, sublinear
 | | IDF | Normal, smoothed
 | | Normalization | L1, L2
 | Count vectorizer | All set to default |
Scaling | MaxAbsScaler | All set to default |
 | No scaler | N/A |
Classifier | Logistic regression | All set to default |
 | Linear SVM | All set to default |
For the word n-grams model, stemming was done using the `PorterStemmer` implementation from NLTK [63]. The start of the n-gram range was selected from 1 to 3 and the end of the range was fixed to 3. In addition, the minimum term frequency was fixed to 1, which means that all terms were used. The SVC classifier was wrapped by `CalibratedClassifierCV`, and the random states of both SVC and the probability calibration were fixed.
Process | Module | Parameters | Values
---|---|---|---
Preprocessing | Text preprocessing | Remove stopwords and stem words |
 | None | N/A |
Represent | Word n-grams | N-gram range | Start = (1 to 3)–End = 3
 | | Minimum term frequency | 1 (use all terms)
 | | Maximum term frequency | 1.0 (no limit)
Vectorize | TF-IDF vectorizer | TF | Normal, sublinear
 | | IDF | Normal, smoothed
 | | Normalization | L1, L2
 | Count vectorizer | All set to default |
Scaling | MaxAbsScaler | All set to default |
 | No scaler | N/A |
Classifier | Logistic regression | All set to default |
 | Linear SVM | All set to default |
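One cell of the character n-gram grid above can be sketched as a scikit-learn pipeline (sublinear TF, smoothed IDF, L2 normalization, MaxAbsScaler, logistic regression); the other configurations vary these choices as listed in the tables, and `train_texts`/`train_labels` are placeholders.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MaxAbsScaler

char_ngram_model = Pipeline([
    ("tfidf", TfidfVectorizer(analyzer="char", ngram_range=(1, 5),
                              min_df=0.05, max_df=1.0,
                              sublinear_tf=True, smooth_idf=True, norm="l2")),
    ("scale", MaxAbsScaler()),     # scale sparse TF-IDF features without densifying
    ("clf", LogisticRegression()),
])
# char_ngram_model.fit(train_texts, train_labels)
```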
6.2 The NLP baseline for the Spooky Books Data Set
6.3 Performance of sentence-wise authorship attribution for the Spooky Books Data Set
6.4 Stylometric features of the Spooky Books Data Set
6.4.1 Token frequency sequence: expected change between less frequent tokens
The name of the feature starts with the prefix `TFS`, which indicates that the feature has been computed from Token Frequency Sequences. The function `change_quantiles` has been used for calculating the feature. The respective algorithm can be looked up from the online documentation of tsfresh. This feature quantifies the mean (`f_agg="mean"`) absolute (`isabs=True`) difference between consecutive token frequencies, which are smaller than the 60th percentile (`qh=0.6`) and larger than the 0th percentile (`ql=0.0`). Both percentiles are computed for every TFS individually, such that the 0th percentile is equivalent to the minimum token frequency of the respective sequence. In the following descriptions, we refer to this feature as Quantile-Abs-Changes. This stylometric feature is quite interesting, because it combines a global characteristic (token frequency) with text sample specific characteristics (percentiles). A large feature value indicates that common words with an about average token frequency are likely to appear next to uncommon words (small token frequency). A small feature value indicates that words from the same token frequency range are likely to appear next to words from the same range, if the most frequent tokens are excluded from this analysis.
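For a single token frequency sequence, such a value can be reproduced with tsfresh's `change_quantiles` calculator; the parameter values below correspond to the description above, while the toy sequence is made up.

```python
import numpy as np
from tsfresh.feature_extraction.feature_calculators import change_quantiles

tfs = np.array([3.0, 1.0, 7.0, 2.0, 2.0, 9.0, 1.0, 4.0])  # toy token frequency sequence
value = change_quantiles(tfs, ql=0.0, qh=0.6, isabs=True, f_agg="mean")
print(value)  # mean absolute change between consecutive values inside the (0th, 60th) percentile corridor
```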
I saw his eyes humid also as he took both my hands in his; and sitting down near me, he said: “This is a sad deed to which you would lead me, dearest friend, and your woe must indeed be deep that could fill you with these unhappy thoughts.
6.4.2 Token rank sequence: median
But it made men dream, and so they knew enough to keep away.
But while I endured punishment and pain in their defence with the spirit of an hero, I claimed as my reward their praise and obedience.
6.4.3 Token rank distribution: Fourier coefficients
6.4.4 Other stylometric features
7 Sentence-wise authorship attribution for Hamilton’s and Madison’s papers
The papers were split into sentences using the `sent_tokenize` method from NLTK [63]. The classification was done at the sentence level, and log loss was used to evaluate the prediction performances for each fold.

7.1 NLP models for the sentence-wise authorship attribution of Hamilton's and Madison's papers
The NSC model was implemented using sklearn. To be specific, we used the `CountVectorizer` for the text vectorization and the `NearestCentroid` classifier as an equivalent to NSC. The `NearestCentroid` classifier was extended in order to provide probability estimates from discriminant scores as outlined by Tibshirani et al. [75, p. 108], while taking sample priors into account. Since discriminant scores might have extremely large values for test samples, we have modified the approach from Tibshirani et al. [75, p. 108] by performing a quantile transformation on the discriminant scores and fitting a logistic regression model on the transformed scores. The configuration of the NSC model is summarized in Table 7.
Process | Module | Parameters | Values
---|---|---|---
Preprocessing | None | N/A |
 | Raw (light preprocessing) | Remove words not appearing in all authors' writings |
 | Preprocess (heavy preprocessing) | Remove words not appearing in all authors' writings and with a relative frequency less than 0.05 percent |
Representation | Word n-grams | N-gram range | Start = 1–End = 2
 | | Minimum term frequency | 1 (use all terms)
 | | Maximum term frequency | 1.0 (no limit)
Vectorize | Count vectorizer | All set to default |
Classifier | Nearest Shrunken Centroids (NSC) | Shrink threshold | Tuned by GridSearchCV using a 10-fold cross-validation
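A loose sketch of the probability-estimation idea is given below. It uses plain distances to the shrunken centroids as scores rather than the exact discriminant scores of Tibshirani et al. [75], so it only illustrates the quantile-transformation-plus-logistic-regression step; `X_train`, `y_train`, `X_test` are placeholders for dense feature matrices and labels, and the shrink threshold is arbitrary here (the paper tunes it by grid search).

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import NearestCentroid
from sklearn.preprocessing import QuantileTransformer

nsc = NearestCentroid(shrink_threshold=0.2).fit(X_train, y_train)

def scores(X):
    # negative Euclidean distance to each class centroid: higher = closer
    # (a stand-in for the discriminant scores of [75])
    return -np.linalg.norm(X[:, None, :] - nsc.centroids_[None, :, :], axis=2)

qt = QuantileTransformer(output_distribution="uniform").fit(scores(X_train))
calibrator = LogisticRegression().fit(qt.transform(scores(X_train)), y_train)
probabilities = calibrator.predict_proba(qt.transform(scores(X_test)))
```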
7.2 NLP baseline for the sentence-wise authorship attribution problem of Hamilton’s and Madison’s papers
7.3 Performance of sentence-wise authorship attribution for Hamilton’s and Madison’s papers
- Time Series Method Alone vs NLP NSC Method Alone (Fig. 23(a)),
- NLP MNB Method Alone vs Time Series Method Alone (Fig. 23(b)),
- NLP MNB Method Alone vs NLP NSC Method Alone (Fig. 23(c)),
- NLP MNB Combined with Time Series Features vs NLP MNB Method Alone (Fig. 23(d)),
- and NLP NSC Combined with Time Series Features vs NLP NSC Method Alone (Fig. 23(e)).
7.4 Stylometric features of Hamilton’s and Madison’s papers
7.4.1 Token length sequence: mean and non-linearity
The `TLS__c3__lag_1` feature also captures some of the variance in the non-linearity of the token length sequence time series.

The `TLS__mean` feature is easy to understand. It calculates the mean token length for every token length sequence. The histograms for the distributions of the feature from both Hamilton and Madison are shown in Fig. 25. Due to the imbalance in sample sizes between the two classes, both histograms are normalized.

The `TLS__c3__lag_1` feature deploys a non-linearity measurement proposed in [77]:
$$\begin{aligned} c_{3,i} = \frac{1}{N_i - 2l} \sum_{j=1}^{N_i - 2l} z_{i,j+2l}\, z_{i,j+l}\, z_{i,j}, \end{aligned}$$
where \(c_{3,i}\) is the value of the `TLS__c3__lag_1` feature, \(N_i\) is the number of tokens in sample i, and \(z_{i,j}\) is the jth token length of sample i. With \(l = 1\), the c3 measurement is effectively calculating the expected value of the product of three consecutive points in the token length sequence. The distributions of the feature extracted from samples from both classes show that the Madison class tends to have higher c3 measurements, as shown in Fig. 27. The distributions are both normalized to compensate for the imbalance of sample sizes between the two classes and are log-transformed.
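The value can be reproduced directly with tsfresh's `c3` calculator; the sequence below is made up.

```python
import numpy as np
from tsfresh.feature_extraction.feature_calculators import c3

tls = np.array([3.0, 2.0, 5.0, 4.0, 1.0, 7.0, 2.0, 6.0])  # toy token length sequence
print(c3(tls, lag=1))  # mean of z[j+2] * z[j+1] * z[j] over the sequence
```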
7.4.2 Token length distribution: intercept and slope of linear trend
The intercept and slope of a linear least-squares regression fitted over the token length distribution are captured by the `TLD__linear_trend__attr_"intercept"` and `TLD__linear_trend__attr_"slope"` features, respectively. From the plots in Fig. 30 and Fig. 31, it is clear that the intercepts calculated from length distributions of class Madison tend to be lower, and the slopes tend to have higher values (less negative), than the corresponding values calculated from time series of class Hamilton. The differences in the average intercept and slope values of the fitted linear least-squares regression suggest that the token length distributions from the Madison class tend to be less “steep”, which further suggests that the lengths of the tokens in the sentences written by Madison tend to be less concentrated in lower values and more in higher values, compared to the ones from Hamilton's writings. This finding is also supported by the fact that Madison's writings tend to have a higher mean token length than Hamilton's writings.
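Both features correspond to an ordinary least-squares line fitted over the token length distribution, as in the following sketch (the distribution values are made up):

```python
import numpy as np
from scipy.stats import linregress

tld = np.array([4.0, 9.0, 12.0, 7.0, 5.0, 3.0, 1.0, 1.0, 0.0, 0.0])  # toy token length distribution
fit = linregress(np.arange(len(tld)), tld)
print(fit.intercept, fit.slope)  # a less negative slope indicates relatively more long tokens
```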