Introduction
“If accounting scandals no longer dominate headlines as they did when Enron and WorldCom imploded in 2001–2002, that is not because they have vanished but because they have become routine” (The Economist, Dec 13th, 2014 [1]).

Financial statement fraud (FSF), or “book cooking”, is a: “deliberate misrepresentation of financial statement data for the purpose of misleading the reader and creating a false impression of an organization’s financial strength” [2]. The deliberate misrepresentations outlined in the Accounting and Auditing Enforcement Releases (AAERs) filed by the Securities and Exchange Commission (SEC) include improper revenue recognition (the most common), manipulating expenses, capitalizing costs and overstating assets. This type of fraud causes the biggest losses: “a median loss of $1 million per case” [3]. The resultant loss of trust in capital markets and “confidence in the quality, reliability and transparency of financial information” [2] has disastrous implications for jobs, savings and investments; all can be wiped out. The financial industry’s meltdown in 2008 is a perfect example of the catastrophe that follows when investors lose trust and confidence.
Deceptive linguistic cues | The effect in text | Author | Theory/method |
---|---|---|---|
Word quantity | Could be higher or lower in deceptive text. Generally, higher quantities of verbs, nouns, modifiers and group references | Zhou [14] | Interpersonal deception theory |
Pronoun use | First person singular pronouns less frequent, greater use of third person pronouns. This is known as distancing strategies (reducing ownership of a statement) | Newman et al. [13] Zhou [14] | Interpersonal deception theory |
Emotion words | Slightly more negativity, greater emotional expressiveness | Newman et al. [13] | Leakage theory |
Markers of cognitive complexity | Fewer exclusive terms (e.g. but, except), negations (e.g. no, never) and causation words (e.g. because, effect) and motion verbs—all require a deceiver to be more specific and precise. Repetitive phrasing and less diverse language is more marked in the language of liars. Also, more mention of cognitive operations such as thinking, admitting, hoping | Newman et al. [13] Hancock et al. [12] | Reality monitoring |
Modal verbs | Verbs such as would, should and could lower the level of commitment to facts | Hancock et al. [12] | Interpersonal deception theory |
Verbal non-immediacy | “Any indication through lexical choices, syntax and phraseology of separation, non-identity, attenuation of directness, or change in the intensity of interaction between the communicator and his referents”. Results in the use of more informal, non-immediate language | Zhou [14] | Interpersonal deception theory |
Uncertainty | “Impenetrable sentence structures (syntactic ambiguity) or use of evasive and ambiguous language that introduces uncertainty (semantic ambiguity). Modifiers, modal verbs (e.g. should, could) and generalizing or “allness” terms (e.g. “everybody”) increases uncertainty” | Zhou [14] | Interpersonal deception theory |
Half-truths and equivocations | Increased inclusion of adjectives and adverbs that qualify the meaning in statements. Sentences less cohesive and coherent thereby reducing readability | McNamara et al. [18] Bloomfield [29] | Management obfuscation hypothesis |
Passive voice | Increase in use, another distancing strategy—switch subject/object around | Duran et al. [50] | Interpersonal deception theory |
Relevance manipulations | Irrelevant details | Duran et al. [50] Bloomfield [29] | Management obfuscation hypothesis |
Sense-based words | Increased use of words such as see, touch, listen | Hancock et al. [12] | Reality monitoring |
- The Coh–Metrix tool was used to extract 110 indices that measure how words are arranged and structured in discourse [18]. Together, these indices provide a more robust measure of text readability [18]. To date, a single measure such as the Gunning fog index has been the de facto standard in disclosure research for determining the readability of financial text [19].
- Multi-word expressions (bigrams and trigrams, known as n-grams) are extracted from the corpus. Both sets of n-grams pick up greater context and thereby prise out collocations and differences in their use. These linguistic features are strong markers of style and so enable the detection of any pattern differences.
- Emotionally toned words are often touted as differentiating markers of linguistic style. The huge body of opinion mining and sentiment analysis research focuses on positive/negative polarities of words to gauge intent [20]. In the financial domain, Loughran and McDonald [21] discounted the use of general-purpose dictionaries to detect sentiment in financial text, since often: “a liability is not a liability” [21] in this setting. They developed word lists for positive, negative, modal (weak and strong), passive and uncertainty words, which in their view are better suited to a financial setting. In this study, these word lists are used to obtain a frequency count of the list words present in the corpus. A word list for forward-looking words was also used. This word list is integral to the WMatrix tool (described below) that is used to interrogate financial narratives. Forward-looking statements have been examined as markers of “informativeness” in financial text [22].
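The n-gram and word-list features described above can be sketched as follows. This is a minimal Python illustration (the study itself uses the tm package in R), and the tiny word lists here are hypothetical stand-ins for the much larger Loughran and McDonald lists:

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase word tokens; a simplification of the tm package's cleaning steps."""
    return re.findall(r"[a-z']+", text.lower())

def ngrams(tokens, n):
    """Return n-grams as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Hypothetical mini word lists standing in for the Loughran-McDonald lists.
WORDLISTS = {
    "uncertainty": {"approximately", "could", "may", "uncertain"},
    "modal_weak": {"could", "may", "might", "possibly"},
}

def wordlist_counts(tokens, wordlists):
    """Frequency count of the list words present in the text."""
    bag = Counter(tokens)
    return {name: sum(bag[w] for w in wl) for name, wl in wordlists.items()}

text = "The company may be required to pay approximately one million."
toks = tokenize(text)
bigrams = Counter(ngrams(toks, 2))   # e.g. "may be", "be required"
trigrams = Counter(ngrams(toks, 3))  # e.g. "may be required"
features = wordlist_counts(toks, WORDLISTS)
```

In the study the resulting n-gram and word-list counts become the feature columns fed to the classifiers described below.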
Financial Reporting
Language and Deception
- Syntactic complexity, that is, the readability of accounting narratives, has been explored in the literature as a device used to obfuscate bad news [27, 28]. This is in line with the incomplete revelation hypothesis, which maintains that information that is difficult to extract is not impounded into share prices [29].
- Sinclair [30] maintains that language is 70 % formulaic, with less variability in its use than would be suggested by the dictum popularised by Chomsky [31] that language is the: “infinite use of finite means” [31]. This would be especially true when examining a particular genre such as financial text, where content, discourse structure and linguistic style are similar across documents. Therefore, any difference in key constructs of language, such as collocations, could be significant. In this study, multi-word expressions such as bigrams and trigrams are picked up from the corpus to aid in fraud detection.
- The tone of financial text has been investigated to help determine company intentions and to predict stock price movement [28, 32]. Tone has primarily been gauged using general-purpose dictionaries such as Diction and the Harvard Psychosociological Dictionary [33]. Loughran and McDonald [21] find that these dictionaries substantially misclassify words when determining tone in financial text. They create positive and negative word lists that are more appropriate, and also devise word lists for certainty, passive, and modal strong and weak words. As Table 1, which shows possible linguistic markers of deception, suggests, these words could aid in discriminating a fraud firm from a non-fraud firm.
Literature Review
- Obfuscating bad news through reading ease or rhetorical manipulation. The motivation is that managers make the text less clear so that information is more costly to extract and poor performance is not immediately reflected in market prices. Similarly, rhetorical language deployed through pronouns, passive voice and metaphor has been used to conceal poor firm performance. The argument is that it is not: “what firms say” but rather “how they say it” [27] that leads to obfuscation. This is known as the management obfuscation hypothesis. Most studies in this area use the Flesch–Kincaid or Gunning fog score to measure readability, or manual content analysis to pick up rhetorical language constructs [28].
- Emphasising good news through thematic manipulation. This is the “Pollyanna principle” at work, whereby managers emphasise good news and conceal bad news, resulting in greater positive overtones. In the management discussion and analysis (MDA) section of annual reports, this would manifest as: “presenting a false version of past performance, an unrealistic outlook for the future, misrepresenting the significance of key events, omitting significant facts, providing misleading information about the current health of the company” [26]. To date, the tone of financial text has been examined using manual/semi-automated content analysis techniques based on positive/negative word counts [27].
Methodology
Data and Tools
Text Mining (tm) Package in R
Caret Package in R
WMatrix-Import Web Tool
Boruta Package in R
Feature Extraction
The Classification Task
- x is the firm narratives
- y ∈ {f, nf}, the set of possible classes (fraud, non-fraud)
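The task, then, is to learn a mapping from a firm narrative x to a class y in {f, nf}. A minimal pure-Python sketch of this mapping, with hypothetical toy narratives and labels (the study itself trains the classifiers below via R's caret package; a simple bag-of-words nearest-neighbour stands in here purely to make the setup concrete):

```python
import math
from collections import Counter

# Hypothetical labelled narratives: x is the text, y is "f" (fraud) or "nf".
train = [
    ("revenue may be restated pending an internal review", "f"),
    ("auditors could not verify certain receivables", "f"),
    ("steady cash flow from operations this year", "nf"),
    ("dividends increased for the tenth consecutive year", "nf"),
]

def bow(text):
    """Bag-of-words representation of a narrative."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb)

def classify(x):
    """Predict y for narrative x: label of the most similar training narrative."""
    return max(train, key=lambda t: cosine(bow(t[0]), bow(x)))[1]
```

The real classifiers replace this nearest-neighbour rule with the decision-tree, boosting, SVM and logistic models described next, but the input/output contract is the same.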
Decision Tree Classifiers
Random Forests and C5
Stochastic Gradient Boosting (SGB)
- Computes β_m, the weight of a given classifier.
- Weights the training examples to compute the mth weak classifier h(α_m).
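These two steps can be illustrated with an AdaBoost-style round. This is a sketch of the generic boosting update (not the exact stochastic gradient boosting formula), and the toy predictions and labels are hypothetical:

```python
import math

def boosting_round(weights, preds, labels):
    """One AdaBoost-style round: compute beta_m, the weight of the current
    classifier, from its weighted error, then reweight the training examples
    so that misclassified ones count more in the next round."""
    err = sum(w for w, p, y in zip(weights, preds, labels) if p != y) / sum(weights)
    beta_m = 0.5 * math.log((1 - err) / err)  # larger beta_m for more accurate classifiers
    new_w = [w * math.exp(-beta_m if p == y else beta_m)
             for w, p, y in zip(weights, preds, labels)]
    total = sum(new_w)
    return beta_m, [w / total for w in new_w]  # renormalise to sum to 1

weights = [0.25] * 4                 # uniform starting weights
preds  = ["f", "nf", "f", "nf"]      # weak classifier's predictions
labels = ["f", "nf", "nf", "nf"]     # true labels: example 3 is misclassified
beta_m, new_weights = boosting_round(weights, preds, labels)
```

After the round, the single misclassified example carries half of the total weight, which is what forces the next weak classifier to focus on it.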
Support Vector Machine (SVM)
Boosted Logistic Regression
- If h_θ(x) ≥ 0.5, predict fraud. If h_θ(x) < 0.5, predict non-fraud.
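This decision rule can be written out directly; in the sketch below the learned weights θ and the binary feature vector are hypothetical:

```python
import math

def h_theta(x, theta):
    """Logistic hypothesis: sigmoid of the linear score theta . x."""
    z = sum(t_i * x_i for t_i, x_i in zip(theta, x))
    return 1.0 / (1.0 + math.exp(-z))

def predict(x, theta):
    """Decision rule from the text: fraud iff h_theta(x) >= 0.5."""
    return "fraud" if h_theta(x, theta) >= 0.5 else "non-fraud"

# Hypothetical learned weights and feature vectors, for illustration only.
theta = [0.8, -0.5, 1.2]
label = predict([1, 0, 1], theta)  # score z = 2.0, sigmoid well above 0.5
```

Since the sigmoid crosses 0.5 exactly when the linear score z is 0, the rule is equivalent to predicting fraud whenever θ·x ≥ 0.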
Results and Discussion
The Three Approaches
Kappa
Accuracy (ACC)
Sensitivity (True Positives)
Specificity (True Negatives)
No Information Rate (NIR)
P Value (ACC > NIR)
Pos Pred Value (PPV)
Neg Pred Value (NPV)
Balanced Accuracy
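All of these measures derive from a 2 × 2 confusion matrix with fraud as the positive class; a sketch with hypothetical counts:

```python
def metrics(tp, fn, fp, tn):
    """Evaluation measures reported in the results tables, computed from a
    2x2 confusion matrix (positive class = fraud)."""
    n = tp + fn + fp + tn
    acc = (tp + tn) / n               # accuracy (ACC)
    sens = tp / (tp + fn)             # sensitivity: true positive rate
    spec = tn / (tn + fp)             # specificity: true negative rate
    ppv = tp / (tp + fp)              # positive predictive value
    npv = tn / (tn + fn)              # negative predictive value
    nir = max(tp + fn, fp + tn) / n   # no information rate: largest class share
    # Cohen's kappa: agreement beyond what class frequencies alone would give
    p_e = ((tp + fn) * (tp + fp) + (fp + tn) * (fn + tn)) / (n * n)
    kappa = (acc - p_e) / (1 - p_e)
    return {"ACC": acc, "Sensitivity": sens, "Specificity": spec,
            "PPV": ppv, "NPV": npv, "NIR": nir, "Kappa": kappa,
            "BalancedAccuracy": (sens + spec) / 2}

m = metrics(tp=40, fn=10, fp=5, tn=45)  # hypothetical counts
```

The p value reported alongside these (ACC > NIR) is a one-sided binomial test of whether accuracy beats simply predicting the majority class, which is why it matters most in the unbalanced peer set scenario (NIR = 0.75).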
Approach One
- “CN” are density scores (occurrence per 1000 words) for different types of connectives. These are important for the: “creation of cohesive links between ideas and clauses” [18].
- “CR” takes measures related to referential cohesion, which refers to the overlap in content words between local sentences.
- “DR” measures syntactic pattern density. All these measures are density scores for grammatical constructs such as noun phrases. This can adversely impact the interpretability of text [18].
- “LS” is Latent Semantic Analysis, which provides a measure of semantic overlap between sentences.
- “PC” measures provide an: “indication of text-ease or difficulty that emerge from the linguistic characteristics of the text” [18].
- “SM” measures the strength of the mental representation evoked by the text, beyond the explicit words.
- “SY” gives measures of how syntactically heavy a sentence is, e.g. syntax is easier to process when sentences are shorter.
- “WR” gives word information measures, such as incidence scores for word classes (e.g. adjectives, verbs) and psycholinguistic word ratings. These indices are then input into the classification algorithms. Results are shown in Table 4.
Bigrams | Trigrams |
---|---|
Accounted for | An adverse effect |
Acquisition of | And sale of |
And sale | At the time |
Annual report | Company’s ability to |
Be required | During the period |
Company in | Entered into a |
Continued to | For the year |
Designed for | In the event |
Due to | May be required |
Event that | Million at December |
Experience in | Million in cash |
For fiscal | Million of cash |
Group of | Not believe that |
In and | Of our common |
In compared | Our common stock |
Into a | Primarily as a |
Legal and | Primarily due to |
Market our | Provided by financing |
Necessary to | Pursuant to the |
Of approximately | Shares of common |
Our management | The acquisition of |
Our own | The company in |
Purchase price | The company’s ability |
The acquisition | The fiscal year |
The fiscal | The impact of |
The world’s | The results of |
To conduct | The year ended |
Year ended | Use of the |
Approach Two
Coh–Metrix indices | Description |
---|---|
CNCADC | Density score of adversative/contrastive connectives |
CNCAdd | Additive connectives incidence |
CNCNeg | Negative connectives incidence |
CNCTempx | Adversative and contrastive connectives incidence |
CRFANP1 | Anaphor overlap, adjacent sentences |
CRFNO1 | Avg num.(local) sentences that have noun overlap |
CRFNOa | Noun overlap of each sentence with every other sentence |
CRFSOa | Match of nouns and contents words with common lemma between sentences |
DRGERUND | Gerund density, incidence |
DRINF | Infinitive density, incidence |
DRPVAL | Density score of agentless passive voice form |
DRVP | Verb phrase density, incidence |
LSASS1d | LSA overlap, adjacent sentences, standard deviation |
PCCNCz | Text Easability PC word concreteness, z score |
PCCONNz | Text Easability PC connectivity, z score |
PCNARz | Text Easability PC narrativity, z score |
PCVERBz | Text Easability PC verb cohesion, z score |
RDFKGL | Flesch–Kincaid grade level |
SMCAUSlsa | LSA verb overlap |
SMCAUSwn | WordNet verb overlap |
SYNLE | Mean number of words before the main verb of the main clause in sentences |
SYNSTRUTa | Sentence syntax similarity, adjacent sentences, mean |
SYNSTRUTt | Sentence syntax similarity, all combinations, across paragraphs, mean |
WRDADJ | Adjective incidence |
WRDAOAc | Age of acquisition for content words, mean |
WRDFRQa | CELEX Log frequency for all words, mean |
WRDIMGc | Imagability for content words, mean |
WRDMEAc | Meaningfulness, Colorado norms, content words, mean |
WRDVERB | Verb incidence |
Approach Three
Discussion
Model | Kappa | Sensitivity | Specificity | ACC | 95 % CI | NIR | P value [ACC > NIR] | Pos Pred value | Neg Pred value | Balanced accuracy |
---|---|---|---|---|---|---|---|---|---|---|
Coh–Metrix—peer set scenario | | | | | | | | | | |
Stochastic gradient boosting | 0.63 | 0.68 | 0.94 | 0.88 | 0.80, 0.93 | 0.75 | 0.001 | 0.80 | 0.90 | 0.81 |
Boosted classification trees | 0.42 | 0.40 | 0.96 | 0.82 | 0.73, 0.89 | 0.75 | 0.06 | 0.76 | 0.82 | 0.68 |
Support vector machines | 0.47 | 0.40 | 0.98 | 0.84 | 0.75, 0.90 | 0.75 | 0.02 | 0.90 | 0.83 | 0.69 |
C5 | 0.56 | 0.56 | 0.94 | 0.85 | 0.46, 0.94 | 0.75 | 0.01 | 0.77 | 0.86 | 0.75 |
Random forest | 0.54 | 0.74 | 0.80 | 0.77 | 0.68, 0.85 | 0.75 | 1.141e−08 | 0.79 | 0.80 | 0.77 |
Coh–Metrix—matched-pair set scenario | | | | | | | | | | |
Stochastic gradient boosting | 0.56 | 0.76 | 0.80 | 0.78 | 0.64, 0.88 | 0.5 | 4.511e−05 | 0.79 | 0.76 | 0.78 |
Boosted classification trees | 0.36 | 0.64 | 0.72 | 0.68 | 0.53, 0.80 | 0.5 | 0.007 | 0.69 | 0.66 | 0.68 |
Support vector machines | 0.68 | 1.00 | 0.68 | 0.84 | 0.70, 0.92 | 0.5 | 5.818e−07 | 0.75 | 1.00 | 0.84 |
C5 | 0.44 | 0.92 | 0.52 | 0.72 | 0.57, 0.83 | 0.5 | 0.001 | 0.65 | 0.86 | 0.72 |
Random forest | 0.68 | 0.88 | 0.80 | 0.84 | 0.70, 0.92 | 0.5 | 5.818e−07 | 0.81 | 0.86 | 0.84 |
Model | Kappa | Sensitivity | Specificity | ACC | 95 % CI | NIR | P value [ACC > NIR] | Pos Pred value | Neg Pred value | Balanced accuracy |
---|---|---|---|---|---|---|---|---|---|---|
Bigrams—peer set scenario | | | | | | | | | | |
Stochastic gradient boosting | 0.60 | 0.56 | 0.97 | 0.87 | 0.79, 0.92 | 0.75 | 0.002 | 0.87 | 0.87 | 0.76 |
Random forest | 0.58 | 0.56 | 0.96 | 0.86 | 0.77, 0.92 | 0.75 | 0.005 | 0.82 | 0.86 | 0.76 |
Support vector machines | 0.65 | 0.64 | 0.96 | 0.88 | 0.80, 0.93 | 0.75 | 0.001 | 0.84 | 0.89 | 0.80 |
Boosted logistic regression | 0.59 | 0.55 | 0.97 | 0.87 | 0.78, 0.93 | 0.77 | 0.01 | 0.84 | 0.88 | 0.76 |
C5 | 0.57 | 0.60 | 0.93 | 0.85 | 0.76, 0.91 | 0.75 | 0.01 | 0.75 | 0.87 | 0.76 |
Bigrams—matched-pair set scenario | | | | | | | | | | |
Stochastic gradient boosting | 0.52 | 0.76 | 0.76 | 0.76 | 0.61, 0.86 | 0.5 | 0.00015 | 0.76 | 0.76 | 0.76 |
Random forest | 0.52 | 0.72 | 0.80 | 0.76 | 0.61, 0.86 | 0.5 | 0.00015 | 0.78 | 0.74 | 0.76 |
Support vector machines | 0.56 | 0.76 | 0.80 | 0.78 | 0.64, 0.88 | 0.5 | 4.511e−05 | 0.76 | 0.79 | 0.78 |
Boosted logistic regression | 0.52 | 0.77 | 0.75 | 0.76 | 0.59, 0.88 | 0.5 | 0.0023 | 0.73 | 0.78 | 0.76 |
C5 | 0.40 | 0.72 | 0.68 | 0.70 | 0.55, 0.82 | 0.5 | 0.0033 | 0.69 | 0.70 | 0.70 |
Model | Kappa | Sensitivity | Specificity | ACC | 95 % CI | NIR | P value [ACC > NIR] | Pos Pred value | Neg Pred value | Balanced accuracy |
---|---|---|---|---|---|---|---|---|---|---|
Trigrams—peer set scenario | | | | | | | | | | |
Stochastic gradient boosting | 0.65 | 0.76 | 0.90 | 0.87 | 0.79, 0.92 | 0.75 | 0.002 | 0.73 | 0.92 | 0.83 |
Random forest | 0.59 | 0.60 | 0.94 | 0.86 | 0.77, 0.92 | 0.75 | 0.005 | 0.78 | 0.87 | 0.77 |
Support vector machines | 0.61 | 0.60 | 0.96 | 0.87 | 0.79, 0.92 | 0.75 | 0.002 | 0.83 | 0.87 | 0.78 |
C5 | 0.62 | 0.64 | 0.94 | 0.87 | 0.79, 0.92 | 0.75 | 0.002 | 0.80 | 0.88 | 0.79 |
Boosted logistic regression | 0.54 | 0.59 | 0.92 | 0.83 | 0.73, 0.90 | 0.74 | 0.02 | 0.72 | 0.86 | 0.75 |
Trigrams—matched-pair set scenario | | | | | | | | | | |
Stochastic gradient boosting | 0.44 | 0.72 | 0.72 | 0.72 | 0.57, 0.83 | 0.5 | 0.0013 | 0.72 | 0.72 | 0.72 |
Random forest | 0.68 | 0.96 | 0.72 | 0.84 | 0.70, 0.92 | 0.5 | 5.818e−07 | 0.77 | 0.94 | 0.84 |
Support vector machines | 0.60 | 0.88 | 0.72 | 0.80 | 0.66, 0.89 | 0.5 | 1.193e−05 | 0.75 | 0.85 | 0.80 |
C5 | 0.56 | 0.84 | 0.72 | 0.78 | 0.64, 0.88 | 0.5 | 4.511e−05 | 0.75 | 0.81 | 0.78 |
Boosted logistic regression | 0.40 | 0.85 | 0.53 | 0.69 | 0.54, 0.86 | 0.5 | 0.1045 | 0.73 | 0.70 | 0.72 |
Model | Kappa | Sensitivity | Specificity | ACC | 95 % CI | NIR | P value [ACC > NIR] | Pos Pred value | Neg Pred value | Balanced accuracy |
---|---|---|---|---|---|---|---|---|---|---|
Financial word lists—peer set scenario | | | | | | | | | | |
Stochastic gradient boosting | 0.60 | 0.56 | 0.97 | 0.87 | 0.79, 0.92 | 0.75 | 0.002 | 0.87 | 0.87 | 0.76 |
Boosted classification trees | 0.42 | 0.40 | 0.96 | 0.82 | 0.73, 0.89 | 0.75 | 0.06 | 0.76 | 0.82 | 0.68 |
Support vector machines | 0.65 | 0.64 | 0.96 | 0.88 | 0.80, 0.93 | 0.75 | 0.001 | 0.84 | 0.89 | 0.80 |
C5 | 0.57 | 0.60 | 0.93 | 0.85 | 0.76, 0.91 | 0.75 | 0.01 | 0.75 | 0.87 | 0.76 |
Boosted logistic regression | 0.59 | 0.55 | 0.97 | 0.87 | 0.78, 0.93 | 0.77 | 0.01 | 0.84 | 0.88 | 0.76 |
Financial word lists—matched-pair set scenario | | | | | | | | | | |
Stochastic gradient boosting | 0.28 | 0.72 | 0.56 | 0.64 | 0.49, 0.77 | 0.5 | 0.03 | 0.62 | 0.66 | 0.64 |
Boosted classification trees | 0.36 | 0.80 | 0.56 | 0.68 | 0.53, 0.80 | 0.5 | 0.007 | 0.64 | 0.73 | 0.68 |
Support vector machines | 0.40 | 0.76 | 0.64 | 0.70 | 0.55, 0.82 | 0.5 | 0.0033 | 0.67 | 0.72 | 0.70 |
C5 | 0.40 | 0.72 | 0.68 | 0.70 | 0.55, 0.82 | 0.5 | 0.0033 | 0.69 | 0.70 | 0.70 |
Boosted logistic regression | 0.12 | 0.60 | 0.52 | 0.56 | 0.41, 0.70 | 0.5 | 0.23 | 0.55 | 0.56 | 0.56 |