Fake or real? The computational detection of online deceptive text

  • Original Article
  • Journal of Marketing Analytics

Abstract

Online repositories give businesses the opportunity to gather feedback and opinions on products and services in the form of digital deposits. Such deposits can, in turn, influence readers’ views and behaviours when misinformation is posted with the intent to deceive or manipulate. Establishing the veracity of these digital deposits could thus bring key benefits to both online businesses and internet users. Although machine learning techniques are well established for classifying text by its content, categorising text by its veracity remains a challenge for feature set extraction and analysis. To date, text categorisation techniques for veracity have reported a wide and inconsistent range of accuracies, between 57 and 90 per cent. This article evaluates the accuracy of detecting online deceptive text using a logistic regression classifier based on part-of-speech tags extracted from a corpus of known truthful and deceptive statements. An accuracy of 72 per cent is achieved by reducing 42 extracted part-of-speech tags to a feature vector of six using principal component analysis. The results compare favourably with other studies. Improvements are anticipated from training machine learning algorithms on more complex feature vectors that combine the key features identified in this study with others from disparate feature domains.
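The pipeline outlined in the abstract (42 part-of-speech features reduced to six principal components, then a logistic regression classifier) can be sketched as follows. This is a minimal illustration and not the authors' implementation: it assumes NumPy is available, the data are synthetic Gaussian stand-ins for part-of-speech tag frequencies rather than the corpus used in the study, the helper names (`fit_pca`, `apply_pca`, `fit_logistic`, `predict_proba`) are hypothetical, and plain gradient descent stands in for whatever solver the study used. The accuracy it prints therefore says nothing about the 72 per cent reported above.

```python
import numpy as np


def fit_pca(X, n_components):
    """Standardise the columns of X and return (mean, std, components)."""
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    std[std == 0] = 1.0  # guard against constant columns
    Z = (X - mean) / std
    # Rows of Vt are the principal directions, ordered by explained variance.
    _, _, Vt = np.linalg.svd(Z, full_matrices=False)
    return mean, std, Vt[:n_components]


def apply_pca(X, mean, std, components):
    """Project X onto the principal directions fitted on the training data."""
    return ((X - mean) / std) @ components.T


def _sigmoid(z):
    # Clip to avoid overflow in exp for extreme scores.
    return 1.0 / (1.0 + np.exp(-np.clip(z, -30.0, 30.0)))


def fit_logistic(X, y, lr=0.1, n_iter=2000):
    """Unregularised logistic regression fitted by batch gradient descent."""
    Xb = np.hstack([np.ones((X.shape[0], 1)), X])  # prepend intercept column
    w = np.zeros(Xb.shape[1])
    for _ in range(n_iter):
        p = _sigmoid(Xb @ w)
        w -= lr * Xb.T @ (p - y) / len(y)
    return w


def predict_proba(X, w):
    """Probability of the positive ("deceptive") class for each row of X."""
    Xb = np.hstack([np.ones((X.shape[0], 1)), X])
    return _sigmoid(Xb @ w)


# --- Synthetic stand-in data (ASSUMPTION: not the study's corpus) ---
rng = np.random.default_rng(0)
n_docs, n_tags, n_components = 400, 42, 6
y = rng.integers(0, 2, size=n_docs)        # 1 = deceptive, 0 = truthful
X = rng.normal(size=(n_docs, n_tags))      # one column per POS tag
X[:, :5] += 2.0 * y[:, None]               # make five tags informative

# Hold out the last 100 documents for evaluation.
X_train, X_test = X[:300], X[300:]
y_train, y_test = y[:300], y[300:]

# Fit PCA on the training split only, then project both splits.
mean, std, components = fit_pca(X_train, n_components)
Z_train = apply_pca(X_train, mean, std, components)
Z_test = apply_pca(X_test, mean, std, components)

w = fit_logistic(Z_train, y_train)
predictions = (predict_proba(Z_test, w) >= 0.5).astype(int)
accuracy = float(np.mean(predictions == y_test))
print(f"held-out accuracy on synthetic data: {accuracy:.2f}")
```

Fitting the standardisation and the principal directions on the training split alone, and only then projecting the held-out documents, keeps information from the evaluation set out of the feature reduction step.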

Author information

Corresponding author

Correspondence to Leslie Ball.

Cite this article

Ball, L., Elworthy, J. Fake or real? The computational detection of online deceptive text. J Market Anal 2, 187–201 (2014). https://doi.org/10.1057/jma.2014.15

  • DOI: https://doi.org/10.1057/jma.2014.15
