
2018 | Original Paper | Book Chapter

Enhancing Decision Boundary Setting for Binary Text Classification

Written by: Aisha Rashed Albqmi, Yuefeng Li, Yue Xu

Published in: AI 2018: Advances in Artificial Intelligence

Publisher: Springer International Publishing

Abstract

Text classification is the task of assigning text documents to predefined classes using a classifier learned from training samples, which may be labelled or unlabelled. Binary text classifiers provide a way to separate related documents from a large dataset. However, existing binary text classifiers are not well grounded in practice because of overfitting: they try to find a clear boundary between relevant and irrelevant objects rather than to understand the decision boundary. Normally, the decision boundary cannot be described as a clear boundary because of the numerous uncertainties in text documents. This paper addresses the issue by proposing a model based on a sliding window (SW) technique and Support Vector Machines (SVMs) that deals with the uncertain boundary and improves the effectiveness of binary text classification. The model sets the decision boundary by dividing the training documents into three distinct regions (positive, boundary, and negative) to ensure the certainty of the knowledge extracted to describe relevant information. It then organizes the training samples for the learning task to build a classifier based on multiple SVMs. Experimental results on the standard Reuters Corpus Volume 1 (RCV1) dataset and TREC topics for text classification show that the proposed model significantly outperforms six state-of-the-art baseline models in binary text classification.
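
The abstract does not spell out how the sliding window produces the three regions, so the following is only an illustrative sketch under stated assumptions, not the authors' algorithm. It assumes each training document already has a relevance score and a 0/1 label, ranks documents by score, and slides a fixed-size window over the ranking; any document covered by a window that contains both labels (non-zero label entropy) is placed in the boundary region. The names `window_entropy`, `three_regions`, the `window` parameter, and the entropy criterion are all hypothetical choices made for this example.

```python
from math import log2


def window_entropy(labels):
    """Binary Shannon entropy of a non-empty list of 0/1 labels."""
    p = sum(labels) / len(labels)
    if p in (0.0, 1.0):
        return 0.0
    return -(p * log2(p) + (1 - p) * log2(1 - p))


def three_regions(scored_docs, window=2):
    """Split (doc_id, score, label) triples into positive, boundary, negative.

    Assumption for this sketch: a document is "uncertain" if it falls inside
    any window of the score ranking that mixes relevant (1) and irrelevant (0)
    labels. The boundary region is the contiguous span of the ranking from the
    first to the last uncertain document.
    """
    ranked = sorted(scored_docs, key=lambda d: d[1], reverse=True)
    labels = [d[2] for d in ranked]
    mixed = [False] * len(ranked)

    # Slide the window over the ranking and mark documents in mixed windows.
    for start in range(len(ranked) - window + 1):
        if window_entropy(labels[start:start + window]) > 0:
            for i in range(start, start + window):
                mixed[i] = True

    if not any(mixed):
        # No mixed window found: fall back to the labels, with an empty boundary.
        pos = [d for d in ranked if d[2] == 1]
        neg = [d for d in ranked if d[2] == 0]
        return pos, [], neg

    first = mixed.index(True)
    last = len(mixed) - 1 - mixed[::-1].index(True)
    return ranked[:first], ranked[first:last + 1], ranked[last + 1:]


# Toy data (invented for illustration): doc id, relevance score, label.
docs = [
    ("d1", 0.95, 1), ("d2", 0.90, 1), ("d3", 0.80, 1), ("d4", 0.60, 0),
    ("d5", 0.55, 1), ("d6", 0.40, 0), ("d7", 0.20, 0), ("d8", 0.10, 0),
]
positive, boundary, negative = three_regions(docs, window=2)
print([d[0] for d in positive])  # ['d1', 'd2']
print([d[0] for d in boundary])  # ['d3', 'd4', 'd5', 'd6']
print([d[0] for d in negative])  # ['d7', 'd8']
```

In a pipeline of the kind the abstract describes, the three groups could then be used to organize training samples for several SVMs; that step is left out here because the abstract gives no details about it.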

Metadata
Title
Enhancing Decision Boundary Setting for Binary Text Classification
Authors
Aisha Rashed Albqmi
Yuefeng Li
Yue Xu
Copyright Year
2018
DOI
https://doi.org/10.1007/978-3-030-03991-2_72
