Published in: Data Mining and Knowledge Discovery 3/2021

19.02.2021

Word-class embeddings for multiclass text classification

Authors: Alejandro Moreo, Andrea Esuli, Fabrizio Sebastiani


Abstract

Pre-trained word embeddings encode general word semantics and lexical regularities of natural language, and have proven useful across many NLP tasks, including word sense disambiguation, machine translation, and sentiment analysis, to name a few. In supervised tasks such as multiclass text classification (the focus of this article) it seems appealing to enhance word representations with ad-hoc embeddings that encode task-specific information. We propose (supervised) word-class embeddings (WCEs), and show that, when concatenated to (unsupervised) pre-trained word embeddings, they substantially facilitate the training of deep-learning models in multiclass classification by topic. We show empirical evidence that WCEs yield a consistent improvement in multiclass classification accuracy, using six popular neural architectures and six widely used and publicly available datasets for multiclass text classification. One further advantage of this method is that it is conceptually simple and straightforward to implement. Our code that implements WCEs is publicly available at https://github.com/AlexMoreo/word-class-embeddings.
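To make the construction concrete, the following minimal sketch (with hypothetical shapes and variable names, not code from the authors' repository) shows how a word-class embedding matrix can be concatenated to a pre-trained embedding matrix to initialize the embedding layer of a neural classifier:

```python
import numpy as np

# Hypothetical sizes: |V| = 10,000 words, 300-dimensional pre-trained embeddings,
# m = 20 classes (so WCEs add 20 task-specific dimensions per word).
pretrained = np.random.randn(10_000, 300)   # stand-in for GloVe/word2vec vectors
wce = np.random.rand(10_000, 20)            # stand-in for the supervised WCE matrix

# Each word is then represented by a 320-dimensional vector that feeds the
# embedding layer of the network.
embedding_matrix = np.concatenate([pretrained, wce], axis=1)
print(embedding_matrix.shape)               # (10000, 320)
```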


Footnotes
1
Given a set of classes (a.k.a. a codeframe) \(\mathcal {C}=\{c_{1},\ldots ,c_{m}\}\), a classification problem is said to be multiclass if \(m>2\); it is said to be single-label if each item always belongs to exactly one class; it is said to be multilabel if each item can belong to any number (i.e., 0, 1, or more than 1) of classes in \(\mathcal {C}\).
 
2
fastText can consider not only unigrams but also n-grams and subwords as the surface forms of input.
 
3
Pointwise Mutual Information (PMI) is defined as \(\mathrm {PMI}(w_{i},c_{j})=\log \frac{\Pr (w_{i},c_{j})}{\Pr (w_{i})\Pr (c_{j})}\), where \(\Pr (w_{i},c_{j})\) is the joint probability of word \(w_{i}\) and context \(c_{j}\), and \(\Pr (w_{i})\) and \(\Pr (c_{j})\) are the marginal probabilities of the word and context, respectively. PPMI takes the positive part of PMI, i.e., \(\mathrm {PPMI}(w_{i},c_{j})=\max \{0,\mathrm {PMI}(w_{i},c_{j})\}\).
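As a small self-contained illustration of this definition (not code from the paper), PPMI can be computed from a matrix of raw word–context co-occurrence counts as follows:

```python
import numpy as np

def ppmi(counts):
    """Positive PMI from a (words x contexts) matrix of raw co-occurrence counts."""
    total = counts.sum()
    p_wc = counts / total                    # joint probabilities Pr(w_i, c_j)
    p_w = p_wc.sum(axis=1, keepdims=True)    # marginals Pr(w_i)
    p_c = p_wc.sum(axis=0, keepdims=True)    # marginals Pr(c_j)
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log(p_wc / (p_w * p_c))
    pmi[~np.isfinite(pmi)] = 0.0             # zero counts contribute nothing
    return np.maximum(pmi, 0.0)              # PPMI = max{0, PMI}
```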
 
4
The compatibility between a label embedding matrix E and a document embedding h is defined to be proportional to \(\sigma (EU+b_u)\sigma (Vh+b_v)\); this contrasts with previous related literature, which customarily relied on bilinear models of the form EWh for the same purpose (\(U,b_u,V,b_v,W\) are learnable parameters).
 
5
Put another way, L1 normalization fixes a “budget” of mass 1 for the scores a term can deliver across the classes, irrespective of its prevalence in language or in the corpus.
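A minimal sketch of this normalization (variable names are illustrative; a non-negative (terms × classes) score matrix is assumed):

```python
import numpy as np

def l1_normalize_rows(S):
    """Give each term a total 'budget' of 1 to distribute across the classes."""
    totals = S.sum(axis=1, keepdims=True)
    totals[totals == 0.0] = 1.0   # leave all-zero rows untouched
    return S / totals
```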
 
6
It is worth recalling that the bag-of-words model tends to produce matrices that are highly sparse. Many software packages take advantage of this sparsity in order to compute matrix multiplication efficiently, at a cost that, in practice, falls far below the asymptotic bound \(O(|\mathcal {V}|nm)\). We discuss empirical computational complexity issues in Sect. 4.7.
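For instance, with scipy such a sparse product costs time roughly proportional to the number of non-zero entries; the shapes below are purely illustrative:

```python
import numpy as np
from scipy.sparse import random as sparse_random

X = sparse_random(10_000, 50_000, density=0.001, format="csr")  # sparse bag-of-words
Y = (np.random.rand(10_000, 100) > 0.9).astype(float)           # toy label matrix
WC = X.T @ Y   # (terms x classes); only the non-zero entries of X are visited
```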
 
7
PCA is based on (truncated) Singular Value Decomposition (SVD). The SVD of a matrix \(\mathbf {M}\) is a factorization of the form \(\mathbf {U}{\varvec{\Sigma }}\mathbf {V}^\top \), in which \(\mathbf {U}\) and \(\mathbf {V}\) are orthogonal matrices containing, respectively, the left and right singular vectors of \(\mathbf {M}\) as their columns, and \(\varvec{\Sigma }\) is a diagonal matrix containing the singular values of \(\mathbf {M}\). That is, PCA is an alternative to Eq. 7 for factoring \(\mathbf {M}\). The dimensionality reduction is achieved by sorting the components by decreasing singular value and truncating the matrices. The reduced rank-r representation of \(\mathbf {M}\) is thus given by \(\mathbf {U}_{r}{\varvec{\Sigma }}_{r}\), which accounts only for the r largest singular values and their corresponding r left singular vectors.
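In code, such a truncation can be obtained, e.g., with scikit-learn's TruncatedSVD; the matrix and the target rank r below are hypothetical:

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD

WC = np.random.rand(5_000, 200)          # stand-in for a (terms x classes) matrix
r = 50                                   # hypothetical target dimensionality
svd = TruncatedSVD(n_components=r, random_state=0)
WC_reduced = svd.fit_transform(WC)       # ~ U_r * Sigma_r: one r-dimensional row per term
```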
 
8
Note that the columns of \(\mathbf {Y}\) are binary, indicating the presence or absence of the label for each document. It is interesting to view these binary columns as indicator functions that decide which elements of each row of \(\mathbf {X}_{1}^\top \) contribute to the summation in the dot product.
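A toy numerical check of this reading (illustrative values only):

```python
import numpy as np

X1 = np.random.rand(5, 4)             # 5 training documents, 4 terms
Y = np.array([[1, 0],
              [0, 1],
              [1, 0],
              [1, 1],
              [0, 0]], dtype=float)   # 5 documents, 2 classes
WC = X1.T @ Y                         # (terms x classes)
# The class-0 column is just the sum of the rows of X1 whose documents belong to class 0.
assert np.allclose(WC[:, 0], X1[[0, 2, 3]].sum(axis=0))
```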
 
9
Since we undertake a stochastic optimization, this actually applies to batches of data.
 
11
http://qwone.com/~jason/20Newsgroups/. Note that this version of 20Newsgroups is indeed single-label: while a previous version contained a small set of documents with more than one label (corresponding to posts that had been cross-posted to more than one newsgroup), that set is not present in the version we use.
 
12
While some previous papers [e.g., Tang et al. (2015)] have reported substantially higher scores for this dataset, it is worth noting that we use a harder, more realistic version of the dataset than has been used in those papers. Following Moreo et al. (2020), in our version we remove all headers, footers, and quotes, since these fields contain words that are highly correlated with the target labels, thus making the classification task unrealistically easy; see http://scikit-learn.org/stable/datasets/twenty_newsgroups.html for further details. Our results are indeed consistent with other papers following the same policy.
 
18
Note that these deep models are not meant to serve as baselines here, but rather as vehicles on which to test WCEs. In other words, the actual baseline for any model equipped with WCEs is the same model not using WCEs.
 
20
We generate the validation set by randomly sampling 20% of the training set, with a maximum of 20,000 documents; the rest is taken to be the training set proper. We keep the training/validation split consistent across all methods.
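A sketch of this split policy with scikit-learn (the data below is a toy stand-in for the actual training set):

```python
from sklearn.model_selection import train_test_split

train_docs = [f"document {i}" for i in range(1_000)]   # toy training documents
y_train = [i % 10 for i in range(1_000)]               # toy labels

# 20% of the training documents go to validation, capped at 20,000 documents;
# a fixed random_state keeps the split consistent across methods.
val_size = min(int(0.2 * len(train_docs)), 20_000)
tr_docs, val_docs, y_tr, y_val = train_test_split(
    train_docs, y_train, test_size=val_size, random_state=42)
```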
 
21
Note that, consistent with (Cortes and Vapnik 1995; Morik et al. 1999), in this formulation we assume the class labels \(y_{k}\) to be in \(\{-1,+1\}\), while in Sect. 3 we had assumed them to be in \(\{0,1\}\); the difference is, of course, unproblematic.
 
22
In scikit-learn this is achieved by setting \(J_+=n/(mP)\) and \(J_-=n/(mN)\), and corresponds to setting the parameter class_weight to “balanced”.
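In code (a sketch; X_train and y_train are assumed to be available):

```python
from sklearn.svm import LinearSVC

# class_weight="balanced" reweights errors by n / (m * n_c), i.e., J+ = n/(mP)
# and J- = n/(mN) in each binary (one-vs-rest) problem.
svm = LinearSVC(C=1.0, class_weight="balanced")
# svm.fit(X_train, y_train)
```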
 
23
Somewhat surprisingly, though, several relevant related works where SVMs are used as baselines [see, e.g., (Grave et al. 2017; Jiang et al. 2018; Zhang et al. 2015)] do not report the details of how, if at all, they tune the SVM hyperparameters.
 
24
Using k-fold cross-validation (k-FCV) on the full set of labelled documents is a more expensive, but stronger, way of doing parameter optimization than using a single split between a training set and a validation set, because k-FCV performs k such splits. We here use k-FCV for SVMs and single-split optimization for the deep-learning-based architectures because it is realistic to do so, i.e., because SVMs are computationally cheap enough that we can afford k-FCV, while the neural architectures are not.
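A hedged sketch of k-FCV hyperparameter optimization for the SVM with scikit-learn; the C grid and the scoring function are illustrative, not necessarily those used in the paper:

```python
from sklearn.model_selection import GridSearchCV
from sklearn.svm import LinearSVC

grid = GridSearchCV(LinearSVC(class_weight="balanced"),
                    param_grid={"C": [0.001, 0.01, 0.1, 1, 10, 100]},
                    cv=5, scoring="f1_macro", n_jobs=-1)
# grid.fit(X_train, y_train)  # refits on the whole labelled set with the best C
```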
 
26
The STW functions we have considered include chi-square, information gain, gain ratio, pointwise mutual information (Debole and Sebastiani 2003), ConfWeight (Soucy and Mineau 2005), and relevance frequency (Lan et al. 2009).
 
27
Given a word w, a codeframe \(\mathcal {C}=\{c_{1},\ldots ,c_m\}\), and an STW function f that generates a list of scores \(S=(f(w,c_{1}),\ldots ,f(w,c_m))\), we consider the following aggregation functions: averaging \(\left( \frac{1}{m}\sum _{c\in \mathcal {C}}f(w,c)\right) \), averaging weighted by class prevalence \(\left( \frac{\sum _{c\in \mathcal {C}}f(w,c)p(c)}{\sum _{c\in \mathcal {C}}p(c)}\right) \) where p(c) is the prevalence of class c, and max-pooling \(\left( \max _{c\in \mathcal {C}}\{f(w,c)\}\right) \).
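These three aggregation policies can be written compactly as follows (an illustrative helper, not code from the paper):

```python
import numpy as np

def aggregate(scores, prevalences, how="mean"):
    """Collapse the per-class scores (f(w,c_1),...,f(w,c_m)) of a word w into one value."""
    if how == "mean":          # plain averaging
        return float(scores.mean())
    if how == "weighted":      # averaging weighted by class prevalence p(c)
        return float((scores * prevalences).sum() / prevalences.sum())
    if how == "max":           # max-pooling
        return float(scores.max())
    raise ValueError(f"unknown aggregation: {how}")
```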
 
31
Note that by fastText we here mean its “supervised” mode, that is, fastText as a classifier. The set of embeddings that fastText produces when working in “unsupervised mode” are later used and discussed in Sect. 4.9, along with other sets of embeddings.
 
33
We modified the official implementation of https://github.com/guoyinwang/LEAM to use early stopping.
 
34
Though most traditional functions used for feature selection can only use presence/absence, other metrics exist that work with weighted scores, e.g., the Fisher score. In initial experiments not described in this paper we did try the Fisher score, but eventually gave up on it, because (a) its computation is very slow, and (b) the classification accuracy we observed is not much different from what can be obtained with the other functions mentioned above, and is often intermediate between the best and the worst recorded values.
 
37
We used the Hugging Face implementation available at https://github.com/huggingface/transformers.
 
39
More often than not, BERT is used by fine-tuning the entire model to the task at hand. In this set of experiments we prefer to reproduce a simpler scenario, in which the practitioner simply uses pre-trained models as made available by the developers of BERT. Fine-tuning models such as BERT requires a considerable amount of computational power, which might not be within everyone's reach. Experiments showcasing how a properly fine-tuned BERT works (with and without WCEs) on our datasets are illustrated in Sect. 4.4.
 
41
Another technique for solving this problem is Latent Semantic Imputation (Yao et al. 2019). This method fills in the missing representation of a word in a vector space (in our case, the space of WCEs) by analyzing the neighborhood of that word's representation in another vector space (in our case, the space of unsupervised embeddings), via techniques inspired by manifold learning.
 
42
The fact that WCEs are not suitable for codeframes containing just a few classes is the reason why all the datasets we have chosen for our experiments are for classification by topic (CBT). While WCEs are not inherently about CBT, it is a matter of fact that large enough codeframes are mostly to be found in CBT (e.g., when classifying text according to domain-specific taxonomies/thesauri). Other classification tasks of a non-topical nature are often characterized by codeframes consisting of two or three classes; examples of this are classification by sensitivity (Sensitive versus NonSensitive) (Berardi et al. 2015), sentiment classification (Positive versus Neutral versus Negative) (Pang and Lee 2008), or classification by subjectivity (Subjective versus Objective) (Riloff et al. 2005).
 
43
While in this paper we have focused on classification, we should note that WCEs are straightforwardly applicable to regression tasks too. One reason why we exclusively concentrate on classification is that, in the realm of text, classification is a far more popular task than regression. In other words, there are many more applications of text classification than of text regression, which also means that there are fewer publicly available datasets for experimenting on text regression. A second reason why we have focused on classification is that most text regression tasks are not multiclass, i.e., there is a single class (or “concept”) of interest and the regressor must label a document with a real-valued score for that concept. “Single-class regression” is the regression equivalent of binary classification, and in Sect. 5.3 we have argued that WCEs are not suitable for binary classification; for the very same reasons they are not suitable for “single-class regression”. For all these reasons, in this paper we restrict our interest to (multiclass) classification.
 
References
Ando RK, Zhang T (2005) A framework for learning predictive structures from multiple tasks and unlabeled data. J Mach Learn Res 6:1817–1853
Baker D, McCallum AK (1998) Distributional clustering of words for text classification. In: Proceedings of the 21st ACM international conference on research and development in information retrieval (SIGIR 1998), Melbourne, AU, pp 96–103. https://doi.org/10.1145/290941.290970
Baldi P (2011) Autoencoders, unsupervised learning, and deep architectures. In: Proceedings of the ICML 2011 workshop on unsupervised and transfer learning, Bellevue, US, pp 37–49
Baroni M, Dinu G, Kruszewski G (2014) Don’t count, predict! A systematic comparison of context-counting vs. context-predicting semantic vectors. In: Proceedings of the 52nd annual meeting of the association for computational linguistics (ACL 2014), Baltimore, US, pp 238–247. https://doi.org/10.3115/v1/p14-1023
Bekkerman R, El-Yaniv R, Tishby N, Winter Y (2003) Distributional word clusters vs. words for text categorization. J Mach Learn Res 3:1183–1208
Bengio Y, Ducharme R, Vincent P, Jauvin C (2003) A neural probabilistic language model. J Mach Learn Res 3:1137–1155
Berardi G, Esuli A, Macdonald C, Ounis I, Sebastiani F (2015) Semi-automated text classification for sensitivity identification. In: Proceedings of the 24th ACM international conference on information and knowledge management (CIKM 2015), Melbourne, AU, pp 1711–1714. https://doi.org/10.1145/2806416.2806597
Bhatia K, Jain H, Kar P, Varma M, Jain P (2015) Sparse local embeddings for extreme multi-label classification. In: Proceedings of the 29th annual conference on neural information processing systems (NIPS 2015), Montreal, CA, pp 730–738
Blei DM, Ng AY, Jordan MI (2003) Latent Dirichlet allocation. J Mach Learn Res 3:993–1022
Blitzer J, McDonald R, Pereira F (2006) Domain adaptation with structural correspondence learning. In: Proceedings of the 4th conference on empirical methods in natural language processing (EMNLP 2006), Sydney, AU, pp 120–128. https://doi.org/10.3115/1610075.1610094
Collobert R, Weston J, Bottou L, Karlen M, Kavukcuoglu K, Kuksa P (2011) Natural language processing (almost) from scratch. J Mach Learn Res 12:2493–2537
Cortes C, Vapnik V (1995) Support vector networks. Mach Learn 20(3):273–297
Daumé H (2007) Frustratingly easy domain adaptation. In: Proceedings of the 45th annual meeting of the association for computational linguistics (ACL 2007), Prague, CZ, pp 256–263
Deerwester S, Dumais ST, Furnas GW, Landauer TK, Harshman R (1990) Indexing by latent semantic analysis. J Am Soc Inf Sci 41(6):391–407
Devlin J, Chang M, Lee K, Toutanova K (2019) BERT: pre-training of deep bidirectional transformers for language understanding. In: Proceedings of the 2019 conference of the North American chapter of the association for computational linguistics (NAACL 2019), Minneapolis, US, pp 4171–4186
Dumais ST, Platt J, Heckerman D, Sahami M (1998) Inductive learning algorithms and representations for text categorization. In: Proceedings of the 7th ACM international conference on information and knowledge management (CIKM 1998), Bethesda, US, pp 148–155. https://doi.org/10.1145/288627.288651
Erhan D, Bengio Y, Courville A, Manzagol PA, Vincent P, Bengio S (2010) Why does unsupervised pre-training help deep learning? J Mach Learn Res 11:625–660
Garneau N, Leboeuf J, Lamontagne L (2019) Contextual generation of word embeddings for out-of-vocabulary words in downstream tasks. In: Proceedings of the 32nd Canadian conference on artificial intelligence (Canadian AI), Kingston, CA, pp 563–569. https://doi.org/10.1007/978-3-030-18305-9_60
Glorot X, Bengio Y (2010) Understanding the difficulty of training deep feedforward neural networks. In: Proceedings of the 13th international conference on artificial intelligence and statistics (AISTATS 2010), Chia Laguna, Italy, pp 249–256
Grave E, Mikolov T, Joulin A, Bojanowski P (2017) Bag of tricks for efficient text classification. In: Proceedings of the 15th conference of the European chapter of the association for computational linguistics (EACL 2017), Valencia, ES, pp 427–431. https://doi.org/10.18653/v1/e17-2068
Gupta S, Kanchinadam T, Conathan D, Fung G (2019) Task-optimized word embeddings for text classification representations. Front Appl Math Stat 5:67
Hersh W, Buckley C, Leone T, Hickman D (1994) OHSUMED: an interactive retrieval evaluation and new large text collection for research. In: Proceedings of the 17th ACM international conference on research and development in information retrieval (SIGIR 1994), Dublin, IE, pp 192–201. https://doi.org/10.1007/978-1-4471-2099-5_20
Hochreiter S, Schmidhuber J (1997) Long short-term memory. Neural Comput 9(8):1735–1780
Hsu DJ, Kakade SM, Langford J, Zhang T (2009) Multi-label prediction via compressed sensing. In: Proceedings of the 23rd annual conference on neural information processing systems (NIPS 2009), Vancouver, CA, pp 772–780
Jin P, Zhang Y, Chen X, Xia Y (2016) Bag-of-embeddings for text classification. In: Proceedings of the 26th international joint conference on artificial intelligence (IJCAI 2016), New York, US, pp 2824–2830
Joachims T (1998) Text categorization with support vector machines: learning with many relevant features. In: Proceedings of the 10th European conference on machine learning (ECML 1998), Chemnitz, DE, pp 137–142. https://doi.org/10.1007/bfb0026683
Joachims T (2001) A statistical learning model of text classification for support vector machines. In: Proceedings of the 24th ACM conference on research and development in information retrieval (SIGIR 2001), New Orleans, US, pp 128–136. https://doi.org/10.1145/383952.383974
Kim Y (2014) Convolutional neural networks for sentence classification. In: Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP 2014), Doha, QA, pp 1746–1751
Kim Y, Jernite Y, Sontag D, Rush AM (2016) Character-aware neural language models. In: Proceedings of the 30th AAAI conference on artificial intelligence (AAAI 2016), Phoenix, US, pp 2741–2749
Kingma DP, Ba J (2015) Adam: a method for stochastic optimization. In: Proceedings of the 3rd international conference on learning representations (ICLR 2015), San Diego, US
Lai S, Xu L, Liu K, Zhao J (2015) Recurrent convolutional neural networks for text classification. In: Proceedings of the 29th AAAI conference on artificial intelligence (AAAI 2015), Austin, US, pp 2267–2273
Lan M, Tan CL, Su J, Lu Y (2009) Supervised and traditional term weighting methods for automatic text categorization. IEEE Trans Pattern Anal Mach Intell 31(4):721–735
Le HT, Cerisara C, Denis A (2018) Do convolutional networks need to be deep for text classification? In: Proceedings of the AAAI 2018 workshop on affective content analysis, New Orleans, US, pp 29–36
LeCun Y, Bengio Y, Hinton G (2015) Deep learning. Nature 521(7553):436–444
Lei X, Cai Y, Xu J, Ren D, Li Q, Leung HF (2019) Incorporating task-oriented representation in text classification. In: Proceedings of the 24th international conference on database systems for advanced applications (DASFAA 2019), Chiang Mai, TH, pp 401–415
Levy O, Goldberg Y, Dagan I (2015) Improving distributional similarity with lessons learned from word embeddings. Trans Assoc Comput Linguist 3:211–225
Levy O, Goldberg Y (2014) Neural word embedding as implicit matrix factorization. In: Proceedings of the 28th annual conference on neural information processing systems (NIPS 2014), Montreal, CA, pp 2177–2185
Lewis DD (1992) An evaluation of phrasal and clustered representations on a text categorization task. In: Proceedings of the 15th ACM international conference on research and development in information retrieval (SIGIR 1992), Kobenhavn, DK, pp 37–50
Lin J (2019) The neural hype and comparisons against weak baselines. SIGIR Forum 52(1):40–51
Luong T, Pham H, Manning CD (2015) Effective approaches to attention-based neural machine translation. In: Proceedings of the 2015 conference on empirical methods in natural language processing (EMNLP 2015), Lisbon, PT, pp 1412–1421
McCann B, Bradbury J, Xiong C, Socher R (2017) Learned in translation: contextualized word vectors. In: Proceedings of the 31st annual conference on neural information processing systems (NIPS 2017), Long Beach, US, pp 6294–6305
Mikolov T, Chen K, Corrado G, Dean J (2013a) Efficient estimation of word representations in vector space. In: Workshop track proceedings of the 1st international conference on learning representations (ICLR 2013), Scottsdale, US
Mikolov T, Grave E, Bojanowski P, Puhrsch C, Joulin A (2018) Advances in pre-training distributed word representations. In: Proceedings of the 11th international conference on language resources and evaluation (LREC 2018), Miyazaki, JP
Mikolov T, Sutskever I, Chen K, Corrado GS, Dean J (2013b) Distributed representations of words and phrases and their compositionality. In: Proceedings of the 27th annual conference on neural information processing systems (NIPS 2013), Lake Tahoe, US, pp 3111–3119
Mnih A, Kavukcuoglu K (2013) Learning word embeddings efficiently with noise-contrastive estimation. In: Proceedings of the 27th annual conference on neural information processing systems (NIPS 2013), Lake Tahoe, US, pp 2265–2273
Moreo A, Esuli A, Sebastiani F (2016) Distributional correspondence indexing for cross-lingual and cross-domain sentiment classification. J Artif Intell Res 55:131–163
Morik K, Brockhausen P, Joachims T (1999) Combining statistical learning with a knowledge-based approach. A case study in intensive care monitoring. In: Proceedings of the 16th international conference on machine learning (ICML 1999), Bled, SL, pp 268–277
Pang B, Lee L (2008) Opinion mining and sentiment analysis. Found Trends Inf Retr 2(1/2):1–135
Pappas N, Henderson J (2019) GILE: a generalized input-label embedding for text classification. Trans Assoc Comput Linguist 7:139–155
Pennington J, Socher R, Manning C (2014) GloVe: global vectors for word representation. In: Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP 2014), Doha, QA, pp 1532–1543
Peters ME, Neumann M, Iyyer M, Gardner M, Clark C, Lee K, Zettlemoyer L (2018) Deep contextualized word representations. In: Proceedings of the 2018 conference of the North American chapter of the association for computational linguistics (NAACL 2018), New Orleans, US, pp 2227–2237
Ren H, Zeng Z, Cai Y, Du Q, Li Q, Xie H (2019) A weighted word embedding model for text classification. In: Proceedings of the 24th international conference on database systems for advanced applications (DASFAA 2019), Chiang Mai, TH, pp 419–434
Riloff E, Wiebe J, Phillips W (2005) Exploiting subjectivity classification to improve information extraction. In: Proceedings of the 12th conference of the American association for artificial intelligence (AAAI 2005), Pittsburgh, US, pp 1106–1111
Sahlgren M (2005) An introduction to random indexing. In: Proceedings of the TKE workshop on methods and applications of semantic indexing, Copenhagen, DK
Socher R, Perelygin A, Wu J, Chuang J, Manning CD, Ng A, Potts C (2013) Recursive deep models for semantic compositionality over a sentiment treebank. In: Proceedings of the 2013 conference on empirical methods in natural language processing (EMNLP 2013), Seattle, US, pp 1631–1642
Soucy P, Mineau GW (2005) Beyond TFIDF weighting for text categorization in the vector space model. In: Proceedings of the 19th international joint conference on artificial intelligence (IJCAI 2005), Edinburgh, UK, pp 1130–1135
Srivastava N, Hinton G, Krizhevsky A, Sutskever I, Salakhutdinov R (2014) Dropout: a simple way to prevent neural networks from overfitting. J Mach Learn Res 15:1929–1958
Steinberger R, Pouliquen B, Widiger A, Ignat C, Erjavec T, Tufis D, Varga D (2006) The JRC-Acquis: a multilingual aligned parallel corpus with 20+ languages. In: Proceedings of the 5th international conference on language resources and evaluation (LREC 2006), Genova, IT, pp 2142–2147
Tang J, Qu M, Mei Q (2015) PTE: predictive text embedding through large-scale heterogeneous text networks. In: Proceedings of the 21st ACM international conference on knowledge discovery and data mining (KDD 2015), Sydney, AU, pp 1165–1174
van der Maaten L, Hinton G (2008) Visualizing data using t-SNE. J Mach Learn Res 9:2579–2605
Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, Kaiser L, Polosukhin I (2017) Attention is all you need. In: Proceedings of the 31st annual conference on neural information processing systems (NIPS 2017), Long Beach, US, pp 5998–6008
Wang G, Li C, Wang W, Zhang Y, Shen D, Zhang X, Henao R, Carin L (2018) Joint embedding of words and labels for text classification. In: Proceedings of the 56th annual meeting of the association for computational linguistics (ACL 2018), Melbourne, AU, pp 2321–2331
Wang S, Manning CD (2012) Baselines and bigrams: simple, good sentiment and topic classification. In: Proceedings of the 50th annual meeting of the association for computational linguistics (ACL 2012), Jeju Island, KR, pp 90–94
Yang Y, Chute CG (1994) An example-based mapping method for text categorization and retrieval. ACM Trans Inf Syst 12(3):252–277
Yang Z, Dai Z, Yang Y, Carbonell JG, Salakhutdinov R, Le QV (2019b) XLNet: generalized autoregressive pretraining for language understanding. In: Proceedings of the 33rd annual conference on neural information processing systems (NeurIPS 2019), Vancouver, CA, pp 5754–5764
Yang W, Lu K, Yang P, Lin J (2019a) Critically examining the “neural hype”: weak baselines and the additivity of effectiveness gains from neural ranking models. In: Proceedings of the 42nd ACM conference on research and development in information retrieval (SIGIR 2019), Paris, FR, pp 1129–1132. https://doi.org/10.1145/3331184.3331340
Yu HF, Jain P, Kar P, Dhillon I (2014) Large-scale multi-label learning with missing labels. In: Proceedings of the 31st international conference on machine learning (ICML 2014), Beijing, CN, pp 593–601
Zhang X, Zhao J, LeCun Y (2015) Character-level convolutional networks for text classification. In: Proceedings of the 29th annual conference on neural information processing systems (NIPS 2015), Montreal, CA, pp 649–657
Metadata
Title
Word-class embeddings for multiclass text classification
Authors
Alejandro Moreo
Andrea Esuli
Fabrizio Sebastiani
Publication date
19.02.2021
Publisher
Springer US
Published in
Data Mining and Knowledge Discovery / Issue 3/2021
Print ISSN: 1384-5810
Electronic ISSN: 1573-756X
DOI
https://doi.org/10.1007/s10618-020-00735-3
