Published in: Data Mining and Knowledge Discovery 3/2021

19.02.2021

Word-class embeddings for multiclass text classification

Authors: Alejandro Moreo, Andrea Esuli, Fabrizio Sebastiani


Abstract

Pre-trained word embeddings encode general word semantics and lexical regularities of natural language, and have proven useful across many NLP tasks, including word sense disambiguation, machine translation, and sentiment analysis, to name a few. In supervised tasks such as multiclass text classification (the focus of this article) it seems appealing to enhance word representations with ad-hoc embeddings that encode task-specific information. We propose (supervised) word-class embeddings (WCEs), and show that, when concatenated to (unsupervised) pre-trained word embeddings, they substantially facilitate the training of deep-learning models in multiclass classification by topic. We show empirical evidence that WCEs yield a consistent improvement in multiclass classification accuracy, using six popular neural architectures and six widely used and publicly available datasets for multiclass text classification. One further advantage of this method is that it is conceptually simple and straightforward to implement. Our code that implements WCEs is publicly available at https://github.com/AlexMoreo/word-class-embeddings.
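To make the construction concrete, the following minimal sketch (with hypothetical shapes and variable names, not code from the authors' repository) shows how a word-class embedding matrix can be concatenated to a pre-trained embedding matrix to initialize the embedding layer of a neural classifier:

```python
import numpy as np

# Hypothetical sizes: |V| = 10,000 words, 300-dimensional pre-trained embeddings,
# m = 20 classes (so WCEs add 20 task-specific dimensions per word).
pretrained = np.random.randn(10_000, 300)   # stand-in for GloVe/word2vec vectors
wce = np.random.rand(10_000, 20)            # stand-in for the supervised WCE matrix

# Each word is then represented by a 320-dimensional vector that feeds the
# embedding layer of the network.
embedding_matrix = np.concatenate([pretrained, wce], axis=1)
print(embedding_matrix.shape)               # (10000, 320)
```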


Footnotes
1
Given a set of classes (a.k.a. a codeframe) \(\mathcal {C}=\{c_{1},\ldots ,c_{m}\}\), a classification problem is said to be multiclass if \(m>2\); it is said to be single-label if each item always belongs to exactly one class; it is said to be multilabel if each item can belong to any number (i.e., 0, 1, or more than 1) of classes in \(\mathcal {C}\).
 
2
fastText can consider not only unigrams but also n-grams and subwords as the surface forms of input.
 
3
Pointwise Mutual Information (PMI) is defined as \(\mathrm {PMI}(w_{i},c_{j})=\log \frac{\Pr (w_{i},c_{j})}{\Pr (w_{i})\Pr (c_{j})}\), where \(\Pr (w_{i},c_{j})\) is the joint probability of word \(w_{i}\) and context \(c_{j}\), and \(\Pr (w_{i})\) and \(\Pr (c_{j})\) are the marginal probabilities of the word and context, respectively. PPMI takes the positive part of PMI, i.e., \(\mathrm {PPMI}(w_{i},c_{j})=\max \{0,\mathrm {PMI}(w_{i},c_{j})\}\).
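As a small self-contained illustration of this definition (not code from the paper), PPMI can be computed from a matrix of raw word–context co-occurrence counts as follows:

```python
import numpy as np

def ppmi(counts):
    """Positive PMI from a (words x contexts) matrix of raw co-occurrence counts."""
    total = counts.sum()
    p_wc = counts / total                    # joint probabilities Pr(w_i, c_j)
    p_w = p_wc.sum(axis=1, keepdims=True)    # marginals Pr(w_i)
    p_c = p_wc.sum(axis=0, keepdims=True)    # marginals Pr(c_j)
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log(p_wc / (p_w * p_c))
    pmi[~np.isfinite(pmi)] = 0.0             # zero counts contribute nothing
    return np.maximum(pmi, 0.0)              # PPMI = max{0, PMI}
```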
 
4
The compatibility between a label embedding matrix E and a document embedding h is defined to be proportional to \(\sigma (EU+b_u)\sigma (Vh+b_v)\); this contrasts with previous related literature, which customarily relied on bilinear models of the form EWh for the same purpose (\(U,b_u,V,b_v,W\) are learnable parameters).
 
5
Put another way, L1 normalization fixes a “budget” of mass 1 for the scores a term can deliver across the classes, irrespective of its prevalence in language or in the corpus.
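A minimal sketch of this normalization (variable names are illustrative; a non-negative (terms × classes) score matrix is assumed):

```python
import numpy as np

def l1_normalize_rows(S):
    """Give each term a total 'budget' of 1 to distribute across the classes."""
    totals = S.sum(axis=1, keepdims=True)
    totals[totals == 0.0] = 1.0   # leave all-zero rows untouched
    return S / totals
```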
 
6
It is worth recalling that the bag-of-words model tends to produce matrices that are highly sparse. Many software packages take advantage of this sparsity in order to compute matrix multiplication efficiently, at a cost that, in practice, falls far below the asymptotic bound \(O(|\mathcal {V}|nm)\). We discuss empirical computational complexity issues in Sect. 4.7.
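For instance, with scipy such a sparse product costs time roughly proportional to the number of non-zero entries; the shapes below are purely illustrative:

```python
import numpy as np
from scipy.sparse import random as sparse_random

X = sparse_random(10_000, 50_000, density=0.001, format="csr")  # sparse bag-of-words
Y = (np.random.rand(10_000, 100) > 0.9).astype(float)           # toy label matrix
WC = X.T @ Y   # (terms x classes); only the non-zero entries of X are visited
```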
 
7
PCA is based on (truncated) Singular Value Decomposition (SVD). The SVD of a matrix \(\mathbf {M}\) is a factorization of the form \(\mathbf {U}{\varvec{\Sigma }}\mathbf {V}^\top \), in which \(\mathbf {U}\) and \(\mathbf {V}\) are orthogonal matrices containing, respectively, the left and right singular vectors of \(\mathbf {M}\) as their columns, and \(\varvec{\Sigma }\) is a diagonal matrix containing the singular values of \(\mathbf {M}\). That is, PCA is an alternative to Eq. 7 for factoring \(\mathbf {M}\). The dimensionality reduction is achieved by sorting the components by decreasing singular value and truncating the matrices. The reduced rank-r representation of \(\mathbf {M}\) is thus given by \(\mathbf {U}_{r}{\varvec{\Sigma }}_{r}\), which accounts only for the r largest singular values and their corresponding r left singular vectors.
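In code, such a truncation can be obtained, e.g., with scikit-learn's TruncatedSVD; the matrix and the target rank r below are hypothetical:

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD

WC = np.random.rand(5_000, 200)          # stand-in for a (terms x classes) matrix
r = 50                                   # hypothetical target dimensionality
svd = TruncatedSVD(n_components=r, random_state=0)
WC_reduced = svd.fit_transform(WC)       # ~ U_r * Sigma_r: one r-dimensional row per term
```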
 
8
Note that the columns of \(\mathbf {Y}\) are binary, indicating the presence or absence of the label for each document. It is interesting to view these binary columns as indicator functions that decide which elements of each row of \(\mathbf {X}_{1}^\top \) contribute to the summation in the dot product.
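A toy numerical check of this reading (illustrative values only):

```python
import numpy as np

X1 = np.random.rand(5, 4)             # 5 training documents, 4 terms
Y = np.array([[1, 0],
              [0, 1],
              [1, 0],
              [1, 1],
              [0, 0]], dtype=float)   # 5 documents, 2 classes
WC = X1.T @ Y                         # (terms x classes)
# The class-0 column is just the sum of the rows of X1 whose documents belong to class 0.
assert np.allclose(WC[:, 0], X1[[0, 2, 3]].sum(axis=0))
```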
 
9
Since we undertake a stochastic optimization, this actually applies to batches of data.
 
11
http://qwone.com/~jason/20Newsgroups/. Note that this version of 20Newsgroups is indeed single-label: while a previous version contained a small set of documents with more than one label (corresponding to posts that had been cross-posted to more than one newsgroup), that set is not present in the version we use.
 
12
While some previous papers [e.g., Tang et al. (2015)] have reported substantially higher scores for this dataset, it is worth noting that we use a harder, more realistic version of the dataset than has been used in those papers. Following Moreo et al. (2020), in our version we remove all headers, footers, and quotes, since these fields contain words that are highly correlated with the target labels, thus making the classification task unrealistically easy; see http://scikit-learn.org/stable/datasets/twenty_newsgroups.html for further details. Our results are indeed consistent with other papers following the same policy.
 
18
Note that these deep models are not meant to serve as baselines here, but rather as vehicles on which to test WCEs. In other words, the actual baseline for any model equipped with WCEs is the same model not using WCEs.
 
20
We generate the validation set by randomly sampling 20% of the training set, with a maximum of 20,000 documents; the rest is taken to be the training set proper. We keep the training/validation split consistent across all methods.
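A sketch of this split policy with scikit-learn (the data below is a toy stand-in for the actual training set):

```python
from sklearn.model_selection import train_test_split

train_docs = [f"document {i}" for i in range(1_000)]   # toy training documents
y_train = [i % 10 for i in range(1_000)]               # toy labels

# 20% of the training documents go to validation, capped at 20,000 documents;
# a fixed random_state keeps the split consistent across methods.
val_size = min(int(0.2 * len(train_docs)), 20_000)
tr_docs, val_docs, y_tr, y_val = train_test_split(
    train_docs, y_train, test_size=val_size, random_state=42)
```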
 
21
Note that, consistent with (Cortes and Vapnik 1995; Morik et al. 1999), in this formulation we assume the class labels \(y_{k}\) to be in \(\{-1,+1\}\), while in Sect. 3 we had assumed them to be in \(\{0,1\}\); the difference is, of course, unproblematic.
 
22
In scikit-learn this is achieved by setting \(J_+=n/(mP)\) and \(J_-=n/(mN)\), and corresponds to setting the parameter class_weight to “balanced”.
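In code (a sketch; X_train and y_train are assumed to be available):

```python
from sklearn.svm import LinearSVC

# class_weight="balanced" reweights errors by n / (m * n_c), i.e., J+ = n/(mP)
# and J- = n/(mN) in each binary (one-vs-rest) problem.
svm = LinearSVC(C=1.0, class_weight="balanced")
# svm.fit(X_train, y_train)
```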
 
23
Somewhat surprisingly, though, several relevant related works where SVMs are used as baselines [see, e.g., (Grave et al. 2017; Jiang et al. 2018; Zhang et al. 2015)] do not report the details of how, if at all, they tune the SVM hyperparameters.
 
24
Using k-fold cross-validation (k-FCV) on the full set of labelled documents is a more expensive, but stronger, way of doing parameter optimization than using a single split between a training set and a validation set, because k-FCV performs k such splits. We here use k-FCV for SVMs and single-split optimization for the deep-learning-based architectures because it is realistic to do so, i.e., because SVMs are computationally cheap enough that we can afford k-FCV, while the neural architectures are not.
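A hedged sketch of k-FCV hyperparameter optimization for the SVM with scikit-learn; the C grid and the scoring function are illustrative, not necessarily those used in the paper:

```python
from sklearn.model_selection import GridSearchCV
from sklearn.svm import LinearSVC

grid = GridSearchCV(LinearSVC(class_weight="balanced"),
                    param_grid={"C": [0.001, 0.01, 0.1, 1, 10, 100]},
                    cv=5, scoring="f1_macro", n_jobs=-1)
# grid.fit(X_train, y_train)  # refits on the whole labelled set with the best C
```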
 
26
The STW functions we have considered include chi-square, information gain, gain ratio, pointwise mutual information (Debole and Sebastiani 2003), ConfWeight (Soucy and Mineau 2005), and relevance frequency (Lan et al. 2009).
 
27
Given a word w, a codeframe \(\mathcal {C}=\{c_{1},\ldots ,c_m\}\), and an STW function f that generates a list of scores \(S=(f(w,c_{1}),\ldots ,f(w,c_m))\), we consider the following aggregation functions: averaging \(\left( \frac{1}{m}\sum _{c\in \mathcal {C}}f(w,c)\right) \), averaging weighted by class prevalence \(\left( \frac{\sum _{c\in \mathcal {C}}f(w,c)p(c)}{\sum _{c\in \mathcal {C}}p(c)}\right) \) where p(c) is the prevalence of class c, and max-pooling \(\left( \max _{c\in \mathcal {C}}\{f(w,c)\}\right) \).
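These three aggregation policies can be written compactly as follows (an illustrative helper, not code from the paper):

```python
import numpy as np

def aggregate(scores, prevalences, how="mean"):
    """Collapse the per-class scores (f(w,c_1),...,f(w,c_m)) of a word w into one value."""
    if how == "mean":          # plain averaging
        return float(scores.mean())
    if how == "weighted":      # averaging weighted by class prevalence p(c)
        return float((scores * prevalences).sum() / prevalences.sum())
    if how == "max":           # max-pooling
        return float(scores.max())
    raise ValueError(f"unknown aggregation: {how}")
```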
 
31
Note that by fastText we here mean its “supervised” mode, that is, fastText as a classifier. The set of embeddings that fastText produces when working in “unsupervised mode” are later used and discussed in Sect. 4.9, along with other sets of embeddings.
 
33
We modified the official implementation of https://github.com/guoyinwang/LEAM to use early stopping.
 
34
Though most traditional functions used for feature selection can only use presence/absence, other metrics exist that work with weighted scores, e.g., the Fisher score. In initial experiments not described in this paper we did try the Fisher score, but eventually gave up on it, because (a) its computation is very slow, and (b) the classification accuracy we observed is not much different from what can be obtained with the other functions mentioned above, and is often intermediate between the best and the worst recorded values.
 
37
We used the Hugging Face implementation available at https://github.com/huggingface/transformers.
 
39
More often than not, BERT is used by fine-tuning the entire model to the task at hand. In this set of experiments we prefer to reproduce a simpler scenario, in which the practitioner simply uses pre-trained models as made available by the developers of BERT. Fine-tuning models such as BERT requires a considerable amount of computational power, which might not be within everyone's reach. Experiments showcasing how a properly fine-tuned BERT works (with and without WCEs) on our datasets are illustrated in Sect. 4.4.
 
41
Another technique for solving this problem is Latent Semantic Imputation (Yao et al. 2019). This method fills in the missing representation of a word in a vector space (in our case, the space of WCEs) by analyzing the neighborhood of that word's representation in another vector space (in our case, the space of unsupervised embeddings), via techniques inspired by manifold learning.
 
42
The fact that WCEs are not suitable for codeframes containing just a few classes is the reason why all the datasets we have chosen for our experiments are for classification by topic (CBT). While WCEs are not inherently about CBT, it is a matter of fact that large enough codeframes are mostly to be found in CBT (e.g., when classifying text according to domain-specific taxonomies/thesauri). Other classification tasks of a non-topical nature are often characterized by codeframes consisting of two or three classes; examples of this are classification by sensitivity (Sensitive versus NonSensitive) (Berardi et al. 2015), sentiment classification (Positive versus Neutral versus Negative) (Pang and Lee 2008), or classification by subjectivity (Subjective versus Objective) (Riloff et al. 2005).
 
43
While in this paper we have focused on classification, we should note that WCEs are straightforwardly applicable to regression tasks too. One reason why we exclusively concentrate on classification is that, in the realm of text, classification is a far more popular task than regression. In other words, there are many more applications of text classification than of text regression, which also means that there are fewer publicly available datasets for experimenting on text regression. A second reason why we have focused on classification is that most text regression tasks are not multiclass, i.e., there is a single class (or “concept”) of interest and the regressor must label a document with a real-valued score for that concept. “Single-class regression” is the regression equivalent of binary classification, and in Sect. 5.3 we have argued that WCEs are not suitable for binary classification; for the very same reasons they are not suitable for “single-class regression”. For all these reasons, in this paper we restrict our interest to (multiclass) classification.
 
References
Ando RK, Zhang T (2005) A framework for learning predictive structures from multiple tasks and unlabeled data. J Mach Learn Res 6:1817–1853
Baker D, McCallum AK (1998) Distributional clustering of words for text classification. In: Proceedings of the 21st ACM international conference on research and development in information retrieval (SIGIR 1998), Melbourne, AU, pp 96–103. https://doi.org/10.1145/290941.290970
Baldi P (2011) Autoencoders, unsupervised learning, and deep architectures. In: Proceedings of the ICML 2011 workshop on unsupervised and transfer learning, Bellevue, US, pp 37–49
Baroni M, Dinu G, Kruszewski G (2014) Don’t count, predict! A systematic comparison of context-counting vs. context-predicting semantic vectors. In: Proceedings of the 52nd annual meeting of the association for computational linguistics (ACL 2014), Baltimore, US, pp 238–247. https://doi.org/10.3115/v1/p14-1023
Bekkerman R, El-Yaniv R, Tishby N, Winter Y (2003) Distributional word clusters vs. words for text categorization. J Mach Learn Res 3:1183–1208
Bengio Y, Ducharme R, Vincent P, Jauvin C (2003) A neural probabilistic language model. J Mach Learn Res 3:1137–1155
Berardi G, Esuli A, Macdonald C, Ounis I, Sebastiani F (2015) Semi-automated text classification for sensitivity identification. In: Proceedings of the 24th ACM international conference on information and knowledge management (CIKM 2015), Melbourne, AU, pp 1711–1714. https://doi.org/10.1145/2806416.2806597
Bhatia K, Jain H, Kar P, Varma M, Jain P (2015) Sparse local embeddings for extreme multi-label classification. In: Proceedings of the 29th annual conference on neural information processing systems (NIPS 2015), Montreal, CA, pp 730–738
Blei DM, Ng AY, Jordan MI (2003) Latent Dirichlet allocation. J Mach Learn Res 3:993–1022
Blitzer J, McDonald R, Pereira F (2006) Domain adaptation with structural correspondence learning. In: Proceedings of the 4th conference on empirical methods in natural language processing (EMNLP 2006), Sydney, AU, pp 120–128. https://doi.org/10.3115/1610075.1610094
Collobert R, Weston J, Bottou L, Karlen M, Kavukcuoglu K, Kuksa P (2011) Natural language processing (almost) from scratch. J Mach Learn Res 12:2493–2537
Cortes C, Vapnik V (1995) Support vector networks. Mach Learn 20(3):273–297
Daumé H (2007) Frustratingly easy domain adaptation. In: Proceedings of the 45th annual meeting of the association for computational linguistics (ACL 2007), Prague, CZ, pp 256–263
Deerwester S, Dumais ST, Furnas GW, Landauer TK, Harshman R (1990) Indexing by latent semantic analysis. J Am Soc Inf Sci 41(6):391–407
Devlin J, Chang M, Lee K, Toutanova K (2019) BERT: pre-training of deep bidirectional transformers for language understanding. In: Proceedings of the 2019 conference of the North American chapter of the association for computational linguistics (NAACL 2019), Minneapolis, US, pp 4171–4186
Dumais ST, Platt J, Heckerman D, Sahami M (1998) Inductive learning algorithms and representations for text categorization. In: Proceedings of the 7th ACM international conference on information and knowledge management (CIKM 1998), Bethesda, US, pp 148–155. https://doi.org/10.1145/288627.288651
Erhan D, Bengio Y, Courville A, Manzagol PA, Vincent P, Bengio S (2010) Why does unsupervised pre-training help deep learning? J Mach Learn Res 11:625–660
Garneau N, Leboeuf J, Lamontagne L (2019) Contextual generation of word embeddings for out-of-vocabulary words in downstream tasks. In: Proceedings of the 32nd Canadian conference on artificial intelligence (Canadian AI), Kingston, CA, pp 563–569. https://doi.org/10.1007/978-3-030-18305-9_60
Glorot X, Bengio Y (2010) Understanding the difficulty of training deep feedforward neural networks. In: Proceedings of the 13th international conference on artificial intelligence and statistics (AISTATS 2010), Chia Laguna, Italy, pp 249–256
Grave E, Mikolov T, Joulin A, Bojanowski P (2017) Bag of tricks for efficient text classification. In: Proceedings of the 15th conference of the European chapter of the association for computational linguistics (EACL 2017), Valencia, ES, pp 427–431. https://doi.org/10.18653/v1/e17-2068
Gupta S, Kanchinadam T, Conathan D, Fung G (2019) Task-optimized word embeddings for text classification representations. Front Appl Math Stat 5:67
Hersh W, Buckley C, Leone T, Hickman D (1994) OHSUMED: an interactive retrieval evaluation and new large text collection for research. In: Proceedings of the 17th ACM international conference on research and development in information retrieval (SIGIR 1994), Dublin, IE, pp 192–201. https://doi.org/10.1007/978-1-4471-2099-5_20
Hochreiter S, Schmidhuber J (1997) Long short-term memory. Neural Comput 9(8):1735–1780
Hsu DJ, Kakade SM, Langford J, Zhang T (2009) Multi-label prediction via compressed sensing. In: Proceedings of the 23rd annual conference on neural information processing systems (NIPS 2009), Vancouver, CA, pp 772–780
Jin P, Zhang Y, Chen X, Xia Y (2016) Bag-of-embeddings for text classification. In: Proceedings of the 26th international joint conference on artificial intelligence (IJCAI 2016), New York, US, pp 2824–2830
Joachims T (1998) Text categorization with support vector machines: learning with many relevant features. In: Proceedings of the 10th European conference on machine learning (ECML 1998), Chemnitz, DE, pp 137–142. https://doi.org/10.1007/bfb0026683
Joachims T (2001) A statistical learning model of text classification for support vector machines. In: Proceedings of the 24th ACM conference on research and development in information retrieval (SIGIR 2001), New Orleans, US, pp 128–136. https://doi.org/10.1145/383952.383974
Kim Y (2014) Convolutional neural networks for sentence classification. In: Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP 2014), Doha, QA, pp 1746–1751
Kim Y, Jernite Y, Sontag D, Rush AM (2016) Character-aware neural language models. In: Proceedings of the 30th AAAI conference on artificial intelligence (AAAI 2016), Phoenix, US, pp 2741–2749
Kingma DP, Ba J (2015) Adam: a method for stochastic optimization. In: Proceedings of the 3rd international conference on learning representations (ICLR 2015), San Diego, US
Lai S, Xu L, Liu K, Zhao J (2015) Recurrent convolutional neural networks for text classification. In: Proceedings of the 29th AAAI conference on artificial intelligence (AAAI 2015), Austin, US, pp 2267–2273
Lan M, Tan CL, Su J, Lu Y (2009) Supervised and traditional term weighting methods for automatic text categorization. IEEE Trans Pattern Anal Mach Intell 31(4):721–735
Le HT, Cerisara C, Denis A (2018) Do convolutional networks need to be deep for text classification? In: Proceedings of the AAAI 2018 workshop on affective content analysis, New Orleans, US, pp 29–36
LeCun Y, Bengio Y, Hinton G (2015) Deep learning. Nature 521(7553):436–444
Lei X, Cai Y, Xu J, Ren D, Li Q, Leung HF (2019) Incorporating task-oriented representation in text classification. In: Proceedings of the 24th international conference on database systems for advanced applications (DASFAA 2019), Chiang Mai, TH, pp 401–415
Levy O, Goldberg Y, Dagan I (2015) Improving distributional similarity with lessons learned from word embeddings. Trans Assoc Comput Linguist 3:211–225
Levy O, Goldberg Y (2014) Neural word embedding as implicit matrix factorization. In: Proceedings of the 28th annual conference on neural information processing systems (NIPS 2014), Montreal, CA, pp 2177–2185
Lewis DD (1992) An evaluation of phrasal and clustered representations on a text categorization task. In: Proceedings of the 15th ACM international conference on research and development in information retrieval (SIGIR 1992), Kobenhavn, DK, pp 37–50
Lin J (2019) The neural hype and comparisons against weak baselines. SIGIR Forum 52(1):40–51
Luong T, Pham H, Manning CD (2015) Effective approaches to attention-based neural machine translation. In: Proceedings of the 2015 conference on empirical methods in natural language processing (EMNLP 2015), Lisbon, PT, pp 1412–1421
McCann B, Bradbury J, Xiong C, Socher R (2017) Learned in translation: contextualized word vectors. In: Proceedings of the 31st annual conference on neural information processing systems (NIPS 2017), Long Beach, US, pp 6294–6305
Mikolov T, Chen K, Corrado G, Dean J (2013a) Efficient estimation of word representations in vector space. In: Workshop track proceedings of the 1st international conference on learning representations (ICLR 2013), Scottsdale, US
Mikolov T, Grave E, Bojanowski P, Puhrsch C, Joulin A (2018) Advances in pre-training distributed word representations. In: Proceedings of the 11th international conference on language resources and evaluation (LREC 2018), Miyazaki, JP
Mikolov T, Sutskever I, Chen K, Corrado GS, Dean J (2013b) Distributed representations of words and phrases and their compositionality. In: Proceedings of the 27th annual conference on neural information processing systems (NIPS 2013), Lake Tahoe, US, pp 3111–3119
Mnih A, Kavukcuoglu K (2013) Learning word embeddings efficiently with noise-contrastive estimation. In: Proceedings of the 27th annual conference on neural information processing systems (NIPS 2013), Lake Tahoe, US, pp 2265–2273
Moreo A, Esuli A, Sebastiani F (2016) Distributional correspondence indexing for cross-lingual and cross-domain sentiment classification. J Artif Intell Res 55:131–163
Morik K, Brockhausen P, Joachims T (1999) Combining statistical learning with a knowledge-based approach. A case study in intensive care monitoring. In: Proceedings of the 16th international conference on machine learning (ICML 1999), Bled, SL, pp 268–277
Pang B, Lee L (2008) Opinion mining and sentiment analysis. Found Trends Inf Retr 2(1/2):1–135
Pappas N, Henderson J (2019) GILE: a generalized input-label embedding for text classification. Trans Assoc Comput Linguist 7:139–155
Pennington J, Socher R, Manning C (2014) GloVe: global vectors for word representation. In: Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP 2014), Doha, QA, pp 1532–1543
Peters ME, Neumann M, Iyyer M, Gardner M, Clark C, Lee K, Zettlemoyer L (2018) Deep contextualized word representations. In: Proceedings of the 2018 conference of the North American chapter of the association for computational linguistics (NAACL 2018), New Orleans, US, pp 2227–2237
Ren H, Zeng Z, Cai Y, Du Q, Li Q, Xie H (2019) A weighted word embedding model for text classification. In: Proceedings of the 24th international conference on database systems for advanced applications (DASFAA 2019), Chiang Mai, TH, pp 419–434
Riloff E, Wiebe J, Phillips W (2005) Exploiting subjectivity classification to improve information extraction. In: Proceedings of the 12th conference of the American association for artificial intelligence (AAAI 2005), Pittsburgh, US, pp 1106–1111
Sahlgren M (2005) An introduction to random indexing. In: Proceedings of the TKE workshop on methods and applications of semantic indexing, Copenhagen, DK
Socher R, Perelygin A, Wu J, Chuang J, Manning CD, Ng A, Potts C (2013) Recursive deep models for semantic compositionality over a sentiment treebank. In: Proceedings of the 2013 conference on empirical methods in natural language processing (EMNLP 2013), Seattle, US, pp 1631–1642
Soucy P, Mineau GW (2005) Beyond TFIDF weighting for text categorization in the vector space model. In: Proceedings of the 19th international joint conference on artificial intelligence (IJCAI 2005), Edinburgh, UK, pp 1130–1135
Srivastava N, Hinton G, Krizhevsky A, Sutskever I, Salakhutdinov R (2014) Dropout: a simple way to prevent neural networks from overfitting. J Mach Learn Res 15:1929–1958
Steinberger R, Pouliquen B, Widiger A, Ignat C, Erjavec T, Tufis D, Varga D (2006) The JRC-Acquis: a multilingual aligned parallel corpus with 20+ languages. In: Proceedings of the 5th international conference on language resources and evaluation (LREC 2006), Genova, IT, pp 2142–2147
Tang J, Qu M, Mei Q (2015) PTE: predictive text embedding through large-scale heterogeneous text networks. In: Proceedings of the 21st ACM international conference on knowledge discovery and data mining (KDD 2015), Sydney, AU, pp 1165–1174
van der Maaten L, Hinton G (2008) Visualizing data using t-SNE. J Mach Learn Res 9:2579–2605
Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, Kaiser L, Polosukhin I (2017) Attention is all you need. In: Proceedings of the 31st annual conference on neural information processing systems (NIPS 2017), Long Beach, US, pp 5998–6008
Wang G, Li C, Wang W, Zhang Y, Shen D, Zhang X, Henao R, Carin L (2018) Joint embedding of words and labels for text classification. In: Proceedings of the 56th annual meeting of the association for computational linguistics (ACL 2018), Melbourne, AU, pp 2321–2331
Wang S, Manning CD (2012) Baselines and bigrams: simple, good sentiment and topic classification. In: Proceedings of the 50th annual meeting of the association for computational linguistics (ACL 2012), Jeju Island, KR, pp 90–94
Yang Y, Chute CG (1994) An example-based mapping method for text categorization and retrieval. ACM Trans Inf Syst 12(3):252–277
Yang Z, Dai Z, Yang Y, Carbonell JG, Salakhutdinov R, Le QV (2019b) XLNet: generalized autoregressive pretraining for language understanding. In: Proceedings of the 33rd annual conference on neural information processing systems (NeurIPS 2019), Vancouver, CA, pp 5754–5764
Yang W, Lu K, Yang P, Lin J (2019a) Critically examining the “neural hype”: weak baselines and the additivity of effectiveness gains from neural ranking models. In: Proceedings of the 42nd ACM conference on research and development in information retrieval (SIGIR 2019), Paris, FR, pp 1129–1132. https://doi.org/10.1145/3331184.3331340
Yu HF, Jain P, Kar P, Dhillon I (2014) Large-scale multi-label learning with missing labels. In: Proceedings of the 31st international conference on machine learning (ICML 2014), Beijing, CN, pp 593–601
Zhang X, Zhao J, LeCun Y (2015) Character-level convolutional networks for text classification. In: Proceedings of the 29th annual conference on neural information processing systems (NIPS 2015), Montreal, CA, pp 649–657
Metadata
Title
Word-class embeddings for multiclass text classification
Authors
Alejandro Moreo
Andrea Esuli
Fabrizio Sebastiani
Publication date
19.02.2021
Publisher
Springer US
Published in
Data Mining and Knowledge Discovery / Issue 3/2021
Print ISSN: 1384-5810
Electronic ISSN: 1573-756X
DOI
https://doi.org/10.1007/s10618-020-00735-3
