
Open Access 28.01.2025 | Original Article

Semiparametric Latent Topic Modeling on Consumer-Generated Corpora

Authors: Dominic B. Dayta, Erniel B. Barrios

Published in: Annals of Data Science


Abstract

Common methods used for topic modeling have generally suffered from overfitting, leading to diminished predictive performance, as well as a weakness in reconstructing sparse topic structures that involve only a few critical words to aid in interpretation. Considering the text typically contained in customer feedback, this paper proposes a semiparametric topic model utilizing a two-step approach: (1) nonnegative matrix factorization recovers topic distributions based on word co-occurrences; and (2) semiparametric regression identifies factors driving the expression of particular topics in the documents, given additional auxiliary information such as location, time of writing, and other features of the author. This approach provides a generative model that can be useful for predicting topics in new documents based on these auxiliary variables, and is demonstrated to accurately identify topics even for documents limited in length or size of vocabulary. In an application to real customer feedback, the topics provided by our model are shown to be as interpretable and useful for downstream analysis tasks as those produced by current legacy methods.
Notes

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

1 Introduction

The fields of natural language processing and information retrieval saw a productive past two decades due largely to the emergence and worldwide adoption of two modern technologies: large-scale document indexing and storage facilities, of which perhaps the two most prominent brands are JSTOR and Google Books, and social networking sites that allow individual users to create and distribute various types of content, a considerable fraction of which exist in the form of texts (status updates, blog posts, and tweets). All these have led to a relentless growth in information-rich but unstructured collections of text data––referred to as corpora in natural language terminology––in terms of volume, velocity, and frequency such that manual approaches to document indexing and classification are quickly becoming obsolete.
Outside the context of online archives, methods that enable automated classification and analysis of voluminous corpora are valuable in their own right. Such methods have been applied to legal research [1] and to analyzing patterns behind railroad accidents [2]. In the commercial space, companies can take advantage of the thousands of posts contributed by users daily about their products and services on social media and review aggregator websites like Yelp and TripAdvisor.
Among the core functions of Customer Relations Management (CRM) departments in customer-facing industries is capturing what they call the Voice of the Customers (VOC). VOC refers to feedback self-reported by customers in the form of verbatim complaints, comments, inquiries, and the like, sent in via one or more points of capture. Presently, the standard approach that industries have taken towards the capture and analysis of VOC is the employment of a Business Process Outsourcing (BPO) partner, which in turn deploys customer care representatives to receive and process feedback. Representatives are trained in handling customers and are oriented towards categorizing feedback according to subject. Through this arrangement, previously unstructured data from call transcripts, e-mails, SMS, social media posts, and other possible venues are transformed into structured (i.e., tabular) summaries, which are then sent to the client company for resolution, action, and further analysis.
The proliferation of social networking and micro-blogging services on the Internet has given consumers an inexhaustible variety of platforms through which they may voice out satisfaction or dissatisfaction towards these companies’ products and services. All these have led to a relentless growth for VOC in terms of volume, velocity, and frequency such that manual approaches to feedback management are quickly becoming inefficient. Successful formulation of a new methodology for automated complaints classification would not only impact businesses directly concerned, but also their outsourced service providers as this would ease the growing tedium of manual feedback capture systems and allow for better, more strategic allocation and management of manpower.
This need for automation is hardly novel in the literature. Hotel reviews on certain travel websites have been analyzed to identify driving factors of customer satisfaction [3]. This was accomplished by grouping known words appearing in the reviews under general tokens that identify thematic similarities between customers’ complaints and commendations. Other, more sophisticated approaches involve the use of fuzzy algorithms [4].
Both approaches can be seen as forerunners to the use of topic modeling for analyzing VOC, wherein the topic structures were defined a priori by the researchers (or, in the latter case, through fuzzy logic). In true topic modeling, these structures are discovered rather than pre-determined by the analyst, and this discovery is the objective of the algorithm. In [3], Latent Semantic Analysis (LSA) is used to extract linguistic characteristics from customer complaints, and these characteristics were later used as features in a classification model.
The method of Singular Value Decomposition (SVD) is used to discover underlying “semantic structures” defined by word co-occurrences [5]. This method, like the others that would succeed it, depends on a specific representation of the corpus in a matrix form much more suited for statistical analysis. By representing each document as a vector of its frequencies across a set of unique words, the document vectors together form a matrix for the entire corpus, which can be subjected to factorization via SVD. This would be refined by Probabilistic Latent Semantic Analysis (PLSA) [6], which provided a more interpretable framework by defining the topics as probability distributions over words and replacing SVD with a more formal estimation procedure via the Expectation–Maximization (EM) algorithm. Later, Latent Dirichlet Allocation (LDA) addressed some of PLSA’s shortcomings by giving it a Bayesian flavor [7]. Nevertheless, LSA has arguably set the direction of much of the research regarding topic modeling.
This paper proposes a two-stage semiparametric topic modeling (SemiparTM) approach. The first stage is based on LSA, replacing the use of SVD with nonnegative matrix factorization with the goal of addressing LSA’s lack of clear interpretations towards topic distributions. A second stage that makes use of semiparametric regression is added to address criticisms that have also been levelled against PLSA and LDA. Specifically, the semiparametric method is aimed at accurately reproducing latent topic structures in the corpus, providing a generative model that yields better prediction of topics in new documents that have yet to enter the corpus.
The rest of this paper is organized as follows: in the next section, we formalize the vector space representation for converting unstructured text documents into a numerical format that can be characterized statistically. This section also covers basic notation that will be useful for discussing the formulation of SemiparTM in the following section. Section 4 outlines the simulation study by which we evaluate and compare the performance of SemiparTM against LSA, PLSA, and LDA in discovering latent topic structures, the results of which are presented in Sect. 5. Application of the proposed method to actual consumer feedback is presented in Sect. 6. Finally, we close with some concluding remarks and practical recommendations for using SemiparTM on real-world corpora.

2 Topic Modeling

Topic modeling requires a way of translating unstructured texts into quantities on which statistical models can be trained. One popular approach, used in the legacy methods as well as a number of succeeding methods, is the Vector Space Model, or the Bag of Words approach. In this approach, documents are converted into vectors whose components correspond to scores for the unique words existing in the entire corpus. Note that score is used as a general term covering a variety of procedures for representing, in numerical form, the level of occurrence, importance, or other salience of words in the documents of a corpus.
Suppose a corpus is composed of \(D\) documents which together make use of a total of \(W\) unique words. A document \({d}_{i}\) in the corpus may be referred to as the \(W\times 1\) vector of scores, \({d}_{i} = \left[ {w}_{1i}, {w}_{2i}, {w}_{3i}, ..., {w}_{Wi}\right]\). This process of translating documents into vectors is illustrated in Fig. 1. All the \(D\) documents in the corpus can then be stacked into a \(W\times D\) matrix \({\varvec{Y}}\) where:
$$ {\varvec{Y}}_{W \times D} = \left[ {\begin{array}{*{20}c} {w_{11} } & {w_{12} } & {w_{13} } & \ldots & {w_{1D} } \\ {w_{21} } & {w_{22} } & {w_{23} } & \ldots & {w_{2D} } \\ \ldots & \ldots & \ldots & \ldots & \ldots \\ {w_{W1} } & {w_{W2} } & {w_{W3} } & \ldots & {w_{WD} } \\ \end{array} } \right] = \left[ { \begin{array}{*{20}c} {d_{1} } & {d_{2} } & {d_{3} } & \ldots & {d_{D} } \\ \end{array} } \right] $$
(1)
In this formulation, the values \({Y}_{ij}\) in matrix \({\varvec{Y}}\) refer to the score for word \(i\) in document \(j\) [8].
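As an illustration of this representation, the following sketch (not from the paper; a minimal bag-of-words builder assuming whitespace tokenization, lowercasing, and raw frequencies as scores) constructs the \(W\times D\) matrix \({\varvec{Y}}\):

```python
import numpy as np

def bag_of_words(docs):
    """Build the W x D count matrix Y from a list of raw documents.

    Each column is one document; each row is one unique word in the
    corpus. Raw frequencies are used as the word 'scores'.
    """
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({w for toks in tokenized for w in toks})
    index = {w: i for i, w in enumerate(vocab)}
    Y = np.zeros((len(vocab), len(docs)), dtype=int)
    for j, toks in enumerate(tokenized):
        for w in toks:
            Y[index[w], j] += 1
    return vocab, Y

docs = ["the food was great", "great service but cold food"]
vocab, Y = bag_of_words(docs)
# Y has one row per unique word and one column per document
```

In practice, the raw counts would typically be reweighted (e.g., by tf-idf) before modeling, but the matrix layout is the same.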
Under LSA, topics are latent variables that manifest in the co-occurrence of particular words in a document. These latent variables are discovered with the use of SVD, decomposing the corpus matrix as \({\varvec{Y}} = {\varvec{X}}{\varvec{S}}{\varvec{B}}\), where \({\varvec{X}}\), \({\varvec{S}}\), and \({\varvec{B}}\) are \(W\times T\), \(T\times T\), and \(T\times D\), respectively [5]. \({\varvec{X}}\) has orthonormal columns and \({\varvec{B}}\) has orthonormal rows, while \({\varvec{S}}\) is a diagonal matrix containing the singular values of \({\varvec{Y}}\). SVD maps the original word-document scores in matrix \({\varvec{Y}}\) onto a lower-dimensional latent space, such that the documents are represented by a set of linearly independent base vectors [9].
\({\varvec{X}}\) is interpreted as a dictionary matrix, where \({X}_{ik}\) gives the level of importance of word \(i\) to topic \(k\). Meanwhile, \({\varvec{B}}\) is interpreted as the topic distribution matrix, where \({B}_{kj}\) gives the level of expression of topic \(k\) in document \(j\). A word that has a stronger association with a certain topic will have a higher value in \({\varvec{X}}\). Likewise, a document that contains more words related to a certain topic will have a higher value in \({\varvec{B}}\).
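The LSA decomposition described above can be sketched with a plain truncated SVD (a minimal NumPy illustration, not the authors' implementation):

```python
import numpy as np

def lsa(Y, T):
    """Truncated SVD of the W x D matrix Y into rank-T factors.

    Returns X (W x T word-topic loadings), S (T x T diagonal matrix of
    singular values), and B (T x D document loadings), so that
    X @ S @ B approximates Y.
    """
    U, s, Vt = np.linalg.svd(Y, full_matrices=False)
    X = U[:, :T]          # orthonormal columns: word space
    S = np.diag(s[:T])    # top-T singular values
    B = Vt[:T, :]         # orthonormal rows: document space
    return X, S, B

rng = np.random.default_rng(0)
Y = rng.poisson(1.0, size=(50, 20)).astype(float)
X, S, B = lsa(Y, T=5)
approx = X @ S @ B        # rank-5 reconstruction of Y
```

With \(T\) equal to the full rank, the product recovers \({\varvec{Y}}\) exactly; smaller \(T\) gives the low-rank latent-space approximation.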
PLSA takes a more probabilistic approach to topic modelling [6]. Letting \(d\) and \(w\) be indices referring, respectively, to a specific document and word, and \(t\) an index for topic, the joint probability of observing a word and a document is given by \(P(d,w)\) as
$$ P\left( {d,w} \right) = P\left( d \right)\mathop \sum \limits_{{t \in \left\{ {1,2, \ldots ,T} \right\}}} P\left( {w{|}t} \right)P\left( {t{|}d} \right) = \mathop \sum \limits_{{t \in \left\{ {1,2, \ldots ,T} \right\}}} P\left( t \right)P\left( {d{|}t} \right)P\left( {w{|}t} \right) $$
(2)
The probability model in (2) is analogous to LSA. \(P(d,w)\) can be viewed as a \(D\times W\) matrix similar to the matrix \({\varvec{Y}}\) of word scores, while \(P(t)\), \(P(d|t)\), and \(P(w|t)\) are analogous to the matrices \({\varvec{S}}\), \({\varvec{B}}\), and \({\varvec{X}}\). Instead of using a matrix decomposition procedure like SVD, PLSA is estimated using the EM algorithm; see [6] for further details.
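A minimal EM loop for PLSA might look as follows (an illustrative sketch of the standard updates for \(P(w|t)\) and \(P(t|d)\); tempering and convergence checks are omitted):

```python
import numpy as np

def plsa(Y, T, n_iter=50, seed=0):
    """Minimal EM estimation of PLSA on a W x D count matrix Y.

    Returns P_wt (W x T, the topic-word distributions P(w|t)) and
    P_td (T x D, the per-document topic mixtures P(t|d)).
    """
    rng = np.random.default_rng(seed)
    W, D = Y.shape
    P_wt = rng.random((W, T)); P_wt /= P_wt.sum(axis=0)
    P_td = rng.random((T, D)); P_td /= P_td.sum(axis=0)
    for _ in range(n_iter):
        # E-step: responsibilities P(t | d, w), shape (W, T, D)
        resp = P_wt[:, :, None] * P_td[None, :, :]
        resp /= resp.sum(axis=1, keepdims=True) + 1e-12
        # M-step: re-estimate the conditionals from expected counts
        weighted = Y[:, None, :] * resp        # n(w, d) * P(t|d, w)
        P_wt = weighted.sum(axis=2)
        P_wt /= P_wt.sum(axis=0, keepdims=True) + 1e-12
        P_td = weighted.sum(axis=0)
        P_td /= P_td.sum(axis=0, keepdims=True) + 1e-12
    return P_wt, P_td
```

The dense (W, T, D) responsibility array is for clarity only; practical implementations iterate over the nonzero entries of \({\varvec{Y}}\).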
PLSA, however, lacks a probabilistic model at the level of documents [7]. There is no proper generative model for what produces \(P\left(t|d\right)\) in Eq. (2). Treating \(d\) as an index implies that the number of parameters to be estimated grows linearly with the number of documents in the corpus. At the same time, supposing there are \(d\in \left\{1,2,\dots ,D\right\}\) documents in the corpus at training time, the model has no way of handling further, yet-unseen documents at testing time, since any indices \(d\in \{D+1,D+2,\dots \}\) fall outside the bounds of the model.
LDA corrects this problem by redefining topic expressions as arising from a Dirichlet prior [7]. A document is re-expressed as a sequence of words, \({\varvec{w}}=\left\{{w}_{1},{w}_{2},\dots ,{w}_{{W}_{d}}\right\}\), where \({W}_{d}\) is the number of words in the sequence defining the \(d\)th document. The generative model for this sequence is as follows:
1. Draw the document length \({W}_{d}\sim Poisson(\eta )\)
2. Draw the topic mixture \(\theta \sim Dirichlet(\alpha )\)
3. For each of the \({W}_{d}\) words \({w}_{j}\):
   a. Choose a topic \({z}_{k}\sim Multinomial(\theta )\)
   b. Choose a word \({w}_{j}\) from the multinomial \(P\left({w}_{j}|{z}_{k},\beta \right)\)
Given the parameters \(\alpha \) and \(\beta \), inference can then be done using the joint distribution
$$ P\left( {\theta ,{\varvec{z}},{\varvec{w}} | \alpha , \beta } \right) = P(\theta |\alpha )\mathop \prod \limits_{k = 1}^{{W_{d} }} P\left( {z_{k} {|}\theta } \right)P\left( {w_{k} {|}z_{k} ,\beta } \right) $$
(3)
The distribution \(\theta \) refers to the vector of topic expressions per document and is analogous to a column of the \({\varvec{B}}\) matrix in LSA. LDA’s advantage over PLSA is that in this formulation the topic distribution \(\theta \), unlike \(P(t|d)\), is governed by a prior trained once from the data. After the parameters of the Dirichlet distribution have been trained, and the posterior probability suitably estimated, LDA faces no problems in evaluating new documents [7]. In the event that the trained model encounters a previously unseen document, it simply generates a random sample from the trained posterior [10].
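The generative process above can be sketched as a sampler for a single document (an illustration of the model only; `eta`, `alpha`, and `beta` are assumed inputs, with `beta` given directly as a topic-by-vocabulary matrix of word probabilities):

```python
import numpy as np

def generate_lda_document(alpha, beta, eta=20, rng=None):
    """Sample one document from the LDA generative model.

    alpha: length-T Dirichlet parameter over topics.
    beta:  T x V matrix; row t is the word distribution P(w | z = t).
    eta:   mean document length for the Poisson draw.
    Returns the per-document topic mixture theta and the word indices.
    """
    rng = rng or np.random.default_rng()
    W_d = max(1, rng.poisson(eta))            # document length
    theta = rng.dirichlet(alpha)              # topic mixture for this doc
    topics = rng.choice(len(alpha), size=W_d, p=theta)
    words = np.array([rng.choice(beta.shape[1], p=beta[z]) for z in topics])
    return theta, words
```

Inference then works in the opposite direction, recovering \(\theta \) and \(\beta \) from observed word sequences via the posterior in Eq. (3).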
The authors of [8] argue that this particular approach cannot properly take into account the sparsity that ought to be present in realistic settings, due to the normed nature of its output (topic expressions are probabilities that must sum to 1). Other criticisms leveled at LDA include its instability and issues with replicability [11]. Researchers have found issues reproducing its results [12, 13], and in certain cases LDA seems to offer little to no improvement over either PLSA or LSA in terms of empirical performance on real data. LSA has been demonstrated to outperform both LDA and PLSA in terms of accuracy [13], especially on small corpora (150 documents), as [14] and [7] both recommended the use of 1000–3000 documents for training LDA.
Despite the continuing widespread adoption of LDA in topic modeling research, the literature has since progressed to more advanced methodologies that address these known limitations of legacy topic models. An entire class of these new methods is motivated by replacing the standard vector space model, which treats words as atomic units, or exchangeable vector components, with word embeddings. Word embeddings represent words as vectors whose distances can be measured in reference to other word vectors [15]. In consequence, words that are contextually similar (e.g., “food” and “eat”) or syntactically related (“thank” and “you”) appear with much more proximity in the vector space. Word embeddings have been shown to perform better in assessing syntax and semantic relationships between words in corpora in comparison to other neural network-based methods.
An LDA implementation built on top of word embeddings (lda2vec) has been demonstrated to produce topics with much better syntactic and semantic cohesion, as the contextual relationships between words are preserved by the embeddings [16]. A similar reformulation of LDA based on word embeddings places multivariate Gaussian distributions over the embedding space and estimates the model using a collapsed Gibbs sampler. This Gaussian LDA is found to produce topics similar to those of standard LDA, but has the advantage of properly accounting for out-of-vocabulary words: words that were not part of the corpus at training time, but are encountered in the hold-out set [17]. Finally, a nonparametric method of word-embeddings-based topic modeling that combines hierarchical Dirichlet processes with word densities regularized onto the unit sphere via the von Mises-Fisher distribution provides a way to flexibly discover the number of topics in an unsupervised manner [18].

3 Semiparametric Topic Modeling (SemiparTM)

This section presents the detailed formulation of SemiparTM. We introduce the two stages of the method, along with the cross-validation process for tuning its shrinkage penalty.
Stage 1: Factorization. Consider the corpus we formulated in (1), containing raw word counts. Suppose that the number of topics expected in the corpus, \(T\), is known beforehand. We wish to decompose the matrix \({\varvec{Y}}\) into two matrices \({\varvec{X}}\) and \({\varvec{B}}\), of dimensions \(W\times T\) and \(T\times D\) respectively, where the former is a matrix of topic distributions over words (how likely a word is to appear under a certain topic) and the latter is a matrix of topic expressions over documents (how much of a topic is present in a document). Factorization of \({\varvec{Y}}\) then becomes a matter of finding appropriate matrices \({\varvec{X}}\) and \({\varvec{B}}\), both nonnegative, that minimize the objective function
$$ \left( {\user2{\hat{X}},\user2{\hat{B}}} \right) = \arg \min \left\| {\user2{Y} - \user2{XB}} \right\|_{2}^{2} + \zeta \left( {\left\| \user2{X} \right\|_{1} + \left\| \user2{B} \right\|_{1} } \right) $$
(4)
where \( \left\| \cdot \right\|_{2}^{2} \) represents the squared \({L}_{2}\) norm while \( \left\| \cdot \right\|_{1} \) represents the \({L}_{1}\) norm. Minimization of (4) is done through a multiplicative gradient descent algorithm similar to that of [19], with the update rules presented in Eqs. (5) and (6).
$$ X_{{ik}} = X_{{ik}} \frac{{\left( {YB^{\prime } } \right)_{{ik}} }}{{\left( {XBB^{\prime } } \right)_{{ik}} + \zeta }} $$
(5)
$$ B_{{kj}} = B_{{kj}} \frac{{\left( {X^{\prime } Y} \right)_{{kj}} }}{{\left( {X^{\prime } XB} \right)_{{kj}} + \zeta }} $$
(6)
The constant \(\zeta \ge 0\) is a tuning parameter controlling the level of sparsity in the estimated matrices \({\varvec{X}}\) and \({\varvec{B}}\). Higher values of \(\zeta \) would lead to more components of the two matrices being populated by zeroes. We simplify this parameter to apply to both \({\varvec{X}}\) and \({\varvec{B}}\), though the optimization may also be performed giving different sparsity penalties for either (e.g., \({\zeta }_{1}\) for \({\varvec{X}}\), \({\zeta }_{2}\) for \({\varvec{B}}\)). The sparsity penalty is supposed to represent what [8] argue must be the natural behavior of topic distributions: that topics are defined by the co-occurrence of very few, specific words, while documents can only have a very narrow set of topics. The value of \(\zeta \) can be set to any fixed nonnegative constant, or optimized via cross-validation as follows:
1. Choose a grid of candidate values for \(\zeta \).
2. Divide the matrix \({\varvec{Y}}\) randomly into \(K\) equal folds, such that \({{\varvec{Y}}}^{(1)},{{\varvec{Y}}}^{(2)},\dots ,{{\varvec{Y}}}^{(K)}\) form partitions of \({\varvec{Y}}\).
3. For each candidate \(\zeta \) and each fold \(k=1,2,\dots ,K\), take the fold \({{\varvec{Y}}}^{\left(k\right)}\) as the training set \({{\varvec{Y}}}_{train}\) and combine the remaining \(K-1\) folds to form \({{\varvec{Y}}}_{test}\).
   a. Perform the nonnegative factorization with penalty \(\zeta \) on \({{\varvec{Y}}}_{train}\) to obtain \({{\varvec{X}}}_{train}\) and \({{\varvec{B}}}_{train}\)
   b. Estimate \({{\varvec{B}}}_{test}\) via the semiparametric regression stage on \({{\varvec{Z}}}_{test}\)
   c. Measure the resulting error \({E}_{k}\left(\zeta \right)={\left|\left|{{\varvec{Y}}}_{test}-{{\varvec{X}}}_{train}{{\varvec{B}}}_{test}\right|\right|}_{2}^{2}\)
4. Take the \(\zeta \) that minimizes the average error \(\frac{1}{K}\sum_{k=1}^{K}{E}_{k}\left(\zeta \right)\).
 
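The stage-1 multiplicative updates in Eqs. (5) and (6) can be sketched as follows (a minimal illustration with a fixed penalty \(\zeta \) rather than the cross-validated choice; the small epsilon guard is an implementation detail, not part of the paper's formulation):

```python
import numpy as np

def sparse_nmf(Y, T, zeta=1.0, n_iter=200, seed=0):
    """Nonnegative factorization Y ~ X @ B with an L1 sparsity penalty.

    Multiplicative updates in the style of Eqs. (5)-(6): the penalty
    zeta is added to each denominator, shrinking small entries of X
    and B toward zero.
    """
    rng = np.random.default_rng(seed)
    W, D = Y.shape
    X = rng.random((W, T))
    B = rng.random((T, D))
    eps = 1e-12                  # guard against division by zero
    for _ in range(n_iter):
        X *= (Y @ B.T) / (X @ B @ B.T + zeta + eps)
        B *= (X.T @ Y) / (X.T @ X @ B + zeta + eps)
    return X, B
```

Because the updates are multiplicative and the initial factors are positive, \({\varvec{X}}\) and \({\varvec{B}}\) remain nonnegative throughout.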
Stage 2: Regression. Aside from the word frequency matrix \({\varvec{Y}}\), we also assume that there exists, alongside the corpus, a data matrix \({\varvec{Z}}\) containing external information about the documents themselves. Suppose there are \(p\) of these external variables, \({Z}_{1},{Z}_{2},\dots ,{Z}_{p}\). The second stage of the topic modelling procedure models the topic distributions of the documents against these external variables using semiparametric regression, with the topic expressions as dependent variables.
For a set of \(D\) documents, let \({b}^{(t)}\) denote a single row of the matrix \({\varvec{B}}\), where \(t\) is any one of the \(T\) topics generated. A regression equation can then be placed over \({b}^{(t)}\) such that for the \(i\)th document:
$${b}_{i}^{(t)}={\beta }_{0}^{\left(t\right)}+\sum_{l=1}^{p}{f}_{l}^{\left(t\right)}\left({Z}_{il}\right)+{\varepsilon }_{i}$$
(7)
where \({\varepsilon }_{i}\sim Normal\left(0,1\right)\) and each \({f}_{l}^{\left(t\right)}\) is some function describing the relationship of external variable \({Z}_{l}\) with the topic \(t\), as represented by the values of \({b}^{(t)}\). The semiparametric framework allows any regression method to be used to estimate \({f}_{l}^{\left(t\right)}\); for the purposes of this paper, we use B-splines.
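The regression stage can be sketched as an additive spline fit per topic. This is an illustration only: for simplicity, a hand-built cubic truncated-power spline basis stands in for the B-splines used in the paper, and the additive model is fit jointly by least squares:

```python
import numpy as np

def spline_basis(z, n_knots=4, degree=3):
    """Cubic truncated-power spline basis for one covariate (a simple
    stand-in for a B-spline basis)."""
    knots = np.quantile(z, np.linspace(0, 1, n_knots + 2)[1:-1])
    cols = [z ** d for d in range(1, degree + 1)]
    cols += [np.clip(z - k, 0, None) ** degree for k in knots]
    return np.column_stack(cols)

def fit_topic_regression(b_t, Z):
    """Regress one row of B (the topic expressions b^(t)) on the
    external variables Z (D x p), additively over per-variable bases."""
    design = np.column_stack(
        [np.ones(len(b_t))] +
        [spline_basis(Z[:, l]) for l in range(Z.shape[1])])
    coef, *_ = np.linalg.lstsq(design, b_t, rcond=None)
    return design, coef

# fitted topic expressions for the documents: design @ coef
```

For a new document, the fitted functions are evaluated at its \({Z}_{l}\) values to predict its topic expressions, which is exactly the role this stage plays for the holdout corpus.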
What follows is a brief note on the conceptual justification for this regression stage. PLSA is over-parametrized since it has no generative model for the distribution of topics in a particular document [7]. The regression stage resolves this by replacing \(P\left(t|d\right)\) in PLSA with a function that provides the values of the topic expressions. In place of a conditional probability on the index, we identify \(p\) external variables \({\varvec{Z}}\), referring to measurements and characteristics of document \(d\) that directly affect which topics will manifest in the document. Assuming these variables exist, the function for the per-document topic expressions is then revised into
$$P\left(t|d\right)\propto h\left(t|{Z}_{d1},{Z}_{d2},\dots ,{Z}_{dp}\right)$$
(8)

4 Simulation Study

We discuss the design of the simulation study through which the performance of SemiparTM will be compared with those of the legacy methods across a range of simulation scenarios. Evaluation procedures and metrics are also outlined in this section.
The data generation process begins with defining the external variables that will determine topic distributions across documents:
$${Z}_{1}\sim Po\left(1\right),\quad {Z}_{2}\sim N\left(20,7\right),\quad {Z}_{3}\sim Be\left(0.8\right),\quad {Z}_{4}\sim Beta\left(6,2\right),\quad {Z}_{5}\sim Beta\left(10,2\right)$$
(9)
Assuming \(T=10\) topics, the score \({b}_{k}\) of each document on the \(k\)th topic is determined by a hidden model \({b}_{0k}\), a function of \({\varvec{Z}}\), filtered by
$$ b_{k} = \left\{ {\begin{array}{*{20}l} {b_{0k} ,} \hfill & {with\,\,probability\,\, 1 - s} \hfill \\ {0, } \hfill & {with\,\,probability \,\,s} \hfill \\ \end{array} } \right. $$
(10)
where \(s\) is the data sparsity parameter, dictating what proportion of the topic and dictionary matrices is expected to be populated with zero cells. The topic models \({b}_{0k}\) are specified differently, with the first four taking on linear forms (11)–(14), the next three nonlinear forms (15)–(17), and the remaining three having correlations with other \({b}_{k}\) topic scores (18)–(20).
$${b}_{01}=-1+{Z}_{1}+0.2{Z}_{2}+{Z}_{3}-0.9{Z}_{4}-2{Z}_{5}+m\epsilon $$
(11)
$${b}_{02}=3+1.5{Z}_{1}+0.15{Z}_{2}-5{Z}_{3}-5{Z}_{5}+m\epsilon $$
(12)
$${b}_{03}=2+0.2{Z}_{2}-1.4{Z}_{1}+m\epsilon $$
(13)
$${b}_{04}=1.6{Z}_{1}+8{Z}_{3}-9{Z}_{4}+m\epsilon $$
(14)
$${b}_{05}=\frac{{Z}_{1}^{2}}{5{Z}_{5}}+m\epsilon $$
(15)
$${b}_{06}=6\text{sin}\left({Z}_{5}{Z}_{1}\right)+m\epsilon $$
(16)
$${b}_{07}=2+3{Z}_{1}{Z}_{4}-2{Z}_{3}+m\epsilon $$
(17)
$${b}_{08}=1+10{Z}_{4}-2{b}_{3}+m\epsilon $$
(18)
$${b}_{09}=0.2{Z}_{2}+0.2{b}_{7}+m\epsilon $$
(19)
$${b}_{010}=-5+0.9{b}_{1}-1.2{b}_{7}+m\epsilon $$
(20)
where \(\epsilon \sim N\left(0,1\right)\). The value of \(m\) determines the level of misspecification error, as it inflates the variance of the error term and thus contaminates the model specification. In the simulations, \(m\) takes one of two values, \(m\in \left\{1,2\right\}\). Combining the rows \({b}_{k}\) results in the simulated matrix \({{\varvec{B}}}_{10\times D}={\left[{b}_{1}^{\prime},{b}_{2}^{\prime},\dots ,{b}_{10}^{\prime}\right]}^{\prime}\).
To simulate a sparse \({{\varvec{X}}}_{W\times 10}=\{{x}_{kj}\}\) matrix according to the same data sparsity parameter \(s\) used in (10), we generate random samples from a zero-inflated Poisson distribution as
$$ x_{kj} = \frac{1}{90}x_{kj}^{*} ,\quad x_{kj}^{*} \sim f\left( {x_{kj}^{*} } \right) = \left\{ {\begin{array}{*{20}l} {s + \left( {1 - s} \right)e^{ - 100} ,} \hfill & {x_{kj}^{*} = 0} \hfill \\ {\left( {1 - s} \right)\frac{{100^{x_{kj}^{*}} e^{ - 100} }}{{x_{kj}^{*} !}},} \hfill & { x_{kj}^{*} > 0} \hfill \\ \end{array} } \right. $$
(21)
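The data generation process in Eqs. (9)–(21) can be sketched as follows. For brevity, only two of the ten topic scores, Eqs. (11) and (15), are simulated here, and the zero-inflated Poisson in Eq. (21) is sampled directly by masking ordinary Poisson(100) draws:

```python
import numpy as np

rng = np.random.default_rng(42)
D, W, T = 150, 500, 10   # the full study uses T = 10 topics; two shown below
s, m = 0.70, 1           # data sparsity and misspecification level

# External variables, Eq. (9)
Z1 = rng.poisson(1, D)
Z2 = rng.normal(20, 7, D)
Z3 = rng.binomial(1, 0.8, D)
Z4 = rng.beta(6, 2, D)
Z5 = rng.beta(10, 2, D)

# Two example topic scores: linear, Eq. (11), and nonlinear, Eq. (15)
eps = lambda: m * rng.normal(0, 1, D)
b01 = -1 + Z1 + 0.2 * Z2 + Z3 - 0.9 * Z4 - 2 * Z5 + eps()
b05 = Z1 ** 2 / (5 * Z5) + eps()

# Sparsity filter, Eq. (10): zero out each entry with probability s
def sparsify(b0, s):
    return np.where(rng.random(b0.shape) < s, 0.0, b0)

B = np.vstack([sparsify(b, s) for b in (b01, b05)])

# Dictionary matrix, Eq. (21): zero-inflated Poisson(100), scaled by 1/90
X_star = np.where(rng.random((W, 2)) < s, 0, rng.poisson(100, (W, 2)))
X = X_star / 90.0

Y = X @ B                # simulated corpus for these two topics
```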
The process as presented here proceeds in a backwards direction, beginning with identifying the auxiliary information (the \({Z}_{l}\) variables in Eq. (9)) and then connecting them to the topic scores (Eqs. (11)–(20)), which will then produce the corpus \({\varvec{Y}}\). In practice, analysis begins with collecting the corpus \({\varvec{Y}}\) and using topic modeling to discover the topic and dictionary matrices. In SemiparTM, the regression stage will then be used to uncover the connection between the topic scores in \({\varvec{B}}\) with the \({Z}_{l}\) variables.
The analyst is tasked with identifying what variables are to be used for each \({Z}_{l}\). The distributions proposed in Eq. (9) represent only an example of what, returning to the restaurant company scenario presented in Sect. 1, may likely be the variables accessible to an analyst and that may determine the type of feedback the company will receive. The Poisson distribution assumed for \({Z}_{1}\) may represent the average number of times that a customer in the feedback database has sent a complaint to the company (on average, a customer may send only one complaint in their entire lifetime, but there exist certain customers that periodically send in feedback). Meanwhile, the normal distribution in \({Z}_{2}\) may represent the number of years that the restaurant or branch that received the complaint has been in operation (average of 20 years); \({Z}_{3}\) an indicator variable of whether that branch has more than 1 business channel (Dine In, Take Home, Delivery, Parties and Business Functions, Drive-Thru, etc.), while \({Z}_{4}\) and \({Z}_{5}\) are scores that the store received during a corporate audit.
These variables, in fact, represent actual variables that were used in the application of the proposed method to a corpus of customer feedback obtained from a food service establishment. The distributions and their parameters were set so as to approximate the possible distributions of these data (see histograms of actual data in Fig. 2). However, it should be noted that the proposed method is not limited to the above distributions, and these may be replaced (or supplemented) by other variables depending on availability and accessibility.
Simulations are performed across settings of the following parameters: corpus size, vocabulary size, number of underlying topics, and presence of topic correlations. These settings take into consideration known weaknesses of different methods noted in the literature.
Simulations are based on the combinations of each of the parameters in Table 1, including the two levels of the misspecification error \((m=1,2)\). The designations of “small”, “medium”, and “large” refer primarily to the size and dimensionality of the corpus. Meanwhile, the three sparsity levels can be interpreted as “low”, “medium”, and “high” sparsity scenarios. The simulation settings in Table 1 have been designed to capture specific characteristics of the corpus of real feedback data obtained for the application presented in Sect. 6. We discuss this corpus, and the results of applying SemiparTM and the legacy methods to it, in further detail in Sect. 6; with 253 documents and 844 unique words after cleaning, the corpus falls comfortably within the “small” category.
Table 1
Levels used for each simulation parameter

Parameter             Low     Mid     High
Documents (D)         150     1000    3000
Words (W)             500     1500    3500
Data sparsity (s)     0.70    0.90    0.99
The number of underlying topics for this simulation study is fixed; i.e., it is assumed that actual underlying latent topics exist in the corpus and that the analyst has prior knowledge of how many there are. Models are assessed based on how similar their estimates are to the true topic distribution, the specifics of which are discussed subsequently. This is intended to assess claims regarding the instability of LDA by looking into how consistently and closely it reconstructs a set of true topic distributions. This provides a much stronger argument for or against LDA, as the prior studies cited have thus far focused only on replicating results. A simulation study comparing true versus estimated topic structures from LDA and the other topic models allows for more objective comparisons between them.
Proximity between estimated and true topic structures is evaluated through the cosine similarity measure. Given the true matrix \({\varvec{X}}\) and its estimate \(\widehat{{\varvec{X}}}\) from one of the four methods, the sum in (22) represents the total cosine similarity between the true and estimated matrices, in effect the total cosine similarity of a method’s estimate to the true topic distribution. The same measure is used in estimating the difference between the true and estimated topic distribution matrix \({\varvec{B}}\), given in (23).
$$\rho \left({\varvec{X}},\widehat{{\varvec{X}}}\right)=\frac{1}{T}{\sum }_{i,j}\frac{{x}_{i} \cdot {\widehat{x}}_{j}}{\left|\left|{x}_{i}\right|\right|\times \left|\left|{\widehat{x}}_{j}\right|\right|}$$
(22)
$$\rho \left({\varvec{B}},\widehat{{\varvec{B}}}\right)=\frac{1}{T}{\sum }_{i,j}\frac{{b}_{i} \cdot {\widehat{b}}_{j}}{\left|\left|{b}_{i}\right|\right|\times \left|\left|{\widehat{b}}_{j}\right|\right|}$$
(23)
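The similarity measure in Eqs. (22) and (23) can be sketched as follows (a literal implementation of the sum over all column pairs, divided by \(T\); zero columns are guarded against, which is an implementation detail):

```python
import numpy as np

def total_cosine_similarity(M_true, M_hat):
    """Eqs. (22)/(23): sum of pairwise cosine similarities between the
    columns of the true and estimated matrices, divided by T."""
    def unit_cols(M):
        norms = np.linalg.norm(M, axis=0)
        return M / np.where(norms == 0, 1.0, norms)
    U, V = unit_cols(M_true), unit_cols(M_hat)
    T = M_true.shape[1]
    return (U.T @ V).sum() / T   # (U.T @ V)[i, j] = cos(x_i, x_hat_j)
```

Note that, because the measure is scale-invariant in each column, it compares the shape of the recovered topic structure rather than its magnitude.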

5 Results and Discussions

Three versions of SemiparTM are implemented, each at a different configuration of the shrinkage penalty: SemiparTM-1 uses a shrinkage penalty fixed at 1, SemiparTM-3 a penalty fixed at 3, and SemiparTM-cv the cross-validation algorithm outlined in Sect. 3 to adaptively set the penalty. We compare the performance of the methods across varying sizes (number of documents) and dimensionalities (number of unique words) of the corpus. Similarity scores are based on the cosine similarity measures and characterize agreement between the actual and estimated topic distributions.
Table 2 exhibits superior similarity scores for the three configurations of SemiparTM over the legacy methods on the topic distribution matrix, at least for the training corpus. PLSA is comparable to SemiparTM in a few document-word combinations. For the holdout corpus, similarity scores decline for all methods, but SemiparTM still manages to outperform the legacy methods for some document-word combinations. In the training corpus, modeling of topic distributions by SemiparTM is accomplished by the nonnegative matrix factorization stage (Stage 1), while on the holdout corpus it is done by the semiparametric regression stage (Stage 2). While similarity scores drop in the holdout corpus, performance can still be improved when the training corpus includes more documents, which in consequence means more documents for the regression stage to use in tuning its estimates, though this comes with some decline in the cosine similarity for the training corpus.
Table 2
Average cosine similarities on the topic distribution and dictionary matrices, across levels of document and word count. Column groups give the topic distribution matrix (training and holdout) and the dictionary matrix (training); sub-columns are vocabulary sizes (# words).

| Method | # Docs | Topic, training: 500 | 1500 | 3500 | Topic, holdout: 500 | 1500 | 3500 | Dictionary, training: 500 | 1500 | 3500 |
|---|---|---|---|---|---|---|---|---|---|---|
| LSA | 150 | 0.236 | 0.216 | 0.212 | 0.242 | 0.221 | 0.270 | 0.429 | 0.424 | 0.441 |
| LSA | 1000 | 0.239 | 0.250 | 0.232 | 0.264 | 0.287 | 0.182 | 0.408 | 0.393 | 0.389 |
| LSA | 3000 | 0.236 | 0.330 | 0.242 | 0.234 | 0.338 | 0.231 | 0.398 | 0.438 | 0.384 |
| PLSA | 150 | 0.770 | 0.732 | 0.806 | 0.102 | 0.134 | 0.169 | 0.618 | 0.572 | 0.632 |
| PLSA | 1000 | 0.631 | 0.631 | 0.582 | 0.088 | 0.109 | 0.053 | 0.318 | 0.318 | 0.283 |
| PLSA | 3000 | 0.661 | 0.602 | 0.702 | 0.083 | 0.082 | 0.086 | 0.358 | 0.312 | 0.404 |
| LDA | 150 | 0.224 | 0.182 | 0.171 | 0.234 | 0.174 | 0.211 | 0.417 | 0.375 | 0.392 |
| LDA | 1000 | 0.234 | 0.230 | 0.219 | 0.255 | 0.290 | 0.174 | 0.309 | 0.298 | 0.321 |
| LDA | 3000 | 0.218 | 0.231 | 0.221 | 0.219 | 0.218 | 0.199 | 0.289 | 0.340 | 0.305 |
| SemiparTM-1 | 150 | 0.696 | 0.736 | 0.747 | 0.190 | 0.182 | 0.271 | 0.601 | 0.559 | 0.606 |
| SemiparTM-1 | 1000 | 0.700 | 0.665 | 0.675 | 0.263 | 0.264 | 0.155 | 0.385 | 0.332 | 0.352 |
| SemiparTM-1 | 3000 | 0.592 | 0.647 | 0.622 | 0.210 | 0.210 | 0.172 | 0.285 | 0.343 | 0.314 |
| SemiparTM-3 | 150 | 0.696 | 0.736 | 0.747 | 0.190 | 0.182 | 0.271 | 0.601 | 0.559 | 0.606 |
| SemiparTM-3 | 1000 | 0.700 | 0.665 | 0.675 | 0.263 | 0.264 | 0.155 | 0.385 | 0.332 | 0.352 |
| SemiparTM-3 | 3000 | 0.592 | 0.647 | 0.622 | 0.210 | 0.210 | 0.172 | 0.285 | 0.343 | 0.314 |
| SemiparTM-cv | 150 | 0.690 | 0.736 | 0.739 | 0.104 | 0.096 | 0.117 | 0.601 | 0.559 | 0.605 |
| SemiparTM-cv | 1000 | 0.698 | 0.664 | 0.667 | 0.295 | 0.254 | 0.181 | 0.384 | 0.331 | 0.350 |
| SemiparTM-cv | 3000 | 0.584 | 0.626 | 0.622 | 0.450 | 0.410 | 0.386 | 0.276 | 0.332 | 0.309 |
Increasing the size of the training corpus from 150 to 3000 documents decreased the average similarity score of the cross-validated SemiparTM from 0.690 to 0.584 (at a fixed vocabulary of 500 words). On the holdout corpus, however, the same change increased the score from 0.104 to 0.450, nearly closing the gap in average similarity between the two corpora.
Similar patterns appear in the similarity scores for the estimated dictionary matrices, although here no comparison is made between the training and holdout corpora because the dictionary is held fixed for both. Increasing the number of documents diminishes estimation quality in terms of cosine similarity, though for both the topic distribution and dictionary matrices the decrease from 1000 to 3000 documents is minimal compared with that from 150 to 1000.
The behavior of SemiparTM across levels of corpus sparsity is compared with the legacy methods for both the training and holdout corpora in Table 3. For the topic distribution matrix, the results echo Table 2: the SemiparTM configurations outperform the legacy methods on the training corpus, though here PLSA yields similarity measures more comparable to SemiparTM. Increasing the level of sparsity does not much burden either PLSA or the SemiparTM configurations on the training corpus. Again, the narrative changes on the holdout corpus, where SemiparTM not only drops in performance but also appears extremely sensitive to increasing sparsity. SemiparTM-cv, especially, nearly zeroes out with a corpus of 3000 documents at 0.99 sparsity. The other two SemiparTM configurations fare significantly better and remain at levels comparable with the legacy methods, which at this point are admittedly low for all methods.
Table 3
Average cosine similarities on the topic distribution and dictionary matrices, across levels of document count and sparsity level. Sub-columns are sparsity levels.

| Method | # Docs | Topic, training: 0.70 | 0.90 | 0.99 | Topic, holdout: 0.70 | 0.90 | 0.99 | Dictionary, training: 0.70 | 0.90 | 0.99 |
|---|---|---|---|---|---|---|---|---|---|---|
| LSA | 150 | 0.382 | 0.221 | 0.061 | 0.426 | 0.259 | 0.048 | 0.555 | 0.415 | 0.323 |
| LSA | 1000 | 0.418 | 0.233 | 0.070 | 0.415 | 0.243 | 0.074 | 0.525 | 0.356 | 0.309 |
| LSA | 3000 | 0.415 | 0.236 | 0.062 | 0.423 | 0.259 | 0.018 | 0.512 | 0.364 | 0.312 |
| PLSA | 150 | 0.775 | 0.733 | 0.800 | 0.250 | 0.131 | 0.025 | 0.845 | 0.671 | 0.307 |
| PLSA | 1000 | 0.676 | 0.588 | 0.580 | 0.148 | 0.084 | 0.018 | 0.520 | 0.269 | 0.129 |
| PLSA | 3000 | 0.740 | 0.616 | 0.620 | 0.168 | 0.078 | 0.004 | 0.528 | 0.281 | 0.264 |
| LDA | 150 | 0.323 | 0.189 | 0.065 | 0.337 | 0.215 | 0.067 | 0.582 | 0.398 | 0.204 |
| LDA | 1000 | 0.377 | 0.234 | 0.073 | 0.392 | 0.244 | 0.083 | 0.455 | 0.281 | 0.192 |
| LDA | 3000 | 0.368 | 0.229 | 0.064 | 0.388 | 0.247 | 0.012 | 0.463 | 0.250 | 0.182 |
| SemiparTM-1 | 150 | 0.774 | 0.682 | 0.721 | 0.382 | 0.173 | 0.088 | 0.840 | 0.667 | 0.258 |
| SemiparTM-1 | 1000 | 0.722 | 0.647 | 0.672 | 0.381 | 0.234 | 0.067 | 0.505 | 0.332 | 0.232 |
| SemiparTM-1 | 3000 | 0.695 | 0.585 | 0.532 | 0.393 | 0.218 | 0.003 | 0.500 | 0.230 | 0.163 |
| SemiparTM-3 | 150 | 0.774 | 0.682 | 0.721 | 0.382 | 0.173 | 0.088 | 0.840 | 0.667 | 0.258 |
| SemiparTM-3 | 1000 | 0.722 | 0.647 | 0.672 | 0.381 | 0.234 | 0.067 | 0.505 | 0.332 | 0.232 |
| SemiparTM-3 | 3000 | 0.695 | 0.585 | 0.532 | 0.393 | 0.218 | 0.003 | 0.500 | 0.230 | 0.163 |
| SemiparTM-cv | 150 | 0.762 | 0.682 | 0.721 | 0.249 | 0.065 | 0.001 | 0.839 | 0.667 | 0.258 |
| SemiparTM-cv | 1000 | 0.712 | 0.646 | 0.672 | 0.594 | 0.135 | 0.001 | 0.502 | 0.331 | 0.232 |
| SemiparTM-cv | 3000 | 0.681 | 0.574 | 0.531 | 0.995 | 0.310 | 0.000 | 0.489 | 0.217 | 0.161 |
Once more, no comparison is made with the holdout corpus for the dictionary matrix, because these values are retained between the training and holdout corpora. The SemiparTM configurations, along with PLSA, yielded higher cosine similarity scores than either LSA or LDA in small corpus sizes with low to medium sparsity (0.70–0.90), but with bigger corpora or higher levels of sparsity this difference quickly closes.
This next set of results analyzes the impact of misspecification errors on the methods. It has been observed above that SemiparTM, unlike the legacy methods, can leverage an increasing number of training documents to improve estimation on the holdout corpus. Thus, in Table 4, while cosine similarity drops between the training and holdout corpora (a drop also observed for PLSA, which yielded high similarity scores on the training corpus), SemiparTM's holdout similarity scores actually increase as the size of the training corpus grows. This is a primary benefit of the semiparametric regression stage.
Table 4
Average cosine similarities on the topic distribution and dictionary matrices, across levels of document count and misspecification error. Sub-columns indicate whether misspecification error is absent (none) or present.

| Method | # Docs | Topic, training: none | present | Topic, holdout: none | present | Dictionary, training: none | present |
|---|---|---|---|---|---|---|---|
| LSA | 150 | 0.228 | 0.215 | 0.247 | 0.241 | 0.431 | 0.432 |
| LSA | 1000 | 0.248 | 0.233 | 0.249 | 0.239 | 0.398 | 0.395 |
| LSA | 3000 | 0.251 | 0.242 | 0.253 | 0.236 | 0.390 | 0.410 |
| PLSA | 150 | 0.786 | 0.753 | 0.136 | 0.134 | 0.612 | 0.603 |
| PLSA | 1000 | 0.611 | 0.618 | 0.089 | 0.077 | 0.316 | 0.296 |
| PLSA | 3000 | 0.652 | 0.666 | 0.086 | 0.080 | 0.359 | 0.357 |
| LDA | 150 | 0.211 | 0.174 | 0.225 | 0.187 | 0.396 | 0.393 |
| LDA | 1000 | 0.239 | 0.217 | 0.248 | 0.231 | 0.306 | 0.313 |
| LDA | 3000 | 0.225 | 0.215 | 0.224 | 0.208 | 0.301 | 0.296 |
| SemiparTM-1 | 150 | 0.749 | 0.703 | 0.220 | 0.209 | 0.589 | 0.588 |
| SemiparTM-1 | 1000 | 0.680 | 0.680 | 0.236 | 0.218 | 0.365 | 0.347 |
| SemiparTM-1 | 3000 | 0.592 | 0.616 | 0.210 | 0.199 | 0.289 | 0.305 |
| SemiparTM-3 | 150 | 0.749 | 0.703 | 0.220 | 0.209 | 0.589 | 0.588 |
| SemiparTM-3 | 1000 | 0.680 | 0.680 | 0.236 | 0.218 | 0.365 | 0.347 |
| SemiparTM-3 | 3000 | 0.592 | 0.616 | 0.210 | 0.199 | 0.289 | 0.305 |
| SemiparTM-cv | 150 | 0.746 | 0.698 | 0.112 | 0.099 | 0.589 | 0.587 |
| SemiparTM-cv | 1000 | 0.675 | 0.678 | 0.279 | 0.208 | 0.365 | 0.344 |
| SemiparTM-cv | 3000 | 0.586 | 0.604 | 0.458 | 0.412 | 0.280 | 0.298 |
Even with misspecification error in the true models relating each topic to the auxiliary information, the SemiparTM configurations do not exhibit large declines in similarity scores on the holdout topic distribution matrix. For the cross-validated configuration, the highest similarity score in the large corpus under misspecification error (\(m = 2\)) is only slightly below its counterpart without misspecification error. The nonparametric nature of the regression models used in SemiparTM contributes to this robustness to misspecification error. Thus, when the analyst is constrained by limited options for auxiliary variables to include, SemiparTM's robustness to misspecification errors offers a ready resolution.
Even the nonnegative matrix factorization step, which produces the dictionary matrix being assessed, handled misspecification errors well. At least in the small corpus setting, the SemiparTM configurations consistently matched PLSA in attaining the highest similarity scores under both levels of misspecification error.
Similar observations can be made in Table 5. Only at larger vocabulary sizes does the second stage of the method see decreases in overall cosine similarity on the topic distribution matrix. Even so, these decreases still leave the SemiparTM configurations at levels comparable with the legacy methods, especially LDA. In fact, in the small corpus setting, SemiparTM-cv still yielded the highest similarity scores even in the presence of misspecification error.
Table 5
Average cosine similarities on the topic distribution and dictionary matrices, across levels of vocabulary size and misspecification error. Sub-columns indicate whether misspecification error is absent (none) or present.

| Method | # Words | Topic, training: none | present | Topic, holdout: none | present | Dictionary, training: none | present |
|---|---|---|---|---|---|---|---|
| LSA | 500 | 0.244 | 0.231 | 0.252 | 0.242 | 0.402 | 0.422 |
| LSA | 1500 | 0.246 | 0.233 | 0.259 | 0.259 | 0.415 | 0.406 |
| LSA | 3500 | 0.231 | 0.217 | 0.234 | 0.218 | 0.416 | 0.409 |
| PLSA | 500 | 0.690 | 0.685 | 0.096 | 0.086 | 0.441 | 0.422 |
| PLSA | 1500 | 0.681 | 0.667 | 0.116 | 0.120 | 0.439 | 0.427 |
| PLSA | 3500 | 0.697 | 0.692 | 0.114 | 0.103 | 0.455 | 0.450 |
| LDA | 500 | 0.236 | 0.214 | 0.247 | 0.225 | 0.345 | 0.332 |
| LDA | 1500 | 0.221 | 0.195 | 0.241 | 0.220 | 0.326 | 0.347 |
| LDA | 3500 | 0.213 | 0.182 | 0.210 | 0.176 | 0.352 | 0.351 |
| SemiparTM-1 | 500 | 0.656 | 0.668 | 0.229 | 0.213 | 0.422 | 0.426 |
| SemiparTM-1 | 1500 | 0.717 | 0.674 | 0.226 | 0.218 | 0.439 | 0.433 |
| SemiparTM-1 | 3500 | 0.714 | 0.692 | 0.218 | 0.201 | 0.472 | 0.455 |
| SemiparTM-3 | 500 | 0.656 | 0.668 | 0.229 | 0.213 | 0.422 | 0.426 |
| SemiparTM-3 | 1500 | 0.717 | 0.674 | 0.226 | 0.218 | 0.439 | 0.433 |
| SemiparTM-3 | 3500 | 0.714 | 0.692 | 0.218 | 0.201 | 0.472 | 0.455 |
| SemiparTM-cv | 500 | 0.650 | 0.665 | 0.317 | 0.249 | 0.416 | 0.424 |
| SemiparTM-cv | 1500 | 0.715 | 0.671 | 0.206 | 0.186 | 0.439 | 0.430 |
| SemiparTM-cv | 3500 | 0.708 | 0.683 | 0.186 | 0.155 | 0.473 | 0.452 |
Table 6 explores cosine similarities across the joint impact of sparsity and misspecification. It has been noted that SemiparTM is sensitive to sparsity but protected against significant performance declines due to misspecification in the regression models, and both tendencies are visible here. For both the training and holdout corpora, while cosine similarity declines from top to bottom (increasing sparsity), the differences with and without misspecification error tend to be small. That the pattern of values is similar for all methods suggests that, across levels of sparsity and misspecification error, SemiparTM performs at comparably the same level as the legacy methods; it is with respect to dimensionality and corpus size that this general performance really begins to be differentiated. In particular, the simulation results repeatedly show that SemiparTM tends to be better suited to modeling small corpora.
Table 6
Average cosine similarities on the topic distribution and dictionary matrices, across levels of sparsity and misspecification error. Sub-columns indicate whether misspecification error is absent (none) or present.

| Method | Sparsity | Topic, training: none | present | Topic, holdout: none | present | Dictionary, training: none | present |
|---|---|---|---|---|---|---|---|
| LSA | 0.70 | 0.411 | 0.395 | 0.431 | 0.411 | 0.532 | 0.537 |
| LSA | 0.90 | 0.238 | 0.220 | 0.255 | 0.250 | 0.382 | 0.381 |
| LSA | 0.99 | 0.068 | 0.062 | 0.054 | 0.054 | 0.313 | 0.318 |
| PLSA | 0.70 | 0.726 | 0.730 | 0.197 | 0.189 | 0.658 | 0.649 |
| PLSA | 0.90 | 0.648 | 0.656 | 0.108 | 0.097 | 0.448 | 0.421 |
| PLSA | 0.99 | 0.694 | 0.659 | 0.018 | 0.018 | 0.227 | 0.226 |
| LDA | 0.70 | 0.385 | 0.322 | 0.404 | 0.334 | 0.515 | 0.501 |
| LDA | 0.90 | 0.217 | 0.212 | 0.233 | 0.232 | 0.318 | 0.327 |
| LDA | 0.99 | 0.073 | 0.063 | 0.066 | 0.060 | 0.192 | 0.199 |
| SemiparTM-1 | 0.70 | 0.740 | 0.736 | 0.397 | 0.371 | 0.645 | 0.635 |
| SemiparTM-1 | 0.90 | 0.651 | 0.647 | 0.214 | 0.199 | 0.444 | 0.453 |
| SemiparTM-1 | 0.99 | 0.683 | 0.648 | 0.064 | 0.063 | 0.236 | 0.223 |
| SemiparTM-3 | 0.70 | 0.740 | 0.736 | 0.397 | 0.371 | 0.645 | 0.635 |
| SemiparTM-3 | 0.90 | 0.651 | 0.647 | 0.214 | 0.199 | 0.444 | 0.453 |
| SemiparTM-3 | 0.99 | 0.683 | 0.648 | 0.064 | 0.063 | 0.236 | 0.223 |
| SemiparTM-cv | 0.70 | 0.730 | 0.722 | 0.572 | 0.488 | 0.642 | 0.630 |
| SemiparTM-cv | 0.90 | 0.648 | 0.647 | 0.162 | 0.118 | 0.441 | 0.450 |
| SemiparTM-cv | 0.99 | 0.683 | 0.647 | 0.001 | 0.001 | 0.236 | 0.222 |

6 Application to Actual Customer Feedback Corpus

We demonstrate the use of SemiparTM on an actual corpus of customer feedback data obtained from a food service establishment, collected through its digital channels. These channels include SMS, e-mail, and the feedback form available on the company’s website. The feedback was gathered over a defined period, from January 1, 2020, to December 5, 2020. To ensure privacy and confidentiality, all data were anonymized prior to analysis, and no personal details of the feedback authors, such as age or gender, were included.
Nevertheless, the data contained information considered sensitive under the Philippine Data Privacy Act of 2012 (Republic Act 10173): customer names, contact numbers, and the addresses of stores through which feedback was sent (which could plausibly be linked to the customers' own locations). The data also contained information that the company considered privileged and potentially damaging to its public reputation: brand names, registered trademarks, and words pertaining to negative experiences and practices associated with the business. Such information was deleted (in the case of customer details) or concealed (for privileged company information that could still be used for analysis). Tokens pertaining to brand names, trademarks, product names, and any other words that may be connected with the company have been replaced with generic tokens like "[productname]", "[location]", and "[employee]".
Other standard data-cleaning practices in NLP research were followed. Derivative words were clustered via stemming using the quanteda [20] package for R, and stop-words that carry no contextual meaning and serve mainly syntactic purposes (e.g., a, is, at, on) were removed using data provided with the tidytext [21] R package. After cleaning, the corpus contained 253 documents over a vocabulary of 844 unique words; documents average 19.9 words in length, and corpus sparsity is 0.79.
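The cleaning steps above were done in R with quanteda and tidytext; as a rough illustration of the same pipeline, the following Python sketch uses a toy stopword list and a crude suffix stripper as placeholders for real stopword and stemming resources:

```python
import re

STOPWORDS = {"a", "is", "at", "on", "the", "and", "to", "of"}  # toy list

def crude_stem(word):
    """Very rough suffix stripping, a stand-in for a proper stemmer
    (the study used quanteda's stemming in R)."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def preprocess(doc):
    """Lowercase, tokenize, drop stopwords, then stem."""
    tokens = re.findall(r"[a-z]+", doc.lower())
    return [crude_stem(t) for t in tokens if t not in STOPWORDS]

docs = ["The cashier is waiting at the counter",
        "Deliveries delayed on the payment step"]
cleaned = [preprocess(d) for d in docs]
```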
For the semiparametric topic model, auxiliary variables used are the following: (1) average number of feedback received from the authors of each document (customers sending in the feedback), (2) length of operations in number of years since first opening, (3) a binary variable on whether or not the store being referenced in the feedback services more than two business channels, (4) average performance in a service audit, and (5) average performance in a product quality audit. Histograms of these auxiliary variables are presented in Fig. 2. Table 7 presents the first four topics produced by LSA, PLSA, LDA, and SemiparTM-cv under the assumption of 10 latent topics in the corpus.
Table 7
First four topics from SemiparTM-cv and the legacy methods

Topic 1
- LSA: sour, sweet, spici, dark, implement, measur, salamat, asap, takeout, deduct, servic, manag, time, food, store, wait, cashier, branch, becaus, crew
- PLSA: deliveri, payment, deliv, call, card, onlin, cancel, refund, fail, transact, manag, branch, time, crew, servic, cashier, becaus, receipt, follow, guest
- LDA: [product name], [product name], receiv, meal, [product name], [bank], [product name], [product name], text, pcs, onlin, card, pleas, alreadi, refund, payment, transact, receiv, cancel, attach
- SemiparTM-cv: [brand], redeem, treat, alway, offlin, becaus, [bank], mall, complaint, food, hindi, sabi, wala, sana, ano, bigay, kayo, tapo, walang, nalang

Topic 2
- LSA: sobra, balik, account, lack, foodpanda, hintay, [location], bayad, happi, oil, alreadi, hour, payment, follow, deliveri, guy, card, veri, refund, [brand]
- PLSA: credit, email, pleas, alreadi, receiv, paid, pm, deduct, contact, account, guy, expect, wait, happen, counter, [brand], minut, told, onli, sabi
- LDA: qualiti, correct, pc, oil, takeout, immedi, pm, valu, [product name], measur, email, paid, fail, regard, famili, charg, credit, pay, refer, subject
- SemiparTM-cv: branch, card, onlin, [product name], day, [location], promo, onli, experi, free, ulit, ngayon, branch, kain, crew, tawag, sama, kulang, deliv, bakit

Topic 3
- LSA: [brand], food, hindi, branch, [product name], becaus, treat, redeem, sabi, wala, bag, plastic, hindi, [product name], [product name], [product name], [product name], call, wast, wala
- PLSA: [product name], [product name], [product name], receiv, meal, [product name], qualiti, [product name], takeout, oil, [brand], redeem, treat, alway, offlin, becaus, [bank], branch, mall, complaint
- LDA: fri, kulang, [product name], drink, [product name], check, [product name], larg, [product name], concern, hindi, wala, sabi, sana, bigay, walang, kayo, tawag, ngayon, tapo
- SemiparTM-cv: food, branch, time, cashier, manag, servic, crew, becaus, serv, receipt, custom, store, call, feedback, rider, avail, offer, befor, accept, pleas

Topic 4
- LSA: time, offlin, [product name], alway, pleas, onli, cashier, day, manag, veri, deliveri, redeem, treat, sabi, [brand], fri, deliv, offlin, concern, drink
- PLSA: measur, pm, dark, implement, correct, pcs, valu, text, fri, immedi, [product name], card, onli, onlin, promo, day, [location], experi, veri, free
- LDA: [product name], bag, plastic, miss, [product name], item, regular, [product name], ice, [product name], ano, kanina, lagay, isang, nalang, kain, dumat, sobra, sama, bakit
- SemiparTM-cv: onli, told, sinc, wait, store, hope, veri, disappoint, whi, [brand], follow, ms, gc, late, wait, note, regard, inform, tawag, bigay

The table presents the first four topics from SemiparTM-cv and the legacy methods, each defined by its top 20 words with the highest scores in the dictionary matrix \(\mathbf{X}\). Sensitive tokens, such as brand names and product names, have been hidden with placeholder tokens
One important observation here is that while the simulation study indicates that these topic definitions, coming from the dictionary matrix, should be close to the true topic distributions at this corpus size and sparsity level, interpretability is another matter altogether. For topic 1, SemiparTM-cv associated words that have much to do with payments and transactions: words like "redeem", "offlin", and specific bank names indicate possible issues customers might be having when transacting for products and services. For LDA, this topic covers some overlapping words ("onlin", "card", and "payment"). For LSA and PLSA, this topic leans toward more general service-related issues, with words like "manag", "servic", "cashier", and "time".
When performing topic modeling, it is typical to interpret topics on the basis of their top associated words, giving each a name or identifying description. Topic 1 for SemiparTM-cv may then be read as payment and redemption issues, while topic 2 covers promo-related issues ("branch", "promo", "free", and mentions of product names).
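Naming topics from their top-scoring words amounts to sorting each row of the estimated dictionary matrix; a small illustrative sketch (toy vocabulary and matrix, not the study's data) might look like:

```python
import numpy as np

def top_words(dictionary, vocab, n=5):
    """Return the n highest-scoring words for each topic (each row of
    the dictionary matrix), the usual basis for naming topics."""
    order = np.argsort(-dictionary, axis=1)[:, :n]
    return [[vocab[j] for j in row] for row in order]

vocab = ["refund", "card", "payment", "crew", "cashier", "food"]
# Hypothetical 2-topic dictionary matrix over the toy vocabulary.
B = np.array([[0.9, 0.7, 0.8, 0.0, 0.1, 0.0],   # a "payments" topic
              [0.0, 0.1, 0.0, 0.8, 0.9, 0.6]])  # a "service" topic
print(top_words(B, vocab, n=3))
```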
While these topic definitions appear comparably useful across the methods, the simulation study warns that for LSA and LDA the quality of these word associations (at least by the dictionary's cosine similarity) may be suspect in precisely this setting, where the corpus is small and covers a similarly small vocabulary.

7 Conclusions

The proposed semiparametric topic modeling methodology (SemiparTM) combines the known advantages of Latent Semantic Analysis (LSA) for discovering hidden semantic structures or topics in a collection of documents with the predictive power of semiparametric regression for generating topic distributions of previously unseen documents from a set of auxiliary document information. SemiparTM has been demonstrated to perform at par with or better than the three legacy topic models when reconstructing latent topic and vocabulary distributions, an advantage best observed in small corpora with limited vocabularies.
The use of regression as a second stage of the model proved its benefits when applied to a new set of documents. Unlike PLSA, whose estimation tends to overfit the training corpus and thus performed significantly worse on new documents, SemiparTM continued to perform at least at the level of LDA. Moreover, when the penalty parameter in the nonnegative matrix factorization stage is tuned via cross-validation, SemiparTM exhibits definitively superior performance in predicting the topic distributions of documents in a holdout corpus.
SemiparTM was also able to match PLSA with superior cosine similarities in the training corpora even with increasing levels of sparsity. The impact of sparsity is stronger when switching to holdout corpora, and it is here that we observe the second stage’s sensitivity to levels of sparsity. Nevertheless, for small training corpora (150 documents), SemiparTM was still observed to be capable of performing better than PLSA and at par with LDA even at the 0.99 sparsity level. For simply modeling topic structures through the dictionary matrix, the SemiparTM configurations, along with PLSA, yielded higher cosine similarity scores than either LSA or LDA in small corpus sizes with low to medium sparsity (0.70–0.90).
Even in the presence of misspecification errors, the SemiparTM configurations did not exhibit large declines in similarity scores, even on the holdout topic distribution matrix. In fact, the cross-validated SemiparTM achieved similarity scores under misspecification only slightly below the corresponding scores without it.
This study presents SemiparTM as a viable alternative to existing topic modeling procedures: it accounts for additional auxiliary information without increasing the overall complexity of the model, and its reduced dependence on hyperparameters requiring prior tuning allows for easier automation.

Acknowledgements

Computations performed for this paper were made possible with the support of the Advanced Science and Technology Institute of the Department of Science and Technology (DOST-ASTI), which generously granted us access to the high-performance computing facilities of its Computing and Archiving Research Environment (COARE).

Declarations

Conflict of interest

The authors have no conflict of interest.

Ethical Approval

Simulated data was used in the assessment of the proposed methods.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://​creativecommons.​org/​licenses/​by/​4.​0/​.

References
1. Ravi Kumar V, Raghuveer K (2013) Legal documents clustering and summarization using hierarchical latent dirichlet allocation. IAES Int J Artif Intell 2(1):27–35
2. Williams T, Betak J (2018) A comparison of LSA and LDA for the analysis of railroad accident text. Procedia Comput Sci 130:98–102
3. Berezina K, Bilgihan A, Cobanoglu C, Okumus F (2016) Understanding satisfied and dissatisfied hotel customers: text mining of online hotel reviews. J Hosp Mark Manag 25(1):1–24
4. Coussement K, Van den Poel D (2008) Improving customer complaint management by automatic email classification using linguistic style features as predictors. Decis Support Syst 44(4):870–882
5. Deerwester S, Dumais S, Landauer T, Furnas G, Leighton-Beck L (1988) Improving information-retrieval with latent semantic indexing. Proc ASIS Annu Meet 25:36–40
6. Hofmann T (1999) Probabilistic latent semantic indexing. In: Proceedings of the 22nd annual international ACM SIGIR conference on research and development in information retrieval. Association for Computing Machinery, pp 50–57
7. Blei D, Ng A, Jordan M (2003) Latent dirichlet allocation. J Mach Learn Res 3:993–1022
9. Steinberger J, Jezek K (2004) Using latent semantic analysis in text summarization and summary evaluation. Proc ISIM 4:93–100
10. Srivastava A, Sahami M (eds) (2009) Text mining: classification, clustering, and applications. Chapman and Hall/CRC, Boca Raton
11. Aggrawal A, Fu W, Menzies T (2018) What is wrong with topic modelling? And how to fix it using search-based software engineering. Inf Softw Technol 98:74–88
12. Barua A, Thomas S, Hassan A (2014) What are developers talking about? An analysis of trends in stack overflow. Empir Softw Eng 19(3):619–654
13. Krestel R, Fankhauser P, Nejdl W (2009) Latent dirichlet allocation for tag recommendation. In: Proceedings of the third ACM conference on recommender systems, pp 61–68
14. Kakkonen T, Myller N, Sutinen E, Timonen J (2008) Comparison of dimension reduction methods for automated essay grading. J Educ Technol Soc 11(3):275–288
15. Mikolov T, Chen K, Corrado G, Dean J (2013) Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781
17. Das R, Zaheer M, Dyer C (2015) Gaussian LDA for topic models with word embeddings. In: Proceedings of the 53rd annual meeting of the Association for Computational Linguistics and the 7th international joint conference on natural language processing, pp 795–804
18. Batmanghelich K, Saeedi A, Narasimhan K, Gershman S (2016) Nonparametric spherical topic modeling with word embeddings. In: Proceedings of the conference of the Association for Computational Linguistics, pp 537–542
19. Liu W, Zheng S, Jia S, Shen L, Fu X (2010) Sparse nonnegative matrix factorization with the elastic net. In: 2010 IEEE international conference on bioinformatics and biomedicine (BIBM), pp 265–268
21. Silge J, Robinson D (2016) tidytext: text mining and analysis using tidy data principles in R. J Open Source Softw 1(3):37
Metadata
Title: Semiparametric Latent Topic Modeling on Consumer-Generated Corpora
Authors: Dominic B. Dayta, Erniel B. Barrios
Publication date: 28.01.2025
Publisher: Springer Berlin Heidelberg
Published in: Annals of Data Science
Print ISSN: 2198-5804
Electronic ISSN: 2198-5812
DOI: https://doi.org/10.1007/s40745-025-00587-y