
Open Access 09.11.2022

Comparison of Semantic Similarity Models on Constrained Scenarios

Authored by: Rafael Teixeira, Mário Antunes, Diogo Gomes, Rui L. Aguiar

Published in: Information Systems Frontiers | Issue 4/2024


Abstract

The technological world has grown by incorporating billions of small sensing devices, collecting and sharing large amounts of diversified data over the new generation of wireless and mobile networks. We can use semantic similarity models to help organize and optimize these devices. Even so, many of the proposed semantic similarity models do not consider the constrained and dynamic environments where these devices are present (IoT, edge computing, 5G, and next-generation networks). In this paper, we review the commonly used models, discuss the limitations of our previous model, and explore latent space methods (through matrix factorization) to reduce noise and correct the model profiles without additional data. The new proposal is evaluated against state-of-the-art corpus-based approaches, achieving competitive results while training four times faster than the next fastest model and occupying 36 times less disk space than the next smallest model.

1 Introduction

Semantic similarity is the degree to which textual units (documents Mihalcea et al. (2006), sentences Li et al. (2006), or terms Iosif and Potamianos (2010)) have the same meaning Ilakiya et al. (2012). Given how hard it has been to define meaning, a natural approach to obtain it from a unit is to determine how close it is to other units Abdalla et al. (2021). The ability to judge similarity has been used in many applications, such as automated spelling correction, word sense disambiguation, sentiment analysis, and information retrieval Mohammad and Hirst (2012); Sitikhu et al. (2019); Araque et al. (2019).
Currently, there is no uniform way to represent, share, and understand Internet of Things (IoT) data, leading to information silos that hinder the realization of complex IoT/Machine to Machine (M2M) scenarios. Some of these barriers can be broken down with semantic similarity; however, most models designed for this task do not consider the highly dynamic and variable environment of common IoT, Mobile Edge Computing (MEC), and Fifth Generation of Telecommunications (5G) scenarios Afzal et al. (2018); Li et al. (2018). Our work focuses on semantic similarity models designed for these conditions, building on our previously proposed unsupervised learning model that relies on web search engines to learn Distributional Profiles (DP) Antunes et al. (2017, 2021).
This paper's contributions are: I) a new semantic similarity model based on DP, focused on dynamic and constrained environments such as IoT, MEC, and 5G; II) an exhaustive comparison of the proposed model against state-of-the-art corpus-based models considering such environments. The models considered are Term Frequency-Inverse Document Frequency (TF-IDF) & Latent Semantic Indexing (LSI), Word2Vec, GloVe, fastText, and Robustly optimized BERT approach (RoBERTa). Most of these models have publicly available pre-trained models/word embeddings, so our comparison covers both models trained from scratch and pre-trained cases, employing online learning when available. To simulate a constrained environment, we used the small corpus of data our model collected for its training, containing snippets of text regarding the 20 most frequent terms in a popular IoT platform and the terms present in the MC dataset Miller and Charles (1991). For evaluation, we used a semantic similarity dataset Antunes et al. (2017) containing 30 word pairs built from the same 20 IoT terms.
The remainder of this paper is organized as follows. Section 2 reviews the state-of-the-art in semantic similarity and presents the most relevant methods. The discussion of our previous method and proposed improvements is found in Section 3. Section 4 introduces the models used to benchmark our proposal. The models’ evaluation and discussion are presented in Section 5. Finally, we present our conclusions in Section 6.

2 Background

Semantic similarity is a subtask of semantic relatedness Mohammad and Hirst (2012). In this subtask, two terms are considered semantically similar if there is a synonymy, hyponymy (hypernymy), or troponymy relation between them (examples include doctor–physician and mammal–elephant). To calculate it, there are three major types of semantic measures: i) Knowledge-based, which rely either on ontologies or semantic networks Mohammad and Hirst (2012), ii) corpus-based, which extract the context of the terms from large corpora Yang et al. (2019), and iii) hybrid approaches, that are distributional in nature yet exploit information from a lexical resource.
Knowledge-based approaches rely on handcrafted resources such as thesauri, taxonomies, or encyclopedias as the context to determine the distance between two words. Most previous studies depend on the semantic isA relations in WordNet Miller (1995). WordNet is a manually curated lexicon and taxonomy Yang et al. (2019), where each node represents a fine-grained concept or word-sense, and each edge represents a lexical semantic relationship such as hypernymy or troponymy. One of the earliest and simplest measures is edge counting, proposed by Rada et al. (1989), where the similarity is directly related to the minimum distance between the two nodes. The more edges between the nodes, the more dissimilar they are. Recent approaches take into account the type of relations connecting the nodes. They do this because, when only hyponymy relations are followed, somewhat similar words can still be reached over longer paths, whereas with other relations unrelated words can be reached within a few connections Mohammad and Hirst (2012). Besides WordNet, other Knowledge Graphs are available, such as Freebase Bollacker et al. (2008), DBpedia Bizer et al. (2009), YAGO Hoffart et al. (2013), or Probase Wu et al. (2012).
Even so, these solutions can only be applied to languages with sufficiently developed lexical resources, which are expensive to build as they require human experts. Furthermore, curating a lexical resource is also costly, and there is usually a lag between the current state of language usage/comprehension and the lexical resource representing it Mohammad and Hirst (2012). For example, the WordNet project no longer accepts comments and suggestions due to funding and staffing issues1.
Corpus-based measures rely on the hypothesis that words in similar contexts tend to be semantically close Padó and Lapata (2007). Instead of lexical resources, they require a large corpus that represents the common usages of the target words. From it, the approaches extract the context of the terms and then use statistics, neural networks, and other methods to analyze word distribution Yang et al. (2019). Statistical approaches include word co-occurrence, point-wise mutual information, and word association ratio Mohammad and Hirst (2012). Neural network based approaches include Word2Vec, fastText, BERT, and ELMo. The best results are obtained with neural networks that employ the recent advances in deep learning (BERT, ELMo, XLNet, and GPT). They are composed of convolutional or transformer layers and are trained on a large corpus using unsupervised tasks such as Masked Language Modeling (MLM) or Next Sentence Prediction (NSP). Even though they achieve the best results, they take longer to train and are usually more computationally expensive than shallow networks (fastText and Word2Vec).
To obtain similarity using these approaches, most statistical and neural network based models map the textual units into a multidimensional space where the distance between two points indicates their semantic similarity Mohammad and Hirst (2012). To measure the distance between points, several distance metrics have been used, such as Euclidean distance, cosine distance, Jensen-Shannon distance, soft cosine distance, and word mover's distance Sitikhu et al. (2019), with cosine distance being the most frequently used.
The knowledge-based approach’s main disadvantage is the cost of creating and maintaining a good lexical resource. Combining it with the versatility of a corpus-based model can mitigate the lack of some information in the lexical resource. With this in mind, many authors found ways to exploit the best of each method and build hybrid models. The authors of Camacho-Collados et al. (2015) used BabelNet Navigli and Ponzetto (2012), a semantic network, to build a corpus on which the corpus-based model learns the vector representations for the textual units. The model uses the Wikipedia pages associated with a given textual unit, in this case, the synset of BabelNet, and all its outgoing links to form the corpus for the specific concept. The Wikipedia pages of the hypernyms and hyponyms of the concept in BabelNet are then used to expand the corpus further.
It is worth mentioning that when the ideal conditions for each type of similarity measure are met, meaning knowledge-based has a good lexical resource, and corpus-based has a large and well-maintained corpus, both types of models obtain good results. Even so, the ever-increasing number of constrained devices in highly dynamic environments (consider IoT or edge computing scenarios) makes it very difficult to obtain those conditions. That led us to propose a DP model that trades accuracy for flexibility and simplicity. Our solution does not require a specialized (large) corpus and learns distributional profiles through public web services.

3 Distributional Profiles

Given a target word u, we use public web services, namely search engines, to gather a potentially relevant corpus and build the Distributional Profile of Word (DPW). The profile is built based on proximity, so a word w is only accounted for if it is in the neighborhood of the target word u. Once we have all the relevant words, given the corpora built with the search engine results, we can calculate the DPW, which is defined as:
$$\begin{aligned} DPW(u) = \left\{ w_{1}, f(u, w_{1});\; \ldots;\; w_{n}, f(u, w_{n}) \right\} \end{aligned}$$
(1)
where u is the target word, \(w_{i}\) is a word that occurs within the neighborhood of u, and f is any strength association metric, in our case, the co-occurrence frequency. The profile can also be interpreted as a point in a high-dimensional space, where each word \(w_{i}\) represents a dimension and \(f(u, w_{i})\) the value in it. Given this, we will refer to words inside a DPW as dimensions from this point onward.
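To make the construction concrete, the following Python sketch builds a DPW as a sparse map from co-occurring words to frequencies, assuming the corpus has already been tokenized into snippets; the window size and function names are illustrative and not the exact prototype code.

```python
from collections import Counter

def build_dpw(snippets, target, window=3):
    """Build a DPW as a sparse {word: co-occurrence frequency} map by
    counting every word that appears within `window` tokens of the target."""
    profile = Counter()
    for tokens in snippets:  # each snippet is a list of preprocessed tokens
        for i, token in enumerate(tokens):
            if token != target:
                continue
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for neighbour in tokens[lo:i] + tokens[i + 1:hi]:
                profile[neighbour] += 1
    return dict(profile)
```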
Using this notion, a DPW does the same thing as word embedding algorithms: it converts a word into a vector in a high-dimensional space. This means that we can use the same metrics used with word embeddings to calculate semantic similarity. We opted for the cosine distance, presented in (2), as the similarity metric for our model because it is invariant to scale, meaning it does not consider the vectors’ magnitude, only their directions. This property is vital for unbalanced corpora, such as those found in M2M scenarios or gathered from search engines (due to their ranking algorithms).
$$\begin{aligned} Similarity(x,y) = \frac{ \sum _{i=1}^{n} x_{i}y_{i} }{ \sqrt{\sum _{i=1}^{n} {x_{i}^{2}}}\sqrt{\sum _{i=1}^{n} {y_{i}^{2}}} } \end{aligned}$$
(2)
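A direct sparse implementation of (2) over two profiles could look as follows; dimensions missing from either profile simply contribute nothing to the dot product (an illustrative sketch, not the prototype code).

```python
import math

def cosine_similarity(p, q):
    """Cosine similarity between two sparse profiles ({word: frequency} dicts)."""
    dot = sum(freq * q.get(word, 0.0) for word, freq in p.items())
    norm_p = math.sqrt(sum(freq * freq for freq in p.values()))
    norm_q = math.sqrt(sum(freq * freq for freq in q.values()))
    return dot / (norm_p * norm_q) if norm_p and norm_q else 0.0
```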
Although using public web services to gather the corpus allows us to train DP for scenarios with specific and unusual vocabulary, the profile created can be noisy and contain several dimensions with low relevance (low \(f(u, w_{i})\)). Since the cosine distance is invariant to scale, the combined weight of multiple low relevance dimensions can change the direction of the word vector and damage the similarity score. Additionally, a profile can contain multiple senses of the target word (sense-conflation), which may also change the word vector direction, limiting the potential of this method.
To reduce the DPW’s unwanted dimensions, several filters can be applied. Our original work proposed stemming (minimizing issues with plural words) and a p-value statistical significance test (removing non-significant dimensions). However, other methods, such as the Pareto principle and elbow point estimation, can be employed.
In this work, we extend our previous proposal to address the issue of sense-conflation. To do so, we used clustering on the DP to identify word senses. The rationale is that dimensions belonging to the same category are closer to each other than words from other categories. The clustering starts with creating a square matrix, which will be used as the dataset for the clustering process. The matrix contains the frequencies of all the words within the profile (see Table 1), where each row represents a DP for the hyper-space defined by the target word profile.
Table 1
Co-occurrence matrix created from a DPW profile

            | \(w_{0}\)          | \(w_{i}\)          | \(w_{n}\)
  \(w_{0}\) | 1.0                | \(f(w_{0},w_{i})\) | \(f(w_{0},w_{n})\)
  \(w_{i}\) | \(f(w_{i},w_{0})\) | 1.0                | \(f(w_{i},w_{n})\)
  \(w_{n}\) | \(f(w_{n},w_{0})\) | \(f(w_{n},w_{i})\) | 1.0
These clusters do not represent word senses from a thesaurus as they are conceptually more similar to Latent Semantic Analysis (LSA) categories and may not correspond to our human perception. Consequently, we will refer to clusters as categories from this point forward. This approach assumes that some clusters represent high relevance categories while others represent low relevance ones.
This implication can cause some problems considering that two unrelated target words, u and v, can end up in the same low relevance category, and once we calculate the semantic similarity between them, they are considered similar, generating a false positive. To minimize this issue, our model incorporates an affinity value, the distance between the target word and each category, which can be understood as a bias that measures the word’s natural tendency to be used as the given category. After clustering and computing the affinity of the target word to each cluster, the Distributional Profile of Word Categories (DPWC) is extracted from the DPW and grouped according to the clusters obtained. After being calculated, the affinity values are normalized using (3).
$$\begin{aligned} a'_{i} = \frac{a_{i}}{\sum a_{j}} \end{aligned}$$
(3)
The DPWC is defined following (4):
$$\begin{aligned} DPWC(u) = \left\{ \begin{array}{c} a_{1}; \{w_{i}, f(u_{1}, w_{i});\ldots \} \\ \ldots \\ a_{c}; \{w_{j}, f(u_{c}, w_{j});\ldots \} \end{array}\right\} \end{aligned}$$
(4)
where u is the target word, \(w_{i}\) is a word that occurs with u in a particular category, f stands for co-occurrence frequency, and \(a_{i}\) is the affinity between u and a word category. With these changes to the model, the similarity measure between DPs had to be updated accordingly. The updated formula is presented in (5):
$$\begin{aligned} S(u, v) = \max \left( cosine(u_{c}, v_{c}) \times \frac{a_{u_{c}} + a_{v_{c}}}{2} \right) \end{aligned}$$
(5)
where \(u_{c}\) and \(v_{c}\) represent a specific category from u and v, respectively, and a represents the category’s affinity. The updated similarity measure is the maximum, over all possible category pairs, of the cosine similarity weighted by the average of the two categories’ affinities.
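Assuming a DPWC is represented as a list of (affinity, category profile) pairs, (5) can be sketched as follows, reusing the cosine_similarity helper above (the representation and names are illustrative).

```python
def dpwc_similarity(dpwc_u, dpwc_v):
    """Maximum, over all category pairs, of the cosine similarity weighted by
    the average affinity of the two categories (Eq. 5)."""
    best = 0.0
    for a_u, category_u in dpwc_u:
        for a_v, category_v in dpwc_v:
            score = cosine_similarity(category_u, category_v) * (a_u + a_v) / 2.0
            best = max(best, score)
    return best
```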
One problem with the clustering process is that the created matrix is sparse due to the constraints of our learning process. The only row/column guaranteed to be dense is the one containing the target word u. The remaining rows tend to be sparse as there are no guarantees that word \(w_{i}\) and word \(w_{j}\) (two dimensions from word u) have appeared together in the constrained corpus collected. Our intuition is that the matrix’s 0 coefficients result from the lack of data and do not capture the actual distribution of the co-occurrence.
Following a similar approach as Word2Vec and LSA, we use matrix factorization to reduce the latent dimensions and reconstruct the matrix where the 0 coefficients are replaced with predictions of the actual value. This process creates a dense co-occurrence matrix that provides more information for the clustering algorithm and helps optimize the profile of the target word u.
As we mentioned before, given the constrained corpus, the DP contains a certain amount of noise, \(\epsilon\), which is reduced during the matrix’s values reconstruction. Since the factorization and reconstruction reduce the latent dimensions, the noisy ones are removed, improving the similarity results even without clustering (the original model can easily be swapped by the adjusted one).
A typical factorization method is the Singular Value Decomposition (SVD) applied in LSA and Principal Component Analysis (PCA). However, for non-negative matrices, a better method is Non-Negative Matrix Factorization (NMF) Lee and Seung (1999). This method factorizes the matrix V into two matrices, W and H, with the property that all three matrices have no negative elements. The decomposition is an approximation, as presented in (6).
$$\begin{aligned} V \approx WH \end{aligned}$$
(6)
The resulting W and H matrices minimize the quadratic error between the original matrix V and its approximation \(V'\). Furthermore, the decomposition is not necessarily unique and depends on the initialization and the optimization method.

3.1 Implementation

Our prototype is divided into six different components, as depicted in Fig. 1.
The first component (corpus extraction) bridges our solution with the search engine to extract a constrained corpus. Our prototype uses USearch2 as the search engine (the developers offered an unlimited plan), but the component can be used with any search engine. The corpus is composed of snippets returned by the search engine. In previous work, we observed that the snippets contained enough information to build reliable DPWs Antunes et al. (2017). Another remark regarding this component is its caching mechanism, which minimizes the number of requests made to the search engine.
The second component (text processing) implements a preprocessing pipeline to clean the corpus and divide it into tokens. The pipeline is divided into four steps, presented in Fig. 2. First, the snippets are tokenized; then, the resulting tokens are filtered using stop word, non-alphanumeric, and small token filters. The removed words are deemed irrelevant because they occur frequently in the language and provide little information. We considered a small token to be any token with fewer than three characters. The component was implemented using the NLTK library3.
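A minimal NLTK-based sketch of this pipeline could look as follows; the exact filters and tokenizer settings of the prototype may differ.

```python
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

nltk.download("punkt", quiet=True)
nltk.download("stopwords", quiet=True)
STOP_WORDS = set(stopwords.words("english"))

def preprocess(snippet):
    """Tokenize a snippet and drop stop words, non-alphanumeric tokens,
    and small tokens (fewer than three characters)."""
    tokens = word_tokenize(snippet.lower())
    return [t for t in tokens
            if t.isalnum() and t not in STOP_WORDS and len(t) >= 3]
```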
Using the processed text, we extract the DPW profile, which is then analyzed by the DPW optimization component. This component creates the co-occurrence matrix and factorizes it using the NMF, which was implemented using NumPy4 based on the Multiplicative Update from Lee and Seung (1999) with imputation (necessary to deal with the missing values) and regularization to prevent over-fitting. After the factorization, the matrix is reconstructed, and from it, we extract an adjusted DPW and a distance matrix for the clustering algorithm. As discussed previously, the NMF does not have a single solution for factorization. The standard solution to this problem is running several factorizations and choosing the best one. In our prototype, we restricted the initial random seed to have a consistent matrix initialization, as our focus is to validate the positive impact of latent methods on a constrained corpus.
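For reference, the multiplicative updates of Lee and Seung (1999) can be sketched in NumPy as below; this bare version omits the imputation and regularization used in our prototype and fixes the random seed for a consistent initialization.

```python
import numpy as np

def nmf_multiplicative(V, k, n_iter=200, eps=1e-9, seed=0):
    """Approximate the non-negative matrix V (n x m) as W (n x k) @ H (k x m)
    using the Lee & Seung multiplicative update rules."""
    rng = np.random.default_rng(seed)
    n, m = V.shape
    W = rng.random((n, k)) + eps
    H = rng.random((k, m)) + eps
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ W @ H + eps)  # update H, stays non-negative
        W *= (V @ H.T) / (W @ H @ H.T + eps)  # update W, stays non-negative
    return W, H

# The reconstructed (dense) co-occurrence matrix used downstream is W @ H.
```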
For the clustering component, we used Hierarchical Clustering (with average linkage). The disadvantage of this algorithm is the need to define the number of clusters a priori. Since we did not know the ideal number of clusters, we used the Silhouette Coefficient to estimate it. The Hierarchical Clustering was implemented with Scikit-learn5.
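A possible Scikit-learn sketch of this step, choosing the number of clusters with the Silhouette Coefficient over a precomputed distance matrix (parameter names assume a recent Scikit-learn release):

```python
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import silhouette_score

def cluster_profile(distance_matrix, max_clusters=8):
    """Average-linkage hierarchical clustering; the number of clusters is the
    one maximizing the Silhouette Coefficient."""
    best_labels, best_score = None, -1.0
    upper = min(max_clusters, len(distance_matrix) - 1)
    for k in range(2, upper + 1):
        model = AgglomerativeClustering(n_clusters=k, metric="precomputed",
                                        linkage="average")
        labels = model.fit_predict(distance_matrix)
        score = silhouette_score(distance_matrix, labels, metric="precomputed")
        if score > best_score:
            best_labels, best_score = labels, score
    return best_labels
```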
Finally, the DPWC component uses the reconstructed co-occurrence matrix and the clusters to return the DPWC(u) of the target word. This component also computes the affinity between the target word and each cluster.

4 Models

To evaluate the new proposal, different word embedding algorithms were implemented and tested. The algorithms range from simple statistical measures like TF-IDF with LSI to deep learning models like Bidirectional Encoder Representations from Transformers (BERT). Given that the training corpus was small and algorithms like BERT usually need a large corpus of data to obtain significant results, pre-trained models were also considered a possibility.
Since the models return word embeddings, the similarity between two words is computed as the cosine distance between their embedding vectors, as seen in (2).

4.1 TF-IDF & LSI

TF-IDF stands for Term Frequency-Inverse Document Frequency and evaluates the importance of a word in a document given a corpus. The word’s significance is proportional to the number of times the word appears in the document but is compensated by the word’s frequency in the corpus.
To calculate the TF-IDF weight, we need to consider two terms: the Term Frequency (TF), presented in (7), which is the number of times the term t appears in document i, \(\vert D_{i}(t)\vert\), divided by the number of words in it, \(\vert D\vert\); and the Inverse Document Frequency (IDF), presented in (8), which is the logarithm of the number of documents in the corpus, \(\vert C\vert\), divided by the number of documents containing t, \(\vert C(t)\vert\). Once the two terms are calculated, the TF-IDF weight of the term is obtained by multiplying them, as seen in Equation 9.
$$\begin{aligned} TF(t) = \frac{\vert D_{i}(t)\vert }{\vert D\vert } \end{aligned}$$
(7)
$$\begin{aligned} IDF(t) = \log _{e}\left( \frac{\vert C\vert }{\vert C(t) \vert }\right) \end{aligned}$$
(8)
$$\begin{aligned} \text{TF-IDF}(t) = TF(t)\times IDF(t) \end{aligned}$$
(9)
The TF portion of the equation measures how important the term is in a given document. Since a term is more likely to be repeated in a bigger document, the number of times it appears is divided by the number of words in the document to normalize it. As for the IDF, its objective is to measure how important the term is in the corpus. Since all words are treated equally during TF calculation, the IDF portion provides a way to reduce the weight of frequent words such as “is” or “that”.
However, as we said, TF-IDF gives the relevance of a word, not a word embedding. To obtain something similar to a word embedding, we use LSI. LSI uses SVD to identify patterns between terms and concepts in a corpus. The idea is that words used in the same context have similar meanings even if the words themselves are different. After applying LSI to a word, we receive a vector where each index can be considered a given topic, and the value of that position is the relevance of that word in that topic. The number of topics considered can be adjusted, and similar words should obtain similar relevance across the different topics considered.
The first step in obtaining word similarity using this method is to train the TF-IDF model. Then the LSI model is trained using the TF-IDF indexing. With both models trained, the topic vectors of the words can be obtained, and to know the similarity between words, one can calculate the cosine distance between their topic vectors. The Gensim library6 was used as the implementation of the TF-IDF and LSI algorithms in our experiments.
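A condensed Gensim sketch of this procedure (variable names and the number of topics are illustrative):

```python
from gensim import corpora, models, matutils

def lsi_word_vector_factory(tokenized_docs, num_topics=125):
    """Train TF-IDF + LSI on a tokenized corpus and return a function that
    maps a word to its dense topic vector."""
    dictionary = corpora.Dictionary(tokenized_docs)
    bow_corpus = [dictionary.doc2bow(doc) for doc in tokenized_docs]
    tfidf = models.TfidfModel(bow_corpus)
    lsi = models.LsiModel(tfidf[bow_corpus], id2word=dictionary,
                          num_topics=num_topics)

    def topic_vector(word):
        bow = dictionary.doc2bow([word])
        return matutils.sparse2full(lsi[tfidf[bow]], num_topics)

    return topic_vector
```

The similarity between two words is then the cosine distance between their topic vectors, as in (2).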

4.2 Word2Vec

Word2Vec is a machine learning algorithm that, as its name suggests, transforms words into vectors. The algorithm uses a neural network and a large corpus of unannotated text to learn word associations.
To train the model, the algorithm solves one of two artificial tasks. It either solves the skip-gram task, where the model receives a word as input and predicts the words in its surrounding window, or the Continuous Bag-of-Words (CBOW) task, where the model predicts the central word given the surrounding window. According to the original paper Mikolov et al. (2013), skip-gram is better for semantic tasks, while CBOW is better for syntactic tasks and trains faster.
As seen in Fig. 3, the network itself is shallow, only containing three layers. Using the skip-gram as an example, the input layer receives a one-hot vector representing the input word. The second layer is used to create the latent space we will use as the word embeddings. The last layer, the output, outputs the predicted probability distribution of nearby words to the input. Once the model is trained, the last layer is ignored, and we use the latent space embeddings resulting from the projection between the input layer and the hidden layer as the embedding vectors.
Two factors that significantly influence the algorithm’s performance are the context and word vector sizes used. Usually, the model’s performance increases with the size of the word vectors, but after some point, the improvements become marginal. The context size defines the number of words used before and after the central word as context. The recommended context size is five for skip-gram and ten for CBOW. As for the word vectors, the usual size ranges from 100 to 1000.
Given the model’s simplicity and results, it is a prevalent solution in different areas, being applied in machine translation, recommendation systems, and automatic text tagging.
Its popularity also resulted in many off-the-shelf implementations, with models pre-trained on large text corpora made available for them. The Gensim library was used as the Word2Vec implementation in our experiments. This implementation also supports online training, a significant advantage given the highly dynamic environments being considered. Even so, Gensim only provides vectors pre-trained on a large corpus, not pre-trained models. This prevents us from using online training to further train on a specialized dataset.

4.2.1 Pre-Trained

The pre-trained vectors used are word2vec-google-news-300, trained on a subset of the Google News dataset and containing 300-dimensional embeddings for 3 million words. As we will later see, this embedding size is above the ones we used for the models trained from scratch. Even so, we can still draw conclusions from the results obtained.
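For reference, these vectors can be loaded directly through Gensim's downloader; this is one convenient route, not necessarily the exact loading code used in our experiments, and the query words are illustrative.

```python
import gensim.downloader as api

wv = api.load("word2vec-google-news-300")          # KeyedVectors, 300-d embeddings
similarity = wv.similarity("doctor", "physician")  # cosine similarity, if both words exist
```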

4.2.2 Untrained

As for the models trained from scratch, we used most of the default values, running five epochs for each one with a minimum occurrence of one for each word and varying the context and embedding sizes detailed in the evaluation.
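A Gensim training call matching this setup might look as follows; the window and vector sizes shown are one of the combinations evaluated later, and the query words are illustrative.

```python
from gensim.models import Word2Vec

# tokenized_docs: list of token lists produced by the preprocessing pipeline
model = Word2Vec(sentences=tokenized_docs, vector_size=125, window=7,
                 min_count=1, sg=1, epochs=5)  # sg=1 selects skip-gram; CBOW is sg=0
vector = model.wv["sensor"]                      # embedding of a single word
score = model.wv.similarity("sensor", "device")  # cosine similarity between two words
```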

4.3 GloVe

Similar to Word2Vec, GloVe Pennington et al. (2014) is also a machine learning algorithm that generates word embeddings. The difference lies in the task they solve, as GloVe is an unsupervised algorithm that focuses on word-word co-occurrence to extract meaning instead of solving the skip-gram or CBOW.
The model is a global log-bilinear regression model that combines the advantages of the global matrix factorization and local context window methods Pennington et al. (2014). The first step in training this model is to construct a word-word co-occurrence matrix, which can be done with a single pass through the corpus. Then, using only the non-zero entries of this matrix, the algorithm learns word vectors such that their dot product equals the logarithm of the words’ probability of co-occurrence Pennington et al. (2014).
In this model, we can also define the context size to use when building the word-word co-occurrence matrix and the embeddings vector size. The context’s typical size is 15, and the embeddings vector size ranges from 25 to 300.
In its original paper Pennington et al. (2014), GloVe shows better performance than Word2Vec and other models on word analogy, word similarity, and named entity recognition tasks. Even so, the corpus used to obtain those results is bigger than what we expect in our environment, leaving open the question of which model performs best with a smaller corpus.
To train and evaluate GloVe, we use the actively maintained library from Stanford7.

4.3.1 Pre-Trained

For the pre-trained solution, we had 4 options:
  • Wikipedia 2014 + Gigaword 5 (6B tokens, 400K vocab, uncased, 50d, 100d, 200d, and 300d vectors);
  • Common Crawl (42B tokens, 1.9M vocab, uncased, 300d vectors);
  • Common Crawl (840B tokens, 2.2M vocab, cased, 300d vectors);
  • Twitter (2B tweets, 27B tokens, 1.2M vocab, uncased, 25d, 50d, 100d, and 200d vectors).
Of all these options, the first presents the text snippets most likely to be composed of complete and correct sentences. Even though it has the smallest number of tokens, it comes with vectors trained at different sizes, which we also evaluate during the experiments.
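If these Wikipedia + Gigaword vectors are needed in Python, one convenient way to load them (not necessarily how the original experiments loaded them) is again through Gensim's downloader; the query words are illustrative.

```python
import gensim.downloader as api

glove = api.load("glove-wiki-gigaword-300")   # 400K uncased vocab, 300-d vectors
score = glove.similarity("doctor", "physician")
```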

4.3.2 Untrained

While training the model from scratch, the focus is on evaluating the effect of the different context windows and embedding vector sizes. With this in mind, we kept most default values, using a minimum word frequency of one and 25 epochs.

4.4 FastText

FastText Bojanowski et al. (2017) is a word embedding model proposed by Facebook AI8. It differs from the previous models because it focuses on the word’s morphological structure to extract meaning. This information is essential in morphologically rich languages such as German, where a single word can have many morphological forms that rarely occur in the corpus.
To analyze the word’s morphological structure regardless of language, the model transforms words into their character n-grams, as seen in Fig. 4. During training, the model creates vectors representing the n-grams. The word vector is the sum of its n-gram vectors. This feature allows the model to create embedding vectors even for words unseen during training and keeps the training of the model simple.
Like Word2Vec, to train fastText, the model uses the CBOW or Skip-gram strategy. However, its straightforward implementation trains faster than Word2Vec Bojanowski et al. (2017).
The fastText implementation used is the one provided by Facebook AI as an open-source project9.

4.4.1 Pre-Trained

On the official website of fastText, there are some options for pre-trained models. They are based on Wikipedia, the UMBC WebBase corpus, statmt.org, or a common crawl of websites.
For the same reason mentioned in the pre-trained model of GloVe in sub-subsection 4.3.1, we opted for the Wikipedia-based model, which contained 1 million word vectors and a vector size of 300.

4.4.2 Untrained

For the model trained on the custom dataset, we kept most default values, changing only the vector and context sizes as in the previously mentioned models. The model was trained for five epochs with a minimum word frequency of one.
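With the official fastText Python bindings, an equivalent training run could be sketched as below; the corpus path and query word are hypothetical, and the sizes shown are one of the evaluated combinations.

```python
import fasttext

# corpus.txt: one preprocessed snippet per line (hypothetical path)
model = fasttext.train_unsupervised("corpus.txt", model="skipgram",
                                    dim=125, ws=7, minCount=1, epoch=5)
vector = model.get_word_vector("sensor")  # works even for unseen words via n-grams
```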

4.4.3 Online Trained

Besides using the pre-trained model or training from scratch, fastText also supports online training. In this setting, we can use the pre-trained vectors for the words instead of cold starting a model. To evaluate the benefit of the online training, we used the vectors of the pre-trained model as the base for a new model trained on our corpus.

4.5 BERT

Our solution will be compared with one more model, a variant of BERT Devlin et al. (2019) called RoBERTa Liu et al. (2019). This model was chosen as a representative of the deep learning approaches.
BERT’s original model uses the well-known transformer architecture Vaswani et al. (2017), which we will not review in detail. The key difference between the original Transformer’s architecture and BERT’s is the use of a bidirectional transformer encoder and the pre-train task used. The pre-training task is MLM, which enables the representation to fuse the left and right context, allowing the pre-training of a deep bidirectional Transformer Devlin et al. (2019).
The RoBERTa model was proposed as there were several possible improvements to the BERT pre-training procedure. The improvements made to the base model were:
  • training the model longer, with bigger batches, over more data;
  • removing the next sentence prediction objective;
  • training on longer sequences;
  • dynamically changing the masking pattern applied to the training data.
Training these models is computationally expensive Liu et al. (2019), as they have many trainable parameters (\(RoBERTa_{Large}\) has 355M), and their training is divided into two steps. The first is the pre-training step, which we already mentioned. In this step, the model goes through a large corpus of unannotated text, solving an unsupervised task. The task in RoBERTa is MLM, where the model slides through the corpus, and in each window, a percentage of the tokens are randomly masked. The model then predicts the masked tokens. After pre-training, the model goes through a supervised training step for the specific task it will solve. This step is crucial, as it was shown that the results are considerably worse without it. The results also show that a diverse enough training set can overcome the lack of in-domain training data.
The problem with these models is that the bigger they are, the better their performance Liu et al. (2019). This leads the researchers to build bigger models trained on larger corpora during multiple epochs, which is not feasible when considering highly dynamic environments with limited computational power.
With this in mind, we decided to evaluate a deep learning model on semantic similarity, considering a small corpus for pre-training and a small supervised training set, analyzing the model’s performance obtained when changing the word embeddings vector size and during the multiple training steps.
The pre-training of the models was done using a script provided by a tutorial in the SBERT library10, while the task training was developed using the library itself.

4.5.1 Pre-Trained

Given the corpus size required for pre-training a BERT model, one of the most frequent approaches is to use a model with the pre-training step already performed. This is viable as the pre-training is supposed to learn general word embeddings of the language that will later be specialized for the task at hand.
The pre-trained model used is the one proposed in the original paper, which was trained on BookCorpus, English Wikipedia, CC-News, OpenWebText, and Stories. The combined corpora contain more than 160 GB of uncompressed text, much larger than the corpus we will use to train the model from scratch.
In the Hugging Face Transformers, this model is named “roberta-base”11. After downloading the model, we evaluated it without training it in a specific task. Even though we do not expect good results, it will be helpful to understand whether there is a difference between pre-training on our corpus or a bigger one.

4.5.2 Untrained

Considering highly dynamic and specific environments as the end application of the models, the general corpora used to pre-train RoBERTa might not be as relevant as a smaller specialized corpus. To evaluate this, we trained RoBERTa from scratch using our specialized and smaller corpus. So that the corpus was the only varying factor, we evaluated the model using only the pre-training.
After this evaluation, the model was also trained on the downstream task of word similarity through the Siamese BERT architecture, which obtains the word embeddings of the two textual units and calculates their similarity using the cosine distance. The network corrects the word embeddings and learns similarity by comparing the cosine distance with the expected similarity score.
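A sketch of this task-training step with the SBERT (sentence-transformers) library is given below; `pairs` stands for (word1, word2, score-in-[0,1]) tuples from the supervised dataset and is a hypothetical variable, and the hyperparameters shown may differ from the ones used in our runs.

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses, models, util

# Wrap a (pre-)trained RoBERTa checkpoint in a Siamese/bi-encoder architecture
word_embedding = models.Transformer("roberta-base")
pooling = models.Pooling(word_embedding.get_word_embedding_dimension())
model = SentenceTransformer(modules=[word_embedding, pooling])

# Supervised similarity pairs rescaled to [0, 1] (hypothetical variable `pairs`)
train_examples = [InputExample(texts=[w1, w2], label=score) for w1, w2, score in pairs]
loader = DataLoader(train_examples, shuffle=True, batch_size=16)
loss = losses.CosineSimilarityLoss(model)
model.fit(train_objectives=[(loader, loss)], epochs=4, warmup_steps=100)

score = util.cos_sim(model.encode("doctor"), model.encode("physician"))
```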

4.5.3 Online Trained

The online setting is the most frequently used. In this setting, the pre-trained “roberta-base” is trained on a downstream task, and after that, it is ready for usage in a production environment. By comparing this model with the ones trained from scratch and the pre-trained one, we can conclude what a deep learning model needs to be viable in a constrained and dynamic environment.

5 Evaluation

The evaluation of the models is divided into two phases. First, as mentioned when describing the models, we evaluate each model’s performance when changing internal parameters such as the embedding vector or the context window size. Then, using the best parameter combination for each model, we compare the models with each other.
To evaluate the models, we will use a correlation metric. The objective is to understand how related the models and human predictions are. The correlation r can range from -1 to 1, where -1 indicates a perfect negative linear relationship, 0 indicates no linear relationship, and 1 indicates a perfect linear relationship. The best model has the highest correlation as it closely mimics human judgment. The correlation metric used is Pearson correlation coefficient (PCC), as seen in (10). Besides being a frequently used metric in semantic similarity, it is independent of scale and distance metrics. The rationale is that even on different scales, if the linear correlation between the ground truth and the model is high, the performance is also high.
$$\begin{aligned} PCC(x, y) = \frac{\sum _{i=1}^{n} (x_{i}-\bar{x})(y_{i}-\bar{y})}{\sqrt{\sum _{i=1}^{n}(x_{i}-\bar{x})^{2}}\sqrt{\sum _{i=1}^{n}(y_{i}-\bar{y})^{2}}} \end{aligned}$$
(10)
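In practice the coefficient can be computed directly with SciPy (variable names are illustrative):

```python
from scipy.stats import pearsonr

# human_scores and model_scores: similarity ratings for the same word pairs
pcc, p_value = pearsonr(human_scores, model_scores)
```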
The experiment was run in a virtual machine with 24 vCPUs, 32 GB of RAM, and an NVIDIA RTX 2080. The models were implemented using public Python libraries, except for fastText and GloVe, which used the creators’ binaries. The experiment details are presented in Table 2.
Table 2
Experimental settings

Libraries
  Word2Vec: Gensim\(^{6}\) | TF-IDF: Gensim | GloVe: GloVe\(^{7}\) | fastText: fastText\(^{9}\) | RoBERTa: sBERT\(^{10}\)

Machine Specs
  vCPU cores: 24 | RAM: 32 GB | GPU: Nvidia 2080

Datasets
           | Scenario Specific                               | General
  Training | IoT Semantic - Corpus\(^{12}\)                  | WordSimilarity-353 - Similarity Finkelstein et al. (2002)
  Testing  | IoT Semantic - Similarity Antunes et al. (2017) | Miller-Charles - Similarity Miller and Charles (1991)

5.1 Datasets

To train and evaluate the models, we need data. The first step in choosing the data was deciding on the corpus of unannotated text the models need for training. Since we are considering a scenario with a constrained corpus, we opted to use the corpus collected by our model for every model trained from scratch. The corpus is available at Kaggle12 and contains 2435 files (each file representing a request to the search engine API) with a total size of 150 MB. Besides the corpus, RoBERTa also needs a supervised dataset for the second training step. The dataset chosen was the WordSimilarity-353 Test Collection Finkelstein et al. (2002), which contains 353 word pairs whose similarity was assessed by 29 subjects.
Having the training data sorted, we need to decide on the evaluation data. We opted for the Miller-Charles (MC) dataset Miller and Charles (1991), as it is one of the reference datasets for semantic similarity evaluation. It is composed of 30 word pairs rated by 38 subjects. The word pairs are rated on a scale from 0 (no similarity) to 4 (perfect synonymy). Although the MC dataset is a good evaluation dataset, its word pairs are general knowledge, not specific to the scenarios we consider. To evaluate the models on word pairs from the target scenarios, we designed a semantic similarity dataset for IoT scenarios Antunes et al. (2017). To build the dataset, we mined a popular IoT platform13, extracting the 20 most commonly used terms (ranked by term frequency). Five fellow researchers organized these terms into 30 word pairs and rated them on a scale from 0 to 4. Even though our dataset is not as comprehensive as the MC, it still reaches a 0.8 correlation amongst human classifications.
By reducing the amount of data seen during training (only 150 MB versus hundreds of GBs of data) and using an evaluation dataset with words specific to IoT, we create a constrained scenario focused on evaluating the models’ performance regarding IoT linguistics. Furthermore, the training dataset was constructed from web pages containing the words in the IoT evaluation dataset, so it is also directed at IoT rather than general linguistics.

5.2 TF-IDF & LSI

While evaluating this model, we experimented with multiple context window and topic vector sizes. The values used here were the same we applied in our model and are presented in Table 3.
Table 3
Context window and word embedding vector sizes used

Miller-Charles dataset
  Context window | 3  | 5  | 7
  Word embedding | 43 | 72 | 112

IoT dataset
  Context window | 3  | 5  | 7
  Word embedding | 50 | 88 | 125
Looking at the IoT dataset, it is clear that the model benefits from the extra topics and larger context window, as the best solution, with a PCC of 0.54, is the one with context window size 7 and topic vector size 125. Regarding the MC dataset, the results show the opposite, as the best model, obtaining a PCC of 0.42, is the one with the smallest context window, 3, and topic vector size, 43. This is justifiable given our constrained and focused corpus. Since our corpus is directed at the words in the IoT dataset, when evaluating on it, the model uses the extra topics to improve the characterization of those words. As for the MC dataset, composed of general knowledge words, the extra topics harm the performance as the model includes unrelated topics in the word vector.

5.3 Word2Vec

Similar to LSI, during the Word2Vec evaluation, we experimented with different context window and embedding vector sizes. The values used were the same as those used by the LSI for the same reasons (see Table 3). Besides that, we also evaluated the use of pre-trained word embeddings to understand if there were benefits in collecting the corpus and training the model.
Analyzing the results in the IoT dataset, we can state two things. First, Word2Vec benefited from the bigger context window and embedding vector, as the model obtained the best results with context window size 7 and embedding vector size 125, achieving a PCC of 0.6. Second, even with larger embedding vectors (300 vs. 125), the pre-trained vectors obtained a worse performance, 0.53. Besides the worse performance, it is also important to note that the pre-trained vectors lacked vectors for some of the words, showing the importance of having a corpus built for these specific environments.
As for the MC dataset, the model continues to benefit from the bigger context window and embedding vectors. The model obtained the best results with the context window size of 7 and the embedding vector size of 112, achieving a PCC of 0.64. Even so, here, the pre-trained vectors showed why they are commonly used as they obtained a better performance, achieving a PCC of 0.78. One important thing to remember is that the pre-trained vectors have more than double the size of the ones we trained, which can significantly help their performance, as the model shows that it benefits from larger word embedding vectors.

5.4 GloVe

The GloVe evaluation followed the same reasoning as the one done in Word2Vec. We experiment with different context window and embedding vector sizes (see Table 3) and evaluate some pre-trained vectors. The only difference here is that we have pre-trained vectors with multiple sizes.
In the IoT dataset, the GloVe saw a slight improvement when increasing the size of the context window and embedding vectors, achieving a PCC of 0.51 with the smallest combination and 0.56 with the biggest. When looking at the pre-trained vectors, the same thing happened. As the embedding vectors increased, the model would improve, starting with 0.2 PCC for a vector size of 50 and achieving a PCC of 0.35 with a vector size of 300. Similar to the Word2Vec, besides the pre-trained vectors obtaining worse results than those trained from scratch, some words did not have pre-trained vectors. In the MC, the results were similar. As the embedding vector size increased, so did the model score, achieving in the best case a PCC of 0.56 when trained from scratch and 0.74 when using pre-trained vectors. Even though the embedding vectors have different sizes, 112 for the model trained from scratch and 300 for the pre-trained, when comparing the pre-trained model with an embedding vector size of 100 with the best model from scratch, the pre-trained model also obtains better results, achieving a PCC of 0.63.

5.5 FastText

When evaluating fastText, we experimented with different context window and embedding vector sizes (see Table 3) and evaluated the benefit of using pre-trained vectors as we have done in the previous models. Nevertheless, given the opportunity to do online training, we experimented with it, using the pre-trained vectors to initialize the model’s weights and then train it in our corpus.
In the IoT dataset, the model benefited from the bigger embedding vectors, slightly improving the results obtained as the size increased, achieving a PCC of 0.63 when using the bigger word embedding vectors and context window. The performance of the pre-trained vectors was much worse, obtaining a PCC of 0.31. However, after applying online training, the performance improved, achieving the same result as the model trained from scratch (PCC of 0.63). Similar to Word2Vec and GloVe, there were no pre-trained vectors for some words.
The model followed the same direction regarding the MC dataset, obtaining better results the bigger the word embedding vectors were. The best model trained from scratch obtained a PCC of 0.59, which is inferior to the pre-trained vectors’ PCC of 0.81. When analyzing the online training, a curious thing happened: the model performance was worse, achieving a PCC of 0.6. The worse performance might be caused by the constrained corpus not containing sufficient instances of some words, leading the model to generate incorrect representations.

5.6 RoBERTa

The RoBERTa model was more complex to evaluate, as we considered more training settings. We evaluated the model with only pre-training and with task-specific training. We also tested the model using different word embedding vector sizes. These sizes, presented in Table 4, were slightly different from the ones used in the other models as they needed to be multiples of 12.
Table 4
Context window and word embedding vector sizes used for RoBERTa

Miller-Charles dataset
  Context window | 3  | 5  | 7
  Word embedding | 36 | 72 | 108

IoT dataset
  Context window | 3  | 5  | 7
  Word embedding | 48 | 84 | 120

5.6.1 Models w/ Corpus Training

Starting with the models using only pre-training, in the IoT dataset the RoBERTa model follows the same rationale as the remaining models: the bigger the word embedding vector, the better the performance, achieving a PCC of 0.32 when using a vector size of 120. As for the pre-trained model, even with a bigger embedding vector size, 768, the performance was worse, as it only obtained a PCC of 0.03, predicting most words as similar. In the MC dataset, the results were unclear, as the best performance for the models trained from scratch was obtained by the vectors with size 72 (PCC of 0.1), and the bigger and smaller vectors obtained the same performance (PCC of 0.02). The pre-trained model obtained better results, even though they were still poor (PCC of 0.14).

5.6.2 Models w/ Task Training

The task-trained models obtained interesting results in the IoT dataset. The best model was the one with embedding vector size 80, and its performance, a PCC of 0.12, was worse than the models with no task-specific training. Using the pre-trained model and applying task-specific training, the model obtains relevant results, achieving a PCC of 0.58. In the MC dataset, the results change drastically, as the model obtains near-perfect performance when it has access to task-specific training, reaching the best performance, a PCC of 0.94, with the biggest embedding vectors (size 108). Following the results obtained in the IoT dataset, the online model also obtains the best performance in this dataset, achieving a PCC of 0.95.
Comparing all the models obtained, we can conclude that to obtain good results, the RoBERTa model needs a reasonable corpus for pre-training and a labeled dataset of the target domain for the task training. If the labeled dataset is not directed at the domain, the model will obtain better results with just the pre-training. If the pre-training corpus is directed at the target domain, the model will obtain better results even when the corpus is a fraction of the size of a general-purpose one. When combining both datasets, a better approach is to have a general-purpose corpus, even when the labeled dataset is not directed at the target domain.

5.7 DPW & DPWC

To evaluate the benefit of the different improvements and better understand the best hyper-parameters for the model, we experimented with different context window sizes and dimensionality reduction factors. The values used are presented in Table 5.
Table 5
Context window sizes and dimensionality reduction factors used

  Context window   | 3 | 5 | 7
  Reduction factor | 1 | 2 | 3 | 4 | 5
Regarding the MC dataset, the models with a neighborhood size of seven obtain the best performance. The same happens for the models using NMF for dimensionality reduction, both with and without clustering, even though there is no ideal value for the reduction factor. The models also benefit from clustering as they perform better than the original ones regardless of neighborhood size (when not applying NMF). When applying NMF, the clustering models obtain the best performance, PCC of 0.66, using a reduction factor of 2 and a neighborhood size of 7, an improvement of 0.22 over the best original model.
Analyzing the second dataset, the impact of factorization is less visible when applied directly to the original model, only improving the results when using a neighborhood size of 3. The clustering obtains similar results when implemented alone, scoring worse than the original model. When combining both improvements, the model obtains better results than any original model scoring a PCC of 0.55, an improvement of 0.07 over the best original model.

5.8 Comparison

After analyzing the results individually, we need to compare them to understand the best option depending on the scenario considered. The paper’s focus is constrained and highly dynamic scenarios, so the models’ performance alone is not enough for a good comparison. For a comprehensive evaluation considering such scenarios, we also recorded the models’ size and training/prediction times.

5.8.1 Models Performance

To fairly evaluate the DPWC, we start by comparing it with the other models when trained from scratch. Figures 5 and 6 represent the results obtained by each model’s best combination of parameters in the two datasets considered. Considering the word-pairs in each dataset, the MC can be considered a more general-purpose dataset, while the IoT represents a specific domain. When observing the results in the MC dataset, it is clear that the RoBERTa with task training is the best model, followed by the DPWC. In the IoT dataset, the results are slightly different as the best model is the fastText, closely followed by RoBERTa with just pre-training in our corpus and the DPWC.
Analyzing these results, the DPWC might not seem relevant. However, in the MC dataset, the best model, RoBERTa, needed a corpus and a labeled dataset to obtain the best results. Besides, its performance significantly decreased when applied to a more specific dataset. Regarding the remaining models, fastText is the second-best model, maintaining consistent results between datasets. Even so, when considering a changing environment, fastText requires the retraining of the model to add new words, which is an expensive step that our model removes.
That said, when comparing the DPWC with other available solutions, our model either obtains better results overall, has much cheaper data collection, or is better adapted for dynamic and constrained environments.
Nonetheless, using pre-trained models and word embeddings is a possible approach. Figures 7 and 8 compare our model with the best results obtained by the pre-trained models in the MC and IoT datasets. Starting with the MC dataset, we can see how the models’ performance, excluding RoBERTa’s, improves thanks to the considerably larger general-purpose corpus, obtaining better results than the DPWC. Even so, when considering the IoT dataset, the pre-trained models perform poorly, as the domain-specific words do not appear regularly in the corpus used.
The online models obtain interesting results. In the MC dataset, fastText obtains worse performance, while in the IoT dataset it improves over the pre-trained model. Regarding RoBERTa, the model always benefits from task-specific training.
Even so, given the inability to adapt to the different contexts, the pre-trained and online models are more inconsistent than DPWC, showing that, besides being more lightweight, our model is a better approach than pre-trained word embeddings in constrained and dynamic environments.

5.8.2 Execution time & Model Size

During the models’ evaluation, we experimented with multiple hyperparameters. These hyperparameters influence not only the performance of the model but also the time the model takes to train/predict and its size. As with the performance comparison, when comparing execution time and model size we use the best-performing configuration of each model. The hyperparameters of the models considered in this evaluation are presented in Table 6. Each model was trained and evaluated ten times for a fair evaluation of training/prediction time.
Table 6
Context window sizes and word embedding sizes of the models used for model size and execution time comparison

  Model            | Context Window | Word Embedding / Reduction Factor | Online
  DPWC             | 7              | 2                                 | -
  LSI              | 7              | 125                               | -
  Word2Vec         | 7              | 125                               | N
  GloVe            | 7              | 125                               | N
  fastText         | 7              | 125                               | N
  RoBERTa - Corpus | -              | 84                                | N
  RoBERTa - Task   | -              | -                                 | Y
When proposing models for highly dynamic environments, we want models that can quickly be trained and put into production. Lower training times compensate for the fast decrease in performance caused by environmental changes. In Fig. 9, we present the models’ training times. Here, the DPWC provides the lowest training time, four times faster than the next model (fastText). Compared to the remaining ones, RoBERTa presents an enormous training time even when using a smaller model without task-specific training, taking 15 times longer to train than GloVe and more than 200 times longer than the DPWC.
Since devices in most IoT environments have low computational power, we need models that adapt to such constraints. The time a model takes to predict is an essential indicator of applicability: models with long prediction times cannot be applied to such scenarios. Figure 10 analyzes the models’ prediction times. As we can see, the fastest model is GloVe, followed closely by Word2Vec and later LSI, all orders of magnitude faster than the remaining ones. These outcomes result from such models having all word vectors generated and loaded into memory; when similarity needs to be calculated, these models simply look up the word vectors and calculate their similarity. If we could not load every vector into memory, these models would take far longer to predict, as they would constantly have to load different files. Finally, we have DPWC, fastText, and both RoBERTa models, which calculate the word vectors on the fly. Among these, DPWC is much faster than the remaining models.
In constrained scenarios, another frequent problem is the lack of disk space. The more space is reserved for the ML model, the less can be used to store relevant data. In Fig. 11, we compare the space occupied on disk by the different models. In this analysis, fastText is the model that occupies the most disk space, followed by GloVe and RoBERTa. These models would occupy a considerable amount of storage if placed on an IoT device, so they would not be feasible. Considering the remaining models, the smallest one is DPWC, 36 times smaller than the second smallest model (Word2Vec).
Considering the performance, time to train and predict, and the disk space used, the DPWC model presents the best option. The DPWC presents the lowest training time and disk space usage by a significant amount and a good prediction time with good performance results. Besides, the high adaptability embedded in its design has even more advantages, possibly allowing more prolonged deployment periods without needing retraining or any maintenance.

6 Conclusion

Semantic similarity models can be used to extract, organize, or cluster data based on concepts instead of string matching. This makes them ideal for organizing and optimizing the various devices in 5G and next-generation networks Mukhamediev et al. (2017), Calvanese Strinati and Barbarossa (2021).
In this paper, we reviewed the commonly used semantic similarity models, discussed the limitations of our previous model, explored latent-space methods (through matrix factorization) and clustering to improve the model, and compared it with several corpus-based approaches. Since distributional profiles extracted from web services may contain noisy dimensions and mix several senses of the target word (sense conflation), we applied dimensionality reduction and clustering to mitigate these issues.
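As a rough illustration of the latent step, the sketch below applies truncated SVD to a toy word-by-dimension profile matrix and keeps only the top components determined by a reduction factor. The matrix sizes and the exact factorization details are assumptions made for illustration, not the paper's implementation.

```python
import numpy as np


def latent_reduce(profiles: np.ndarray, reduction_factor: int) -> np.ndarray:
    """Rebuild the profile matrix from its top singular components only."""
    k = max(1, profiles.shape[1] // reduction_factor)  # number of components kept
    u, s, vt = np.linalg.svd(profiles, full_matrices=False)
    return u[:, :k] @ np.diag(s[:k]) @ vt[:k, :]  # denoised reconstruction


# Toy sizes: 30 terms described by 20 profile dimensions.
profiles = np.random.rand(30, 20)
denoised = latent_reduce(profiles, reduction_factor=2)  # keeps 10 components
```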
Using the Miller-Charles dataset and an IoT semantic dataset, our solution was evaluated against TF-IDF & LSI, Word2Vec, GloVe, fastText, and RoBERTa. To build the corpus for the models, we used a single search engine (USearch) that is not well known and might lack several significant results returned by other search engines. On the IoT dataset, our model presents performance similar to the other models, with a maximum decrease of 0.08. On the MC dataset, when considering models trained with the limited corpus, our model obtains better results than 5 out of the 6 models, with RoBERTa obtaining the best performance. Even so, our model obtains these results at a fraction of the computational cost. Besides that, it automatically adapts to new concepts, whereas most of the other models require updating the corpus and retraining, which is far more expensive.
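All of these comparisons use the Pearson correlation coefficient (PCC) between the model's similarity scores and the human-annotated gold scores for the word pairs. A minimal sketch of that computation, with toy values, is shown below.

```python
import numpy as np


def pearson(x, y):
    """Pearson correlation coefficient between two score lists."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return float(np.corrcoef(x, y)[0, 1])


gold = [3.92, 3.05, 0.84]        # human similarity judgements (toy values)
predicted = [0.81, 0.66, 0.20]   # model similarity scores (toy values)
print(f"PCC = {pearson(gold, predicted):.2f}")
```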
Comparing the results obtained on the MC and IoT datasets with pre-trained models also shows that these models could not capture the meaning of the words in the more specific domain: the results decreased significantly, and some embeddings were missing.
We therefore conclude that the results obtained by our model are relevant, as it achieves performance similar to much more complex and computationally expensive models that require a pre-built corpus, longer training periods, and, in some cases, labeled data. Furthermore, the results obtained by our model can still be improved by using hypernyms to learn more abstract dimensions.

Acknowledgements

This work is funded by FCT/MCTES through national funds and, when applicable, co-funded by EU funds under the project UIDB/50008/2020-UIDP/50008/2020.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Appendix

A: Tables with the results obtained

Tables 7, 8, 9, 10, 11, 12, 13, 14.
Table 7
Results obtained by our model (using PCC)

| Context window size | Reduction factor | DPW | DPW latent | DPWC | DPWC latent |
|---|---|---|---|---|---|
| Miller-Charles dataset | | | | | |
|  | 1 |  | 0.35 |  | 0.30 |
|  | 2 |  | 0.41 |  | 0.39 |
| 3 | 3 | 0.30 | -0.1 | 0.35 | 0.14 |
|  | 4 |  | 0.53 |  | 0.40 |
|  | 5 |  | 0.37 |  | 0.33 |
|  | 1 |  | 0.47 |  | 0.54 |
|  | 2 |  | 0.47 |  | 0.55 |
| 5 | 3 | 0.34 | 0.05 | 0.43 | 0.44 |
|  | 4 |  | 0.37 |  | 0.49 |
|  | 5 |  | 0.43 |  | 0.54 |
|  | 1 |  | 0.53 |  | 0.66 |
|  | 2 |  | 0.49 |  | 0.66 |
| 7 | 3 | 0.41 | -0.07 | 0.57 | 0.59 |
|  | 4 |  | 0.46 |  | 0.63 |
|  | 5 |  | 0.54 |  | 0.63 |
| IoT dataset | | | | | |
|  | 1 |  | 0.39 |  | 0.36 |
|  | 2 |  | 0.39 |  | 0.36 |
| 3 | 3 | 0.4 | 0.14 | 0.18 | 0.30 |
|  | 4 |  | 0.30 |  | 0.26 |
|  | 5 |  | 0.45 |  | 0.33 |
|  | 1 |  | 0.42 |  | 0.48 |
|  | 2 |  | 0.44 |  | 0.48 |
| 5 | 3 | 0.24 | 0.24 | 0.40 | 0.55 |
|  | 4 |  | 0.44 |  | 0.51 |
|  | 5 |  | 0.47 |  | 0.48 |
|  | 1 |  | 0.41 |  | 0.45 |
|  | 2 |  | 0.42 |  | 0.45 |
| 7 | 3 | 0.48 | 0.43 | 0.34 | 0.49 |
|  | 4 |  | 0.47 |  | 0.38 |
|  | 5 |  | 0.47 |  | 0.44 |

Bold highlights the best DPW, DPW with latent, DPWC and DPWC with latent models for each dataset. The context window size and the DPW and DPWC values are reported once per context-window block, in the row for reduction factor 3, following the source layout
Table 8
Execution time and disk space for our model with a reduction factor of 2 (times in seconds, model size in megabytes)

| Context window size | Training Time | Inference Time | Model Size |
|---|---|---|---|
| Miller-Charles dataset | | | |
| 3 | \(36.7404 \pm 0.29\) | \(0.007 \pm 0.0018\) | 0.09567 |
| 5 | \(63.3268 \pm 0.21\) | \(0.061 \pm 0.0016\) | 0.29339 |
| 7 | \(93.861 \pm 0.48\) | \(0.172 \pm 0.0024\) | 0.63123 |
| IoT dataset | | | |
| 3 | \(27.0937 \pm 0.25\) | \(0.014 \pm 0.001\) | 0.068104 |
| 5 | \(46.8229 \pm 0.3\) | \(0.092 \pm 0.003\) | 0.206808 |
| 7 | \(64.3312 \pm 0.32\) | \(0.252 \pm 0.001\) | 0.42091 |

Entries in bold highlight the best-performing hyperparameter set for each model
Table 9
Results obtained by the models trained from scratch (using PCC)

| Context window size | Embedding vector size | LSI | Word2Vec | GloVe | fastText | RoBERTa - Corpus | RoBERTa - Task |
|---|---|---|---|---|---|---|---|
| Miller-Charles dataset | | | | | | | |
| 3 | 43 | 0.42 | 0.46 | 0.42 | 0.47 | 0.02 | 0.78 |
| 5 | 75 | 0.26 | 0.56 | 0.44 | 0.53 | 0.10 | 0.87 |
| 7 | 112 | 0.27 | 0.64 | 0.56 | 0.59 | 0.02 | 0.94 |
| IoT dataset | | | | | | | |
| 3 | 50 | 0.09 | 0.54 | 0.51 | 0.58 | 0.05 | 0.19 |
| 5 | 88 | 0.37 | 0.52 | 0.55 | 0.60 | 0.14 | 0.12 |
| 7 | 125 | 0.54 | 0.60 | 0.56 | 0.63 | 0.32 | -0.03 |

Bold highlights the best hyperparameter set for each model in each dataset
Table 10
Results obtained by the models using pre-trained vectors (using PCC)

| Dataset | Word2Vec | GloVe 50 | GloVe 100 | GloVe 200 | GloVe 300 | fastText pre-trained | fastText online | RoBERTa pre-trained | RoBERTa online |
|---|---|---|---|---|---|---|---|---|---|
| Miller-Charles dataset | 0.78 | 0.48 | 0.63 | 0.70 | 0.74 | 0.81 | 0.60 | 0.14 | 0.95 |
| IoT dataset | 0.53 | 0.20 | 0.29 | 0.31 | 0.35 | 0.31 | 0.63 | 0.03 | 0.58 |

Entries in bold highlight the best-performing hyperparameter set for each model
Table 11
Execution times obtained by the models trained from scratch (in seconds)

| Context window size | Embedding vector size | LSI | Word2Vec | GloVe | fastText | RoBERTa - Corpus | RoBERTa - Task |
|---|---|---|---|---|---|---|---|
| Training Time | | | | | | | |
| 3 | 43 | \(454.53 \pm 5.39\) | \(344.24 \pm 1.67\) | \(269.39 \pm 3.76\) | \(94.54 \pm 1.49\) | \(13165.70 \pm 59.12\) | \(13190.90 \pm 59.30\) |
| 3 | 50 | \(440.94 \pm 13.71\) | \(343.44 \pm 2.29\) | \(302.45 \pm 2.07\) | \(99.49 \pm 1.93\) | \(13297.32 \pm 48.82\) | \(13322.42 \pm 48.9\) |
| 5 | 75 | \(484.21 \pm 6.09\) | \(551.25 \pm 1.69\) | \(586.21 \pm 2.76\) | \(158.91 \pm 0.57\) | \(13833.16 \pm 42.34\) | \(13858.2 \pm 42.6\) |
| 5 | 88 | \(485.90 \pm 0.52\) | \(514.05 \pm 1.77\) | \(626.69 \pm 2.93\) | \(167.11 \pm 1.24\) | \(14137.54 \pm 77.70\) | \(14162.6 \pm 77.9\) |
| 7 | 112 | \(609.22 \pm 7.27\) | \(704.42 \pm 1.28\) | \(920.22 \pm 6.23\) | \(235.30 \pm 0.56\) | \(14541.48 \pm 66.46\) | \(14566.7 \pm 66.6\) |
| 7 | 125 | \(658.17 \pm 9.94\) | \(936.76 \pm 2.66\) | \(975.03 \pm 6.04\) | \(253.80 \pm 0.97\) | \(14539.6 \pm 50.95\) | \(14755.5 \pm 51.3\) |
| Prediction Time (for both datasets) | | | | | | | |
| 3 | 43 | \(0.09 \pm 0.02\) | \(0.00091 \pm 0.00002\) | \(0.00074 \pm 0.00017\) | \(0.78185 \pm 0.02800\) | \(2.620 \pm 0.020\) | \(2.820 \pm 0.023\) |
| 3 | 50 | \(0.084 \pm 0.001\) | \(0.00092 \pm 0.00001\) | \(0.00068 \pm 0.00002\) | \(0.80493 \pm 0.01143\) | \(2.700 \pm 0.021\) | \(2.788 \pm 0.023\) |
| 5 | 75 | \(0.086 \pm 0.001\) | \(0.00093 \pm 0.00002\) | \(0.00069 \pm 0.00001\) | \(0.92242 \pm 0.02867\) | \(2.751 \pm 0.017\) | \(2.791 \pm 0.013\) |
| 5 | 88 | \(0.102 \pm 0.051\) | \(0.00092 \pm 0.00002\) | \(0.00069 \pm 0.00002\) | \(0.95418 \pm 0.00795\) | \(2.790 \pm 0.020\) | \(2.810 \pm 0.022\) |
| 7 | 112 | \(0.086 \pm 0.002\) | \(0.00096 \pm 0.00003\) | \(0.00075 \pm 0.00019\) | \(1.09881 \pm 0.02166\) | \(2.811 \pm 0.029\) | \(2.831 \pm 0.032\) |
| 7 | 125 | \(0.091 \pm 0.011\) | \(0.00094 \pm 0.00002\) | \(0.00069 \pm 0.00001\) | \(1.17021 \pm 0.03387\) | \(2.824 \pm 0.021\) | \(2.854 \pm 0.026\) |
Table 12
Execution times obtained by the models using pre-trained vectors (in seconds)

| | Word2Vec | GloVe 50 | GloVe 100 | GloVe 200 | GloVe 300 | fastText pre-trained | fastText online | RoBERTa pre-trained | RoBERTa online |
|---|---|---|---|---|---|---|---|---|---|
| Training Time | - | - | - | - | - | - | \(606.89 \pm 4.04\) | - | \(88445.2 \pm 357.3\) |
| Prediction Time (for both datasets) | 0.00090 ± 0.00002 | 0.00065 ± 0.00001 | 0.00064 ± 0.00002 | 0.00066 ± 0.00002 | 0.00065 ± 0.00002 | 0.00409 ± 0.00008 | 3.80 ± 0.26 | 3.434 ± 0.029 | 3.434 ± 0.027 |
Table 13
Space in disk occupied by the models trained from scratch (in megabytes)

| Context window size | Embedding vector size | LSI | Word2Vec | GloVe | fastText | RoBERTa - Corpus | RoBERTa - Task |
|---|---|---|---|---|---|---|---|
| 3 | 43 | 52.72 | 15.44 | 187.44 | 490.82 | 11.16 | 11.0 |
| 3 | 50 | 52.72 | 15.44 | 217.34 | 569.28 | 14.36 | 14.3 |
| 5 | 75 | 52.72 | 15.44 | 336.18 | 883.13 | 20.86 | 20.7 |
| 5 | 88 | 52.72 | 15.44 | 378.89 | 995.22 | 24.15 | 24.0 |
| 7 | 112 | 52.72 | 15.44 | 481.42 | 1264.23 | 30.81 | 30.7 |
| 7 | 125 | 52.72 | 15.44 | 536.12 | 1409.95 | 34.18 | 34.1 |
Table 14
Space in disk occupied by the models using pre-trained vectors (in megabytes)

| Word2Vec | GloVe 50 | GloVe 100 | GloVe 200 | GloVe 300 | fastText pre-trained | fastText online | RoBERTa pre-trained | RoBERTa online |
|---|---|---|---|---|---|---|---|---|
| 1662.79 | 163.41 | 331.04 | 661.31 | 989.88 | 2154.43 | 5455.91 | 481.2 | 478.1 |
References
Abdalla, M., Vishnubhotla, K., & Mohammad, S. M. (2021). What makes sentences semantically related: A textual relatedness dataset and empirical study. ArXiv, abs/2110.04845.
Afzal, M. K., Zikria, Y. B., Mumtaz, S., Rayes, A., Al-Dulaimi, A., & Guizani, M. (2018). Unlocking 5G spectrum potential for intelligent IoT: Opportunities, challenges, and solutions. IEEE Communications Magazine, 56(10), 92–93.
Bojanowski, P., Grave, E., Joulin, A., & Mikolov, T. (2017). Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics, 5, 135–146.
Bollacker, K., Evans, C., Paritosh, P., Sturge, T., & Taylor, J. (2008). Freebase: A collaboratively created graph database for structuring human knowledge. Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data (pp. 1247–1250). Association for Computing Machinery. https://doi.org/10.1145/1376616.1376746
Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2019). BERT: Pre-training of deep bidirectional transformers for language understanding. Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers) (pp. 4171–4186). Association for Computational Linguistics. https://doi.org/10.18653/v1/N19-1423
Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., & Stoyanov, V. (2019). RoBERTa: A robustly optimized BERT pretraining approach. http://arxiv.org/abs/1907.11692
Mihalcea, R., Corley, C., & Strapparava, C. (2006). Corpus-based and knowledge-based measures of text semantic similarity. Proceedings of the 21st National Conference on Artificial Intelligence (Vol. 1, pp. 775–780).
Mukhamediev, R. I., Aliguliyev, R. M., & Muhamedijeva, J. (2017). Estimation of relationship between domains of ICT semantic network. In D. A. Alexandrov, A. V. Boukhanovsky, A. V. Chugunov, Y. Kabanov, & O. Koltsova (Eds.), Digital Transformation and Global Society (pp. 130–135). Springer International Publishing.
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., & Polosukhin, I. (2017). Attention is all you need. In I. Guyon et al. (Eds.), Advances in Neural Information Processing Systems (Vol. 30). Curran Associates, Inc.