Introduction
-
A more comprehensive range of unsupervised dimensionality reduction techniques is covered, from linear and nonlinear feature extraction to feature selection and manifold learning.
-
The effect of these techniques on pre-computed embeddings from both pre-trained and fine-tuned models is explored on the Semantic Textual Similarity (STS) task.
-
In contrast to previous work that explored reduction techniques on classical static word embeddings, this paper investigates the effect of dimensionality reduction on state-of-the-art contextual transformer models.
-
Unlike previous work focused on English, this research analyses multilingual models, removing the language bottleneck that limits the applicability of dimensionality reduction to the embeddings of these models.
Related Work
Dimensionality Reduction Techniques
Dimensional Reduction of Embeddings
Siamese and Non-Siamese Architectures
Importance of Multilingual Semantics
Methodology
Dimensionality Reduction Techniques
-
Principal Component Analysis (PCA): Principal Component Analysis [45, 46] is a powerful unsupervised linear feature extraction technique that computes, from the covariance matrix, a set of orthogonal directions that capture most of the variance in the data [62]. That is, it creates new uncorrelated variables that maximise variance while retaining most of the existing structure in the data. It is also important to note that this research uses a variant of PCA known as Incremental Principal Component Analysis (IPCA) [63]. This variant follows the same basic principles as PCA but is much more memory efficient: it applies PCA in batches, avoiding storing the entire dataset in memory and allowing PCA to be applied to large datasets.
-
Independent Component Analysis (ICA) [64]: Independent Component Analysis is an unsupervised probabilistic feature extraction method that learns a linear transformation into components that are maximally independent of each other and non-Gaussian (non-normal), while jointly maximising the mutual information with the original feature space.
-
Kernel Principal Component Analysis (KPCA) [65]: A kernel-based learning method for PCA. It uses kernel functions to construct a nonlinear version of the linear PCA algorithm by first implicitly mapping the data into a nonlinear feature space and then performing linear PCA on the mapped patterns [62]. The kernels considered in this project are the Polynomial, Gaussian RBF, Hyperbolic Tangent (Sigmoid), and Cosine kernels.
-
Variance Threshold: An unsupervised feature selection approach that removes all features whose variance falls below a given threshold. In effect, it selects the subset of features with the largest variances, considered the most informative, without taking the desired outputs into account.
-
Uniform Manifold Approximation and Projection for Dimension Reduction (UMAP): The authors of UMAP [66] describe it as an algorithm for unsupervised dimension reduction based on manifold learning techniques and topological data analysis. In short, it first embeds data points in a nonlinear fuzzy topological representation using neighbour graphs. Secondly, it learns a low-dimensional representation that preserves as much of the information in this space as possible by minimising cross-entropy. Compared to counterparts such as t-SNE, UMAP is fast, scalable, and allows better control of the desired balance between the local and global structure to be preserved. Two main parameters play a vital role in controlling this balance: (1) the number of sample points that defines a local neighbourhood in the first step, and (2) the minimum distance allowed between embedded points in the low-dimensional space in the second step. Larger numbers of neighbours tend to preserve more global information in the manifold, as UMAP has to consider larger neighbourhoods to embed a point. Likewise, larger minimum distance values prevent UMAP from packing points too tightly together, helping to preserve the overall topological structure. This trade-off is illustrated in the sketch below.
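As an illustration of this trade-off, the following minimal sketch (assuming the umap-learn package and a placeholder embedding matrix, not the configuration used in the experiments) fits two UMAP reducers with contrasting neighbourhood and minimum-distance settings:

```python
# Minimal sketch (assumed umap-learn API, placeholder data, not the paper's exact
# configuration): two UMAP reducers with contrasting settings for the two
# parameters discussed above.
import numpy as np
import umap

embeddings = np.random.rand(1000, 768).astype(np.float32)  # stands in for sentence embeddings

# Small neighbourhoods and a small minimum distance favour local structure;
# larger values push UMAP towards preserving the global topology.
local_view = umap.UMAP(n_neighbors=5, min_dist=0.1, n_components=50, metric="cosine")
global_view = umap.UMAP(n_neighbors=100, min_dist=1.0, n_components=50, metric="cosine")

print(local_view.fit_transform(embeddings).shape)   # (1000, 50)
print(global_view.fit_transform(embeddings).shape)  # (1000, 50)
```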
| | PCA | KPCA | ICA | Variance Threshold | UMAP |
|---|---|---|---|---|---|
| Preprocessor | Standard | Standard | MinMax | MinMax | |
| Scaling | ✓ | ✓ | ✓ | ✓ | ✓ |
| Normalisation | ✓ | ✓ | | | |
| Unsupervised | ✓ | ✓ | ✓ | ✓ | ✓ |
| Feature selection | | | | ✓ | |
| Feature extraction | ✓ | ✓ | ✓ | | ✓ |
| Linear | ✓ | | ✓ | | |
| Nonlinear | | ✓ | | | ✓ |
Technique | Parameters |
---|---|
ICA | random_state = 0, max_iter = 320, whiten = True, tol = 5e-4 |
KPCA | kernels = [sigmoid, polynomial, rbf, cosine], eigen_solver = arpack, copy_X = False, random_state = 0 |
Variance Threshold | threshold = [Min, Max, Decile of variance] |
UMAP | pre-computed_knn = True, metric = cosine, min_dist = 1, n_neighbors = [5, 10, 50, 100, 125], angular_rp_forest = True |
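The following sketch shows how reducers with the parameters listed above might be instantiated with scikit-learn and umap-learn. The target dimensionality and the concrete variance threshold value are illustrative assumptions; the experiments sweep these values rather than fixing them.

```python
# Hedged sketch: instantiating the reducers with the parameters reported above,
# using scikit-learn and umap-learn. n_components and the variance threshold
# are illustrative assumptions, not the exact grid explored in the paper.
from sklearn.decomposition import FastICA, IncrementalPCA, KernelPCA
from sklearn.feature_selection import VarianceThreshold
import umap

n_components = 128  # assumed target dimensionality for illustration

reducers = {
    "IPCA": IncrementalPCA(n_components=n_components),
    "ICA": FastICA(n_components=n_components, random_state=0, max_iter=320,
                   whiten="unit-variance",  # the table's whiten = True, in the newer scikit-learn API
                   tol=5e-4),
    "KPCA-sigmoid": KernelPCA(n_components=n_components, kernel="sigmoid",
                              eigen_solver="arpack", copy_X=False, random_state=0),
    "VarThres": VarianceThreshold(threshold=0.01),  # swept over min/deciles/max of feature variances
    "UMAP": umap.UMAP(n_components=n_components, metric="cosine", min_dist=1.0,
                      n_neighbors=100, angular_rp_forest=True),
}

# Every reducer exposes the same interface:
# reducer.fit(train_embeddings); reduced = reducer.transform(test_embeddings)
```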
Multilingual Models
-
bert-base-multilingual-cased: BERT [4] transformer model with \(\sim\)177M parameters, pre-trained on a large corpus of Wikipedia articles covering 104 languages using the self-supervised masked language modelling (MLM) objective.
-
distilbert-base-multilingual-cased: Distilled version of the previous model, on average twice as fast, with \(\sim\)134M parameters [69].
-
LaBSE: Language-agnostic BERT Sentence Embedding [72] model trained to encode translation pairs and reduce the cosine distance between them with a siamese architecture based on BERT, a task closely related to semantic similarity. It was trained on over 6 billion translation pairs covering 109 languages. The authors also report zero-shot capabilities, producing decent results for languages not seen during training. A sketch of how these checkpoints can be loaded as sentence encoders follows this list.
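The sketch below illustrates one way to obtain fixed-size sentence embeddings from these checkpoints using the sentence-transformers library. Mean pooling over token embeddings is an assumption made here for the non-siamese models; LaBSE is loaded directly from its Hugging Face distribution.

```python
# Hedged sketch (not the authors' exact pipeline): obtaining fixed-size sentence
# embeddings from the checkpoints above with the sentence-transformers library.
# Mean pooling over token embeddings is an assumption for the non-siamese models;
# LaBSE already ships as a siamese sentence encoder on the Hugging Face Hub.
from sentence_transformers import SentenceTransformer, models

def load_encoder(model_name: str) -> SentenceTransformer:
    """Wrap a Hugging Face checkpoint with a mean-pooling layer."""
    transformer = models.Transformer(model_name, max_seq_length=128)
    pooling = models.Pooling(transformer.get_word_embedding_dimension(),
                             pooling_mode="mean")
    return SentenceTransformer(modules=[transformer, pooling])

mbert = load_encoder("bert-base-multilingual-cased")
labse = SentenceTransformer("sentence-transformers/LaBSE")

sentences = ["A man is playing a guitar.", "Un hombre toca la guitarra."]
print(mbert.encode(sentences).shape)  # (2, 768)
print(labse.encode(sentences).shape)  # (2, 768)
```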
Evaluation Approaches
-
Approach 1 — Pre-trained models. In the first approach, we employ and directly evaluate the pre-trained models on the mSTSb test split without applying any dimensionality reduction. This approach is used as the baseline for Approach 3.
-
Approach 2 — Fine-tuned models. In this second approach, the pre-trained models are fine-tuned on the downstream task using the mSTSb train split and evaluated on the mSTSb test split without applying any dimensionality reduction technique. This approach is used as the baseline for Approach 4. The fine-tuning process is discussed in more detail in the "Transformers Fine-tuning" section.
-
Approach 3 — Reduced embeddings from pre-trained models. In this approach, the embeddings generated on the mSTSb train split by the pre-trained models from Approach 1 are used to fit the different dimensionality reduction techniques, which are then evaluated on the mSTSb test split. Thus, comparing the results of Approach 1 and Approach 3 helps to understand the impact of dimensionality reduction techniques on the embeddings of pre-trained models.
-
Approach 4 — Reduced embeddings from fine-tuned models. This approach is equivalent to Approach 3 but uses the fine-tuned models from Approach 2, allowing us to assess the impact of dimensionality reduction techniques on fine-tuned embeddings. A sketch of this evaluation pipeline is given below.
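The following minimal sketch outlines how Approaches 1 and 3 can be evaluated, assuming an encoder and a reducer like those sketched earlier; `train_pairs`, `test_pairs`, and `gold_scores` are hypothetical placeholders for the mSTSb sentence pairs and gold similarity scores.

```python
# Illustrative sketch of Approaches 1 and 3 (assumptions: an encoder and reducer
# like those sketched earlier; `train_pairs`, `test_pairs`, and `gold_scores` are
# hypothetical placeholders for the mSTSb sentence pairs and gold similarity scores).
import numpy as np
from scipy.stats import spearmanr
from sklearn.decomposition import FastICA

def sts_spearman(encoder, test_pairs, gold_scores, reducer=None):
    """Spearman correlation between cosine similarities and gold STS scores."""
    s1 = encoder.encode([a for a, _ in test_pairs])
    s2 = encoder.encode([b for _, b in test_pairs])
    if reducer is not None:  # Approach 3/4: project the test embeddings
        s1, s2 = reducer.transform(s1), reducer.transform(s2)
    cos = np.sum(s1 * s2, axis=1) / (np.linalg.norm(s1, axis=1) * np.linalg.norm(s2, axis=1))
    return spearmanr(cos, gold_scores).correlation

# Approach 1: raw pre-trained embeddings.
# baseline = sts_spearman(mbert, test_pairs, gold_scores)

# Approach 3: fit the reducer on train-split embeddings, then evaluate on the test split.
# train_emb = mbert.encode([s for pair in train_pairs for s in pair])
# ica = FastICA(n_components=128, random_state=0).fit(train_emb)
# reduced = sts_spearman(mbert, test_pairs, gold_scores, reducer=ica)
```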
Experimental Setup
Data
Computational Resources
Baseline Approaches: Approach 1 and Approach 2
Transformers Fine-tuning
Dimensionality Reduction Techniques Fitting: Approach 3 and Approach 4
Statistical Comparison
Model | Ap. 1 \(r_s\) | Best Technique | Dimensions | Ap. 3 \(r_s\) | Fitting Time |
---|---|---|---|---|---|
bert-base-multilingual-cased | 0.4342 | ICA | 209 | 0.5019 | 4 m 16 s |
distilbert-base-multilingual-cased | 0.4531 | ICA | 169 | 0.523 | 2 m 47 s |
xlm-roberta-base | 0.3274 | ICA | 249 | 0.5269 | 7 m 51 s |
xlm-roberta-large | 0.2855 | ICA | 1024 | 0.5392 | 31 m 22 s |
LaBSE | 0.7096 | ICA | 129 | 0.7488 | 2 m 27 s |
Model | Ap. 2 \(r_s\) | Best Technique | Dimensions | Ap. 4 \(r_s\) | Fitting Time |
---|---|---|---|---|---|
bert-base-multilingual-cased-fine-tuned | 0.7045 | ICA | 568 | 0.7117 | 12 m 38 s |
distilbert-base-multilingual-cased-fine-tuned | 0.6863 | VarThres | 692 | 0.6842 | 2 s |
xlm-roberta-base-fine-tuned | 0.7470 | VarThres | 673 | 0.7495 | 3 s |
xlm-roberta-large-fine-tuned | 0.8150 | KPCA-sigmoid | 1024 | 0.8176 | 20 m 6 s |
LaBSE-fine-tuned | 0.8242 | KPCA-sigmoid | 768 | 0.8243 | 19 m 25 s |
Results
Approach 1 vs Approach 3: Dimensionality Reduction in Pre-trained Embeddings
| Model (Ap. 1 Avg \(r_s\)) | Technique | Threshold Performance Retained | Dimensions (% reduction) | Ap. 3 Avg \(r_s\) | Fitting Time |
|---|---|---|---|---|---|
| bert-base-multilingual-cased (0.4342) | IPCA | 100% | 209 (73%) | 0.4251 | 27 s |
| | ICA | 100% | 89 (88%) | 0.4779 | 1 m 10 s |
| | poly | 95% | 249 (68%) | 0.4130 | 49 s |
| | rbf | 95% | 448 (42%) | 0.4138 | 1 m 38 s |
| | sigmoid | 100% | 129 (83%) | 0.4425 | 40 s |
| | cosine | 100% | 209 (73%) | 0.4350 | 36 s |
| | UMAP | 50% | 129 (83%) | 0.2176 | 35 s |
| | VarThres | 85% | 161 (79%) | 0.3727 | 2 s |
| distilbert-base-multilingual-cased (0.4531) | IPCA | 100% | 209 (73%) | 0.4553 | 38 s |
| | ICA | 100% | 49 (94%) | 0.4564 | 43 s |
| | poly | 95% | 369 (52%) | 0.4310 | 1 m 6 s |
| | rbf | 95% | 608 (21%) | 0.4305 | 2 m 13 s |
| | sigmoid | 100% | 129 (83%) | 0.4642 | 38 s |
| | cosine | 100% | 209 (73%) | 0.4537 | 33 s |
| | UMAP | 40% | 49 (94%) | 0.3942 | 18 s |
| | VarThres | 95% | 238 (69%) | 0.438 | 2 s |
| xlm-roberta-base (0.3274) | IPCA | 100% | 89 (88%) | 0.3711 | 38 s |
| | ICA | 100% | 49 (94%) | 0.4043 | 56 s |
| | poly | 100% | 129 (83%) | 0.3439 | 25 s |
| | rbf | 100% | 129 (83%) | 0.3439 | 50 s |
| | sigmoid | 100% | 49 (94%) | 0.3425 | 29 s |
| | cosine | 100% | 89 (88%) | 0.3709 | 15 s |
| | UMAP | 40% | 10 (99%) | 0.1320 | 15 s |
| | VarThres | 100% | 52 (93%) | 0.3310 | 2 s |
| xlm-roberta-large (0.2885) | IPCA | 100% | 116 (89%) | 0.3149 | 1 m 37 s |
| | ICA | 100% | 63 (94%) | 0.3642 | 1 m 52 s |
| | poly | 100% | 223 (78%) | 0.2927 | 45 s |
| | rbf | 100% | 276 (73%) | 0.2934 | 1 m 8 s |
| | sigmoid | 100% | 63 (94%) | 0.2927 | 32 s |
| | cosine | 100% | 116 (89%) | 0.3191 | 25 s |
| | UMAP | 45% | 10 (99%) | 0.1365 | 15 s |
| | VarThres | 100% | 598 (42%) | 0.2917 | 3 s |
| LaBSE (0.7096) | IPCA | 100% | 129 (83%) | 0.7251 | 37 s |
| | ICA | 100% | 89 (88%) | 0.7431 | 1 m 35 s |
| | poly | 100% | 169 (78%) | 0.7181 | 34 s |
| | rbf | 100% | 408 (47%) | 0.7106 | 1 m 35 s |
| | sigmoid | 100% | 89 (88%) | 0.7127 | 34 s |
| | cosine | 100% | 129 (83%) | 0.7232 | 21 s |
| | UMAP | 70% | 10 (99%) | 0.5026 | 33 s |
| | VarThres | 85% | 217 (72%) | 0.6148 | 2 s |
| Model (Ap. 2 Avg \(r_s\)) | Technique | Threshold Performance Retained | Dimensions (% reduction) | Ap. 4 Avg \(r_s\) | Fitting Time |
|---|---|---|---|---|---|
| bert-base-multilingual-cased-fine-tuned (0.7045) | IPCA | 95% | 49 (94%) | 0.6710 | 34 s |
| | ICA | 100% | 169 (78%) | 0.7047 | 3 m 37 s |
| | poly | 95% | 129 (83%) | 0.6716 | 31 s |
| | rbf | 95% | 169 (78%) | 0.6738 | 54 s |
| | sigmoid | 100% | 329 (57%) | 0.7048 | 1 m 34 s |
| | cosine | 95% | 49 (94%) | 0.6707 | 11 s |
| | UMAP | 70% | 10 (99%) | 0.5398 | 32 s |
| | VarThres | 100% | 393 (53%) | 0.7046 | 2 s |
| distilbert-base-multilingual-cased-fine-tuned (0.6863) | IPCA | 95% | 49 (94%) | 0.6533 | 35 s |
| | ICA | 95% | 49 (94%) | 0.6556 | 56 s |
| | poly | 95% | 129 (83%) | 0.6542 | 30 s |
| | rbf | 95% | 129 (83%) | 0.6520 | 49 s |
| | sigmoid | 95% | 49 (94%) | 0.6601 | 29 s |
| | cosine | 95% | 89 (88%) | 0.6631 | 16 s |
| | UMAP | 75% | 10 (99%) | 0.5189 | 25 s |
| | VarThres | 95% | 66 (91%) | 0.6620 | 2 s |
| xlm-roberta-base-fine-tuned (0.7470) | IPCA | 95% | 49 (94%) | 0.7198 | 36 s |
| | ICA | 95% | 49 (94%) | 0.7208 | 59 s |
| | poly | 95% | 89 (88%) | 0.7112 | 25 s |
| | rbf | 95% | 129 (83%) | 0.7134 | 49 s |
| | sigmoid | 100% | 289 (62%) | 0.7472 | 1 m 45 s |
| | cosine | 95% | 49 (94%) | 0.7195 | 11 s |
| | UMAP | 75% | 10 (99%) | 0.5724 | 31 s |
| | VarThres | 100% | 411 (46%) | 0.7491 | 3 s |
| xlm-roberta-large-fine-tuned (0.8150) | IPCA | 95% | 63 (94%) | 0.7910 | 51 s |
| | ICA | 95% | 63 (94%) | 0.7950 | 1 m 16 s |
| | poly | 95% | 63 (94%) | 0.7774 | 23 s |
| | rbf | 95% | 63 (94%) | 0.7760 | 42 s |
| | sigmoid | 100% | 223 (78%) | 0.8151 | 50 s |
| | cosine | 95% | 63 (94%) | 0.7916 | 13 s |
| | UMAP | 80% | 10 (99%) | 0.6584 | 38 s |
| | VarThres | 95% | 95 (91%) | 0.7936 | 3 s |
| LaBSE-fine-tuned (0.8242) | IPCA | 95% | 89 (88%) | 0.8014 | 34 s |
| | ICA | 95% | 89 (88%) | 0.7986 | 2 m 3 s |
| | poly | 95% | 89 (88%) | 0.7898 | 25 s |
| | rbf | 95% | 129 (83%) | 0.7932 | 47 s |
| | sigmoid | 100% | 728 (5%) | 0.8243 | 7 m 11 s |
| | cosine | 95% | 89 (88%) | 0.8001 | 21 s |
| | UMAP | 80% | 23 (97%) | 0.6640 | 35 s |
| | VarThres | 95% | 227 (70%) | 0.7964 | 2 s |
Approach 2 vs Approach 4: Dimensionality Reduction in Fine-tuned Embeddings
| Model | Technique | Time at Min Dimension | Time at In-between Dimension | Time at Max Dimension |
|---|---|---|---|---|
| bert-base-multilingual-cased | IPCA | 27 s (10) | 29 s (448) | 30 s (768) |
| | ICA | 53 s (10) | 8 m 43 s (448) | 52 m 48 s (768) |
| | poly | 18 s (10) | 1 m 26 s (448) | 2 m 17 s (768) |
| | rbf | 36 s (10) | 1 m 38 s (448) | 3 m 29 s (768) |
| | sigmoid | 24 s (10) | 2 m 15 s (448) | 5 m 53 s (768) |
| | cosine | 8 s (10) | 55 s (408) | 1 m 24 s (768) |
| | UMAP | 20 s (10) | 1 m 30 s (448) | 3 m 29 s (768) |
| | VarThres | 2 s (65) | 2 s (516) | 2 s (767) |
| distilbert-base-multilingual-cased | IPCA | 36 s (10) | 40 s (448) | 44 s (768) |
| | ICA | 29 s (10) | 8 m 30 s (448) | 29 m 31 s (768) |
| | poly | 16 s (10) | 1 m 24 s (448) | 2 m 35 s (768) |
| | rbf | 35 s (10) | 1 m 44 s (448) | 2 m 57 s (768) |
| | sigmoid | 23 s (10) | 3 m 58 s (448) | 15 m 40 s (768) |
| | cosine | 8 s (10) | 51 s (448) | 1 m 51 s (768) |
| | UMAP | 13 s (10) | 1 m 38 s (448) | 4 m 19 s (768) |
| | VarThres | 2 s (66) | 2 s (507) | 2 s (767) |
| xlm-roberta-base | IPCA | 35 s (10) | 35 s (448) | 40 s (768) |
| | ICA | 56 s (10) | 9 m 54 s (448) | 21 m 9 s (768) |
| | poly | 16 s (10) | 1 m 2 s (448) | 2 m 12 s (768) |
| | rbf | 33 s (10) | 1 m 38 s (448) | 2 m 2 s (768) |
| | sigmoid | 23 s (10) | 1 m 43 s (448) | 3 m 26 s (768) |
| | cosine | 7 s (10) | 56 s (448) | 1 m 36 s (768) |
| | UMAP | 15 s (10) | 2 m (448) | 4 m 32 s (768) |
| | VarThres | 2 s (1) | 2 s (448) | 2 s (768) |
| xlm-roberta-large | IPCA | 54 s (10) | 54 s (490) | 1 m 46 s (1024) |
| | ICA | 57 s (10) | 23 m 13 s (490) | 32 m 15 s (1024) |
| | poly | 16 s (10) | 1 m 24 s (490) | 3 m 35 s (1024) |
| | rbf | 35 s (10) | 1 m 30 s (490) | 4 m 49 s (1024) |
| | sigmoid | 23 s (10) | 2 m 58 s (490) | 8 m 28 s (1024) |
| | cosine | 9 s (10) | 1 m 38 s (490) | 2 m 41 s (1024) |
| | UMAP | 15 s (10) | 1 m 57 s (490) | 7 m 24 s (1024) |
| | VarThres | 3 s (10) | 3 s (598) | 3 s (1024) |
| LaBSE | IPCA | 35 s (10) | 36 s (448) | 39 s (768) |
| | ICA | 36 s (10) | 12 m 55 s (448) | 52 m 39 s (768) |
| | poly | 17 s (10) | 1 m 24 s (448) | 2 m 35 s (768) |
| | rbf | 35 s (10) | 1 m 41 s (448) | 2 m 50 s (768) |
| | sigmoid | 24 s (10) | 4 m 25 s (448) | 19 m 37 s (768) |
| | cosine | 8 s (10) | 55 s (448) | 1 m 44 s (768) |
| | UMAP | 33 s (10) | 2 m 11 s (448) | 4 m 18 s (768) |
| | VarThres | 2 s (94) | 2 s (615) | 2 s (767) |
Model | Time | Hyperparameters |
---|---|---|
bert-base-multilingual-cased-fine-tuned | 39 m 18 s | batch_size: 32, lr: 2e-5, epochs: 2, scheduler: warmuplinear, warmup_ratio: 0.2, weight_decay: 0.2 |
distilbert-base-multilingual-cased-fine-tuned | 15 m 31 s | batch_size: 64, lr: 2e-5, epochs: 2, scheduler: warmuplinear, warmup_ratio: 0.3, weight_decay: 0.7 |
xlm-roberta-base-fine-tuned | 27 m 49 s | batch_size: 64, lr: 5e-5, epochs: 2, scheduler: warmuplinear_hard_restarts, warmup_ratio: 0.1, weight_decay: 0.5 |
xlm-roberta-large-fine-tuned | 1 h 2 m 32 s | batch_size: 64, lr: 1e-5, epochs: 2, scheduler: warmupcosine, warmup_ratio: 0.2, weight_decay: 0 |
LaBSE-fine-tuned | 59 m 39 s | batch_size: 32, lr: 3e-6, epochs: 2, scheduler: warmupcosine, warmup_ratio: 0.1, weight_decay: 0.5 |
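For reference, the sketch below outlines how such a fine-tuning run could be expressed with the sentence-transformers training API and a cosine-similarity regression loss. The hyperparameters mirror the first row of the table above, while the placeholder training data and the warmup-step computation from warmup_ratio are illustrative assumptions, not the authors' exact script.

```python
# Hedged sketch of one fine-tuning run (Approach 2), assuming the
# sentence-transformers `fit` training API and a cosine-similarity regression
# loss on the mSTSb train split. Hyperparameters mirror the first row of the
# table above; the single InputExample is a placeholder for the real data, and
# deriving warmup_steps from warmup_ratio is an illustrative assumption.
from torch.utils.data import DataLoader
from sentence_transformers import InputExample, SentenceTransformer, losses, models

# Build the encoder (mean pooling, as in the earlier sketch).
transformer = models.Transformer("bert-base-multilingual-cased", max_seq_length=128)
pooling = models.Pooling(transformer.get_word_embedding_dimension(), pooling_mode="mean")
model = SentenceTransformer(modules=[transformer, pooling])

# Placeholder training data: (sentence1, sentence2, similarity score in [0, 1]).
train_examples = [InputExample(texts=["A man plays guitar.",
                                      "Un hombre toca la guitarra."], label=0.9)]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=32)
train_loss = losses.CosineSimilarityLoss(model)

epochs = 2
warmup_steps = int(0.2 * epochs * len(train_dataloader))  # warmup_ratio = 0.2

model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=epochs,
    scheduler="warmuplinear",
    warmup_steps=warmup_steps,
    optimizer_params={"lr": 2e-5},
    weight_decay=0.2,
)
```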