Annals of Data Science OnlineFirst articles

Semiparametric Latent Topic Modeling on Consumer-Generated Corpora

  • Open Access
  • Original Article

Common methods used for topic modeling have generally suffered problems of overfitting, leading to diminished predictive performance, as well as a weakness towards reconstructing sparse topic structures that involve only a few critical words to …

Identifying the Intents Behind Website Visits by Employing Unsupervised Machine Learning Models

  • Open Access
  • Review Article

With digitisation globally on the rise, corporates are compelled to better understand the usage of their websites. In doing so, corporates will be empowered to better understand consumers, and make necessary adjustments to ultimately improve the …

The Use of Data Mining in the Management of the Career Guidance Work of the University

  • Original Article

This study explores the application of data mining techniques to analyse factors influencing university choice and predict enrolment trends in Kazakhstan. For this purpose, methods of analysis (multiple correlation and regression analysis, factor …

Sentiment-Based Hierarchical Deep Learning Framework Using Hybrid Optimization for Course Recommendation in E-learning

  • Original Article

Course recommendation (CD) is essential for success in a student’s educational journey. Due to the variations in student’s knowledge system, it might be difficult to select the course content from online educational platforms. This problem is …

Generalized Alpha Power Inverted Weibull Distribution: Application of Air Pollution in Kathmandu, Nepal

  • Original Article

A novel probability distribution, the Generalized Alpha Power Inverted Weibull (GAPIW) distribution, is derived from the generalization of the $$\alpha$$-power family and compounded with the inverted Weibull distribution. The researchers looked …

On the xgamma k-record values and associated inference

  • Original Article

The xgamma distribution was first introduced by Sen et al. [1] as an alternative distribution to the exponential model. The xgamma distribution exhibits a bathtub-shaped hazard rate function, so it is suitable for many lifetime phenomena. In this …

Deep Enhancement in Supplychain Management with Adaptive Serial Cascaded Autoencoder with Long Short Term Memory and Multi-layered Perceptron Framework

  • Original Article

Recognizing and reducing risk is a major part of Supply Chain Management (SCM). Several companies are invested in Supply Chain Risk Management (SCRM) and they have the knowledge about the procurement occupancies within their companies and take …

Statistical Data-Driven Modelling and Forecasting: An Application to COVID-19 Pandemic

  • Original Article

One of the key objectives of statistics is to provide a model compatible with the data generated by an unknown random process. Often, it happens that the unknown process is intractable, and no prior data or information associated with the unknown …

Sentiment Analysis of Hate Speech on Women in Social Media Platform Using Multi Label Classification

  • Original Article

We live in a world where everything is connected to online social media platforms, and people use social media networks like Facebook, Twitter, Instagram, WhatsApp, etc. In the present scenario, working women, celebrities, sports persons …

Beyond Regular SPC: Bridging the Capability Index for (a)Symmetric Data

  • Original Article

The advancement of technology has increased competitiveness, especially in the manufacturing industry. Alongside Statistical Process Control (SPC), capacity indices are tools used to measure the quality of processes and are useful for establishing …

Modeling and Analysis of Trading Volume and Stock Return Data Using Bivariate q-Gaussian Distribution

  • Original Article

Two known characteristics of the distribution of stock returns (price fluctuations) and, more recently, the distribution of financial asset volumes are power laws and scaling. These power laws can be viewed as the asymptotic behaviour of …

A Novel Finite Mixture Model Based on the Generalized t Distributions with Two-Sided Censored Data

In light of the rapid technological advancements witnessed in recent decades, numerous disciplines have been inundated with voluminous datasets characterized by multimodality, heavy-tailed distributions, and prevalent missing information.

Exploring the Potential of the Kumaraswamy Discrete Half-Logistic Distribution in Data Science Scanning and Decision-Making

Data science often employs discrete probability distributions to model and analyze various phenomena. These distributions are particularly useful when dealing with data that can be categorized into distinct outcomes or events. This study presents …

Determining the Correlation among the Users' Satisfaction and Familiarity with Malay Entrepreneurs Food Delivery Mobile Applications in Malaysia

The rise of mobile technology has significantly transformed numerous aspects of our everyday lives, especially within food delivery services. The investigation aims to explore the food delivery mobile apps (FDMA) satisfaction (SAT) and the …

Designing Supply Chain Management Pattern in Small Scale Integrated Commercial Agriculture

This paper has investigated an empirical study to consider the impact of supply chain management on small scale integrated commercial agriculture by focusing on the moderator role of impediments and obligations to offer solutions for agricultural …

The Modified Lindley Distribution Through Convex Combination with Applications in Engineering

This paper introduces a Modified Lindley distribution using a convex combination of exponential and gamma distribution. The fundamental properties of the proposed distribution such as the shapes of the distribution, moments, mean, variance …

Gated Graph Attention-based Crossover Snake (GGA-CS) Algorithm for Hyperspectral Image Classification

Hyperspectral image classification involves assigning pixels or regions within a hyperspectral image to specific classes or categories based on the spectral information captured across multiple bands. Traditional methods face several challenges …

Kernel-free Reduced Quadratic Surface Support Vector Machine with 0-1 Loss Function and L$$_{p}$$-norm Regularization

This paper presents a novel nonlinear binary classification method, namely the kernel-free reduced quadratic surface support vector machine with 0-1 loss function and L$$_{p}$$-norm regularization (L$$_{p}$$-RQSSVM$$_{0/1}$$). It …

An Empirical Study of Nature-Inspired Algorithms for Feature Selection in Medical Applications

Nature-inspired algorithms (NIA) have proven to be a potential tool for solving intricate optimization problems and aid in the development of better computational techniques. In recent years, these algorithms have raised considerable interest to …

Comparative Analysis of Machine Learning Techniques for Imbalanced Genetic Data

Advancements in genome sequencing technologies have significantly increased the availability of genomic data. The use of machine learning models to predict the pathogenicity or clinical significance of genetic mutations is crucial. However …

Non-negative Sparse Matrix Factorization for Soft Clustering of Territory Risk Analysis

Developing effective methodologies for territory design and relativity estimation is crucial in auto insurance rate filings and reviews. This study introduces a novel approach utilizing fuzzy clustering to enhance the design process of territories …

The Effect of Company Size, Profitability, Leverage, Media Exposure, and Liquidity on Carbon Emissions Disclosure

Carbon emissions disclosure (CED) has become a pivotal aspect of corporate sustainability efforts, reflecting a company’s commitment to environmental responsibility and accountability. This study delves into the complex connection between CED and …

Partial Label Learning with Noisy Labels

Partial label learning (PLL) is a particular problem setting within weakly supervised learning. In PLL, each sample corresponds to a candidate label set in which only one label is true. However, in some practical application scenarios, the …

Kernel Method for Estimating Matusita Overlapping Coefficient Using Numerical Approximations

In this paper, a nonparametric kernel method is introduced to estimate the well-known overlapping coefficient, Matusita $$\rho(X,Y)$$, between two random variables $$X$$ and $$Y$$. Due to the complexity of finding the formula …

Maximum Likelihood Estimation for Generalized Inflated Power Series Distributions

In this paper we first define the class of Generalized Inflated Power Series Distributions (GIPSDs), which contains the inflated discrete distributions most often seen in practice as special cases. We describe the hitherto unknown exponential family …

Farm-Level Smart Crop Recommendation Framework Using Machine Learning

Agriculture is the primary source of food, fuel, and raw materials and is vital to any country’s economy. Farmers, the backbone of agriculture, primarily rely on instinct to determine what crops to plant in any given season. They are comfortable …

A Human Word Association Based Model for Topic Detection in Social Networks

With the widespread use of social networks, detecting the topics discussed on these platforms has become a significant challenge. Current approaches primarily rely on frequent pattern mining or semantic relations, often neglecting the structure of …

Transmuted Shifted Lindley Distribution: Characterizations, Classical and Bayesian Estimation with Applications

In this article, we propose the quadratic rank transmutation map approach on shifted Lindley distribution to improve the existing distribution further. An additional skewness parameter $$\lambda$$ is incorporated to transmute the distribution.

Apple Leaf Disease Detection Using Transfer Learning

Automated detection of plant diseases is crucial as it simplifies the task of monitoring large farms and identifies diseases at their early stages to mitigate further plant degradation. Besides the decline in plant health, reduced production …

Representing a Model for the Anonymization of Big Data Stream Using In-Memory Processing

In light of the escalating privacy risks in the big data era, this paper introduces an innovative model for the anonymization of big data streams, leveraging in-memory processing within the Spark framework. The approach is founded on the principle …

A Review of Anonymization Algorithms and Methods in Big Data

In the era of big data, with the increase in volume and complexity of data, the main challenge is how to use big data while preserving the privacy of users. This study was conducted with the aim of finding a solution to this challenge. In this …

Analyzing Insurance Data with an Alpha Power Transformed Exponential Poisson Model

In this paper, we propose a new model by adding an additional parameter to the baseline distributions for modeling claim and risk data used in actuarial and financial studies. The new model is called alpha power transformed exponential Poisson …

Drinkers Voice Recognition Intelligent System: An Ensemble Stacking Machine Learning Approach

Alcohol's dehydrating effects can cause vocal cords to dry out, potentially causing temporary voice changes and increasing the risk of vocal strain or damage. Short-term changes in pitch, volume, and alcohol consumption can cause voice clarity …

A New Kernel Density Estimation-Based Entropic Isometric Feature Mapping for Unsupervised Metric Learning

Metric learning consists of designing adaptive distance functions that are well-suited to a specific dataset. Such tailored distance functions aim to deliver superior results compared to standard distance measures while performing machine learning …

Power Evaluation of Some Tests for Inverse Rayleigh Distribution

The Inverse Rayleigh distribution has many applications in the area of reliability studies. It is regarded as a model for a lifetime random variable. It is essential to develop an efficient goodness-of-fit test for this distribution. In this …

Visual Question Answer System for Skeletal Image Using Radiology Images in the Healthcare Domain Based on Visual and Textual Feature Extraction Techniques

The Medical Imaging Query Response System is among the most challenging concepts in the medical field. It requires a significant amount of effort to organize and comprehend the various representations of the human body. Additionally, the system …

Combining LASSO-type Methods with a Smooth Transition Random Forest

In this work, we propose a novel hybrid method for the estimation of regression models, which is based on a combination of LASSO-type methods and smooth transition (STR) random forests. Tree-based regression models are known for their flexibility …

A Comprehensive Survey of Image Generation Models Based on Deep Learning

In recent years, generative artificial intelligence has been developing rapidly. In the image domain, image generation models based on deep learning have made remarkable achievements. Early frameworks for image generation models were dominated by …

Classification of Privacy Preserved Medical Data with Fractional Tuna Sailfish Optimization Based Deep Residual Network in Cloud

Nowadays, with the growth of emerging technologies, increased attention has been paid to the classification of privacy-preserved medical data and development of various privacy-preserving models for the promotion of online medical pre-diagnosis …

A Two-Stage Analysis of Interaction Between Stock and Exchange Rate Markets: Evidence from Turkey

In this study, we use a novel approach to explore possible connections between foreign exchange and stock returns using Turkish financial data from 2005 to 2022. Our method involves a two-stage technique. The first stage begins by decomposing …

A Comprehensive Study and Research Perception towards Secured Data Sharing for Lung Cancer Detection with Blockchain Technology

Modernization in the healthcare industry is happening with the support of artificial intelligence and blockchain technologies. Collecting healthcare data is done through any Google survey from different governing bodies and data available on the …

Improving Dementia Prediction Using Ensemble Majority Voting Classifier

Early detection of dementia is a great concern for physicians, which is why they make use of multimodal data. The baseline visit data of the patients are mainly utilized for this task. Modern …

Real Estate Market Prediction Using Deep Learning Models

Real estate significantly contributes to the broader stock market and garners substantial attention from individual households to the overall country’s economy. Predicting real estate trends holds great importance for investors, policymakers, and …

Analysis of the HIV/AIDS Data Using Joint Modeling of Longitudinal (k,l)-Inflated Count and Time to Event Data in Clinical Trials

Generalized linear mixed effect models (GLMEMs) are widely applied for the analysis of correlated non-Gaussian data such as those found in longitudinal studies. On the other hand, the Cox (proportional hazards, PHs) and the accelerated failure …

Omega-Type Probability Models: A Parametric Modification of Probability Distributions

A mathematical approach to developing new distributions is reviewed. The method, which consists of integration and the concept of a normalizing constant, allows for primitive interjection of new parameter(s) in an existing distribution to form new …

A Survey of Artificial Intelligence for Industrial Detection

In the past decade, deep learning has greatly increased the complexity of industrial production intelligence by virtue of its powerful learning capability. At the same time, it has also brought security challenges to the field of industrial …

A Deep Convolutional Neural Network-Based Approach for Visual Search & Recommendation of Grocery Products

Search and recommendation are two essential features of any e-commerce website for finding and purchasing a specific product. Visual Search is a promising and quick method in comparison to a textual-based search method. Hence, the objective of …

Combining Nonlinear Features of EEG and MRI to Diagnose Alzheimer’s Disease

In this article, a new method for the diagnosis of Alzheimer’s disease in the mild stage is presented, based on combining the characteristics of EEG signals and MRI images. The brain signal is recorded in four modes of closed-eyes, open eye …

Evaluating the Performance of Machine Learning Algorithm for Classification of Safer Sexual Negotiation among Married Women in Bangladesh

Safer sexual practice is essential for improving women’s reproductive and sexual health outcomes. The goal of this study is to identify the contributing factors influencing safer sexual negotiations (SSN) through the application of machine …

Half Logistic Generalized Rayleigh Distribution for Modeling Hydrological Data

This article introduces a three-parameter extension of the Generalized Rayleigh distribution called the half-logistic Generalized Rayleigh distribution, which has the Generalized Rayleigh and Rayleigh distributions as submodels. The proposed model is …

An Improved Boosting Bald Eagle Search Algorithm with Improved African Vultures Optimization Algorithm for Data Clustering

Data clustering is one of the main issues in the optimization problem. It is the process of partitioning a set of items into several groups. Items within each group have the greatest similarity to one another and the least similarity to items in other groups.

One-Inflated Zero-Truncated Poisson Distribution: Statistical Properties and Real Life Applications

Agriculture, engineering, public health, sociology, psychology, and epidemiology are just a few of the numerous disciplines that find analysis and modeling of zero-truncated count data to be of paramount importance. Very recently, researchers have …

Optimal Strategy for Elevated Estimation of Population Mean in Stratified Random Sampling under Linear Cost Function

In this paper, we propose the exponential ratio-type estimator for the elevated estimation of population mean, implying one auxiliary variable in stratified random sampling using the conventional ratio and, Bahl and Tuteja exponential ratio-type …

Optimal Key Generation for Privacy Preservation in Big Data Applications Based on the Marine Predator Whale Optimization Algorithm

In the era of big data, preserving data privacy has become paramount due to the sheer volume and sensitivity of the information being processed. This research is dedicated to safeguarding data privacy through a novel data sanitization approach …

Semiparametric Regression Analysis of Panel Count Data with Multiple Modes of Recurrence

Panel count data refers to the information collected in studies focusing on recurrent events, where subjects are observed only at specific time points. If these study subjects are exposed to recurrent events of several types, we obtain panel count …

Applying BERT-Based NLP for Automated Resume Screening and Candidate Ranking

In this research, we introduce an innovative automated resume screening approach that leverages advanced Natural Language Processing (NLP) technology, specifically the Bidirectional Encoder Representations from Transformers (BERT) language model …

A Joint Cognitive Latent Variable Model for Binary Decision-making Tasks and Reaction Time Outcomes

Traditionally, in cognitive modeling for binary decision-making tasks, stochastic differential equations, particularly a family of diffusion decision models, are applied. These models suffer from difficulties in parameter estimation and …

A New Hyperbolic Tangent Family of Distributions: Properties and Applications

This paper introduces a new family of distributions called the hyperbolic tangent (HT) family. The cumulative distribution function of this model is defined using the standard hyperbolic tangent function. The fundamental properties of the …

Assessing the Risk of Bitcoin Futures Market: New Evidence

  • Open Access

The main objective of this paper is to forecast the realized volatility (RV) of the Bitcoin futures (BTCF) market. To serve our purpose, we propose an augmented heterogeneous autoregressive (HAR) model to consider the information on time-varying jumps …

An Innovative Technique for Generating Probability Distributions: A Study on Lomax Distribution with Applications in Medical and Engineering Fields

In this paper, we propose and investigate a novel approach for generating probability distributions. The novel method is known as the SMP transformation technique. By using the SMP transformation technique, we have developed a new model of the …

Parameter Estimation for Geometric Lévy Processes with Constant Volatility

In finance, various stochastic models have been used to describe price movements of financial instruments. Following the seminal work of Robert Merton, several jump-diffusion models have been proposed for option pricing and risk management. In …

On Modeling Bivariate Lifetime Data in the Presence of Inliers

Many items fail instantaneously or early in life-testing experiments, mainly in electronic parts and clinical trials, due to faulty construction, inferior quality, or non-response to treatments. We record the observed lifetime as zero or near …

Bayesian Estimation of Stress Strength Modeling Using MCMC Method Based on Outliers

In reliability literature and engineering applications, stress-strength (SS) models are particularly important. This paper aims to estimate the SS reliability for an inverse Weibull distribution having the same shape parameters but different scale …